Understanding the concept of outlier and its relevance to the assessment of data quality: Probabilistic background theory

Abstract

Interest in the study of outliers has grown in recent years, yet no general definition of an outlier exists so far. In this paper we introduce a generic descriptive definition of an outlier. We observe that outlier problems have so far been treated statistically, without proper attention to their probability-theoretic background. To close this gap, we attempt to establish a probabilistic background theory. Within this framework, large deviations are taken as the probability-theoretic model of outliers, and the interrelationship of the laws of large numbers, the central limit theorems and large deviations is clarified. These considerations are specialized to the case of a statistical sample, which is important for the assessment of data quality. Some methodological and historical aspects of geodesy, geophysics and astronomy are also mentioned. We show that the data analysis Kepler carried out in discovering his famous elliptic law of planetary motion is relevant to the outlier problem. This methodologically interesting fact is a new result in the history of geosciences. We establish that the accuracy of the Chebyshev inequality increases as the deviation of the random variable from its expectation increases. The possibility of applying the Chebyshev inequality to the outlier problem is pointed out.
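
The remark on the Chebyshev inequality can be illustrated numerically. The following is a minimal sketch, not the paper's own analysis: it assumes a normally distributed variable, and it measures accuracy as the absolute gap between the Chebyshev bound 1/k^2 and the exact two-sided tail probability at k standard deviations from the expectation. The function names are hypothetical and chosen only for this illustration.

```python
import math


def chebyshev_bound(k: float) -> float:
    """Chebyshev upper bound on P(|X - mu| >= k*sigma), capped at 1."""
    return min(1.0, 1.0 / (k * k))


def normal_two_sided_tail(k: float) -> float:
    """Exact P(|X - mu| >= k*sigma) when X is normal (assumption of this sketch)."""
    return math.erfc(k / math.sqrt(2.0))


def absolute_gaps(ks):
    """Absolute gap between the Chebyshev bound and the exact normal tail."""
    return [chebyshev_bound(k) - normal_two_sided_tail(k) for k in ks]


if __name__ == "__main__":
    ks = [1, 2, 3, 4, 5]
    for k, gap in zip(ks, absolute_gaps(ks)):
        print(f"k={k}: bound={chebyshev_bound(k):.4f}, "
              f"exact={normal_two_sided_tail(k):.6f}, gap={gap:.4f}")
```

Under these assumptions the absolute gap shrinks as k grows, while the ratio of the bound to the exact tail grows, so the sense in which accuracy is measured matters when the inequality is used to flag outliers.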

Author information

Correspondence to Davaadorjin Monhor.

About this article

Key words

  • Assessment of data quality
  • Berry-Esseen theorem
  • Chebyshev inequality
  • Large deviations
  • Outliers