Understanding the concept of outlier and its relevance to the assessment of data quality: Probabilistic background theory

Abstract

Interest in the study of outliers has grown in recent years, yet no general definition of an outlier exists so far. In the present paper we introduce a generic descriptive definition of an outlier. We observe that outlier problems have so far been treated statistically, without proper attention to their probability-theoretic background. To fill this gap, we attempt to establish a probabilistic background theory. Within this framework, large deviations are taken as the probability-theoretic model of outliers, and the interrelationship among the laws of large numbers, the central limit theorems and large deviations is clarified. These considerations are specialized to the case of a statistical sample, which is important for the assessment of data quality. Some methodological and historical aspects of geodesy, geophysics and astronomy are also mentioned. We show that the data analysis carried out by Kepler in discovering his famous elliptic law of planetary motion is relevant to the outlier problem; this methodologically interesting fact is a new result in the history of geosciences. We establish that the accuracy of the Chebyshev inequality increases as the deviation of the random variable from its expectation increases, and we point out the possibility of applying the Chebyshev inequality to the outlier problem.
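As a rough illustration of how the Chebyshev inequality might be applied to the outlier problem, the sketch below uses the standard form P(|X − μ| ≥ kσ) ≤ 1/k², valid for any distribution with finite variance. It is not the procedure of the paper: the function names, the parameter `alpha`, and the use of the sample mean and sample standard deviation in place of the true expectation and standard deviation are all assumptions made here for illustration.

```python
import statistics


def chebyshev_bound(k):
    """Chebyshev upper bound on P(|X - mu| >= k * sigma), capped at 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))


def flag_large_deviations(sample, alpha=0.05):
    """Return the values lying at least k standard deviations from the mean.

    k is chosen so that the Chebyshev bound 1/k**2 equals alpha.
    Assumption: the sample mean and the (population-style) sample standard
    deviation stand in for the unknown expectation and standard deviation.
    """
    mean = statistics.fmean(sample)
    sd = statistics.pstdev(sample)
    if sd == 0:
        return []  # all values identical, nothing deviates
    k = alpha ** -0.5
    return [x for x in sample if abs(x - mean) >= k * sd]


if __name__ == "__main__":
    data = [10.0] * 29 + [100.0]  # hypothetical measurements
    print(chebyshev_bound(2))
    print(flag_large_deviations(data))
```

With the default `alpha = 0.05` the threshold is about 4.47 standard deviations, so the rule is deliberately conservative: it makes no distributional assumption beyond finite variance, at the cost of flagging only very large deviations.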

Author information

Corresponding author

Correspondence to Davaadorjin Monhor.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Monhor, D., Takemoto, S. Understanding the concept of outlier and its relevance to the assessment of data quality: Probabilistic background theory. Earth Planet Sp 57, 1009–1018 (2005). https://doi.org/10.1186/BF03351881

  • DOI: https://doi.org/10.1186/BF03351881

Key words

  • Assessment of data quality
  • Berry-Esseen theorem
  • Chebyshev inequality
  • Large deviations
  • Outliers