Estimation of Missing Values in the Data Mining and Comparison of Imputation Methods

Shamsher Singh and Jagdish Prasad


Missing values, imputation methods, nonparametric, data mining.


Many existing industrial and research data sets contain missing values (MVs). They arise for various reasons, such as manual data entry procedures, equipment errors, and incorrect measurements. Such imperfections usually require a preprocessing stage in which the data are prepared and cleaned so that they are useful to, and sufficiently clear for, the knowledge extraction process. MVs make data analysis difficult and can pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, lead to misleading conclusions being drawn from a research study, and limit the generalizability of the research findings. The problems usually associated with MVs in data mining are (1) loss of efficiency; (2) complications in handling and analyzing the data; and (3) bias resulting from differences between missing and complete data. We focus on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used, so the user can select the most appropriate method for each situation encountered. In this paper, different methods for estimating missing values are discussed, and the imputation methods are compared using nonparametric methods.
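To make the idea of imputation as a preprocessing step concrete, the following is a minimal sketch of column-mean imputation, one of the simplest imputation methods. It is an illustration only, not necessarily one of the methods compared in the paper; the function name `impute_column_mean`, the use of `None` to mark a missing value, and the toy data are all assumptions made for this example.

```python
from statistics import mean


def impute_column_mean(rows):
    """Return a copy of rows in which each None is replaced by the
    mean of the observed values in the same column.

    Hypothetical helper for illustration; None marks a missing value.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values cannot be imputed this way.
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        col_means.append(mean(observed))
    # Build new rows so the input data set is left unchanged.
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy data set (made up for this sketch) with two missing entries.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
    [5.0, 30.0],
]
completed = impute_column_mean(data)
```

Because the completed data set is produced before any model is trained, it can be passed to any learning algorithm, which is the independence property described above.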

URL http://dspace.chitkara.edu.in/jspui/bitstream/1/112/1/12015_MJIS_Prasad.pdf
DOI 10.15415/mjis.2013.12015