Estimation of Missing Values in the Data Mining and Comparison of Imputation Methods
DOI:
https://doi.org/10.15415/mjis.2013.12015
Keywords:
Missing values, imputation methods, non parametric, data mining, Missing values in data mining
Abstract
Many existing industrial and research data sets contain missing values (MVs). They arise for various reasons, such as manual data entry procedures, equipment errors, and incorrect measurements. Such imperfections usually require a preprocessing stage in which the data are prepared and cleaned so that they are useful and sufficiently clear for the knowledge extraction process. MVs make data analysis difficult and can pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, lead to misleading conclusions being drawn from a research study, and limit the generalizability of the research findings. Three types of problem are usually associated with MVs in data mining: (1) loss of efficiency; (2) complications in handling and analyzing the data; and (3) bias resulting from differences between missing and complete data. We focus our attention on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used, so the user can select the most appropriate method for each situation. In this paper, different methods for estimating missing values are discussed, and the imputation methods are compared using non-parametric methods.
License
Articles in Mathematical Journal of Interdisciplinary Sciences (Math. J. Interdiscip. Sci.) by Chitkara University Publications are Open Access articles published under a Creative Commons Attribution (CC-BY) 4.0 International License. Based on a work at https://mjis.chitkara.edu.in. This license permits one to use, remix, tweak, and reproduce the work in any medium, even commercially, provided one gives credit for the original creation.
View the Legal Code of the above-mentioned license: https://creativecommons.org/licenses/by/4.0/legalcode
View the License Deed here: https://creativecommons.org/licenses/by/4.0/