Feature selection and classification model construction on type 2 diabetic patients’ data.

Yue Huang, PJ McCullagh, Norman Black, Roy Harper

Research output: Contribution to journal (Article)

88 Citations (Scopus)

Abstract

Objective: Diabetes affects between 2% and 4% of the global population (up to 10% in the over-65 age group), and its avoidance and effective treatment are undoubtedly crucial public health and health economics issues in the 21st century. The aim of this research was to identify significant factors influencing diabetes control by applying feature selection to a working patient management system to assist with ranking, classification and knowledge discovery. The classification models can be used to identify individuals in the population with poor diabetes control status based on physiological and examination factors.

Methods: The diabetic patients’ information was collected by Ulster Community and Hospitals Trust (UCHT) from 2000 to 2004 as part of clinical management. In order to discover key predictors and latent knowledge, data mining techniques were applied. To improve computational efficiency, a feature selection technique, feature selection via supervised model construction (FSSMC), an optimisation of ReliefF, was used to rank the important attributes affecting diabetic control. After selecting suitable features, three complementary classification techniques (Naïve Bayes, IB1 and C4.5) were applied to the data to predict how well the patients’ condition was controlled.

Results: FSSMC identified patients’ ‘age’, ‘diagnosis duration’, the need for ‘insulin treatment’, ‘random blood glucose’ measurement and ‘diet treatment’ as the most important factors influencing blood glucose control. Using the reduced features, a best predictive accuracy of 95% and a sensitivity of 98% were achieved. The influence of factors such as the ‘type of care’ delivered, the use of ‘home monitoring’, and the importance of ‘smoking’ on outcome can contribute to domain knowledge in diabetes control.

Conclusion: In the care of patients with diabetes, the more important factors identified (patients’ ‘age’, ‘diagnosis duration’ and ‘family history’) are beyond the control of physicians. Treatment methods such as ‘insulin’, ‘diet’ and ‘tablets’ (a variety of oral medicines) may be controlled. However, lifestyle indicators such as ‘body mass index’ and ‘smoking status’ are also important and may be controlled by the patient. This further underlines the need for public health education to aid awareness and prevention. More subtle data interactions need to be better understood, and data mining can contribute to the clinical evidence base. The research confirms and, to a lesser extent, challenges current thinking. Whilst fully appreciating the requirement for clinical verification and interpretation, this work supports the use of data mining as an exploratory tool, particularly as the domain is suffering from a data explosion due to enhanced monitoring and the (potential) storage of this data in the electronic health record. FSSMC has proved a useful feature estimator for large data sets, where processing efficiency is an important factor.
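
The ranking step described in the abstract can be illustrated with a small sketch. The abstract names FSSMC as an optimisation of ReliefF but gives no details of it, so the code below is not FSSMC: it implements only the basic Relief weight update (the simpler precursor of ReliefF) for two classes and numeric features scaled to [0, 1]. The function name, the toy data and the two feature columns are hypothetical and are not taken from the UCHT data.

```python
def relief_weights(X, y):
    """Basic Relief feature weights (illustrative only, not FSSMC or ReliefF).

    Assumes two classes, at least two instances per class, and numeric
    features already scaled to the range [0, 1].
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        # Manhattan distance between two instances
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: distance(X[i], X[j]))    # nearest same-class instance
        miss = min(other, key=lambda j: distance(X[i], X[j]))  # nearest other-class instance
        for f in range(d):
            # Reward features that separate the classes, penalise features
            # that vary between neighbours of the same class.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Hypothetical toy data: column 0 tracks the class label, column 1 is noise.
X = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5],
     [0.8, 0.3], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 0, 1, 1, 1]

weights = relief_weights(X, y)
# Feature indices ordered from most to least relevant
ranking = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
print(ranking, weights)
```

On this toy data the informative column receives a positive weight and the noise column a negative one, so column 0 is ranked first. In a workflow like the one the abstract describes, a classifier would then be trained on the top-ranked features only.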
Language: English
Pages: 251-262
Journal: Artif. Intell. Med.
Volume: 41
Issue number: 3
DOI: 10.1016/j.artmed.2007.07.002
Publication status: Published - 2007

Keywords

  • Type 2 diabetes
  • Blood glucose
  • Data mining
  • Classification
  • Feature selection

Cite this

Huang, Yue; McCullagh, PJ; Black, Norman; Harper, Roy. / Feature selection and classification model construction on type 2 diabetic patients’ data. In: Artif. Intell. Med. 2007; Vol. 41, No. 3. pp. 251-262.
@article{d975eba6679f4190a75c584b5efa458c,
title = "Feature selection and classification model construction on type 2 diabetic patients’ data.",
keywords = "Type 2 diabetes, Blood glucose, Data mining, Classification, Feature selection",
author = "Yue Huang and PJ McCullagh and Norman Black and Roy Harper",
year = "2007",
doi = "10.1016/j.artmed.2007.07.002",
language = "English",
volume = "41",
pages = "251--262",
journal = "Artificial Intelligence in Medicine",
issn = "0933-3657",
publisher = "Elsevier",
number = "3",

}

TY - JOUR
T1 - Feature selection and classification model construction on type 2 diabetic patients’ data.
AU - Huang, Yue
AU - McCullagh, PJ
AU - Black, Norman
AU - Harper, Roy
PY - 2007
Y1 - 2007
AB - Summary. Objective: Diabetes affects between 2% and 4% of the global population (up to 10% in the over 65 age group), and its avoidance and effective treatment are undoubtedly crucial public health and health economics issues in the 21st century. The aim of this research was to identify significant factors influencing diabetes control, by applying feature selection to a working patient management system to assist with ranking, classification and knowledge discovery. The classification models can be used to determine individuals in the population with poor diabetes control status based on physiological and examination factors. Methods: The diabetic patients’ information was collected by Ulster Community and Hospitals Trust (UCHT) from 2000 to 2004 as part of clinical management. In order to discover key predictors and latent knowledge, data mining techniques were applied. To improve computational efficiency, a feature selection technique, feature selection via supervised model construction (FSSMC), an optimisation of ReliefF, was used to rank the important attributes affecting diabetic control. After selecting suitable features, three complementary classification techniques (Naïve Bayes, IB1 and C4.5) were applied to the data to predict how well the patients’ condition was controlled. Results: FSSMC identified patients’ ‘age’, ‘diagnosis duration’, the need for ‘insulin treatment’, ‘random blood glucose’ measurement and ‘diet treatment’ as the most important factors influencing blood glucose control. Using the reduced features, a best predictive accuracy of 95% and sensitivity of 98% were achieved. The influence of factors, such as ‘type of care’ delivered, the use of ‘home monitoring’, and the importance of ‘smoking’ on outcome can contribute to domain knowledge in diabetes control. Conclusion: In the care of patients with diabetes, the more important factors identified: patients’ ‘age’, ‘diagnosis duration’ and ‘family history’, are beyond the control of physicians.
Treatment methods such as ‘insulin’, ‘diet’ and ‘tablets’ (a variety of oral medicines) may be controlled. However, lifestyle indicators such as ‘body mass index’ and ‘smoking status’ are also important and may be controlled by the patient. This further underlines the need for public health education to aid awareness and prevention. More subtle data interactions need to be better understood, and data mining can contribute to the clinical evidence base. The research confirms and, to a lesser extent, challenges current thinking. Whilst fully appreciating the requirement for clinical verification and interpretation, this work supports the use of data mining as an exploratory tool, particularly as the domain is suffering from a data explosion due to enhanced monitoring and the (potential) storage of this data in the electronic health record. FSSMC has proved a useful feature estimator for large data sets, where processing efficiency is an important factor.

KW - Type 2 diabetes

KW - Blood glucose

KW - Data mining

KW - Classification

KW - Feature selection

U2 - 10.1016/j.artmed.2007.07.002

DO - 10.1016/j.artmed.2007.07.002

M3 - Article

VL - 41

SP - 251

EP - 262

JO - Artificial Intelligence in Medicine

T2 - Artificial Intelligence in Medicine

JF - Artificial Intelligence in Medicine

SN - 0933-3657

IS - 3

ER -