10
UMLClass 30 50 470 60 halign=left Data mining goal: Feature selection and classification project plan
UMLSpecialState 30 170 20 20 type=initial
Relation 40 170 90 30 lt=<- 70.0;10.0;10.0;10.0
UMLState 110 150 200 60 Obtain interesting dataset from the SQL project plan
UMLState 370 140 140 90 Prepare data for modelling with the mulset R package
UMLSpecialState 690 130 160 100 type=decision All datasets used for modelling?
Relation 300 170 90 30 lt=<- 70.0;10.0;10.0;10.0
UMLState 560 140 100 90 Multiple datasets are generated
Relation 500 170 80 30 lt=<- 60.0;10.0;10.0;10.0
Relation 650 170 60 30 lt=<- 40.0;10.0;10.0;10.0
Relation 100 220 690 140 lt=<- [NO] 50.0;120.0;10.0;120.0;10.0;30.0;670.0;30.0;670.0;10.0
UMLState 540 290 150 110 Train and validate a selection of classifiers on the training set, using cross-validation
UMLState 720 300 150 80 Filter out models with poor AUROC, specificity, or sensitivity
Relation 680 330 60 30 lt=<- 40.0;10.0;10.0;10.0
UMLState 150 300 150 70 Split data into training and test sets
Relation 290 320 70 30 lt=<- 50.0;10.0;10.0;10.0
UMLState 340 310 150 60 Set random seed
Relation 480 330 80 30 lt=<- 60.0;10.0;10.0;10.0
UMLState 900 310 180 60 Compare models in terms of training and test AUROC
Relation 860 330 60 30 lt=<- 40.0;10.0;10.0;10.0
Relation 760 220 250 110 lt=<- 10.0;10.0;230.0;10.0;230.0;90.0
Relation 840 80 90 120 lt=<- [YES] 70.0;10.0;10.0;100.0
UMLState 910 40 150 90 Compute variable importance for models using the caret package
Relation 980 120 80 70 lt=<- 60.0;50.0;10.0;50.0;10.0;10.0
UMLState 1040 140 140 70 Perform correlation analysis and visualise results
Relation 1100 200 80 80 lt=<- 60.0;60.0;10.0;60.0;10.0;10.0
UMLSpecialState 1160 250 20 20 type=final
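
The split, filter, and compare states in the diagram could be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual R/caret code. The function names, the 0.5 score cutoff, and the thresholds (0.7 for AUROC, 0.6 for sensitivity and specificity) are all hypothetical, since the diagram does not say what counts as a "bad" model. The sketch also seeds the random generator before the split so that the split itself is reproducible, which is an assumption about intent.

```python
import random

def split_train_test(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5).
    Assumes both classes are present in labels."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(labels, scores, cutoff=0.5):
    """Return AUROC plus sensitivity and specificity at a hypothetical score cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {"auroc": auroc(labels, scores),
            "sensitivity": tp / n_pos,
            "specificity": tn / n_neg}

def keep_model(m, min_auroc=0.7, min_sens=0.6, min_spec=0.6):
    """Hypothetical filter: drop a model if any metric falls below its threshold."""
    return (m["auroc"] >= min_auroc
            and m["sensitivity"] >= min_sens
            and m["specificity"] >= min_spec)

rng = random.Random(42)  # fixed seed, so the split is reproducible
train, test = split_train_test(list(range(10)), 0.2, rng)

# Toy validation labels and per-model scores (made up for illustration only).
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    "model_a": [0.9, 0.8, 0.4, 0.6, 0.2, 0.1],
    "model_b": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
metrics = {name: evaluate(labels, scores) for name, scores in candidates.items()}
kept = {name: m for name, m in metrics.items() if keep_model(m)}
```

With these toy scores only `model_a` passes the filter. `model_b` gives every sample the same score, so its AUROC is 0.5 and its specificity at the cutoff is 0. In the workflow shown in the diagram, the surviving models would then be compared on training and test AUROC for each generated dataset.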