diff options
| author | Mike Vink <mike1994vink@gmail.com> | 2021-05-02 17:46:19 +0200 |
|---|---|---|
| committer | Mike Vink <mike1994vink@gmail.com> | 2021-05-02 17:46:19 +0200 |
| commit | db35b461da9db8c1b4cc6a09245a9771fd956eeb (patch) | |
| tree | 858171bcd369bb9189b3857478ba7ba6b93d32bb /data_preparation_modeling/main.tex | |
| parent | bf1adece8aeb48e136085233d2f5ff2f9600eaf5 (diff) | |
changed symlinks
Diffstat (limited to 'data_preparation_modeling/main.tex')
| -rw-r--r-- | data_preparation_modeling/main.tex | 236 |
1 files changed, 236 insertions, 0 deletions
diff --git a/data_preparation_modeling/main.tex b/data_preparation_modeling/main.tex new file mode 100644 index 0000000..4b72ed4 --- /dev/null +++ b/data_preparation_modeling/main.tex @@ -0,0 +1,236 @@ +% hello +\input{../preamble.tex} + +\makeglossaries +\input{../bussiness_glossary.tex} +\input{../data_mining_glossary.tex} +\input{../acronyms.tex} + +\begin{document} +\MyTitle{Data Preparation and Modelling Report} +\tableofcontents +\printglossary[type=bus] +\printglossary[type=dm] +\printglossary[type=\acronymtype] + +\section{Data selection} + +The data that we use in this work is based on the data used in the simon +manuscript \citep{tomicSIMONAutomatedMachine2019}. This subset of the fluprint +database comprises data from 5 clinical studies, most importantly the +longitudal study SLVP015. It only uses the first visits of donors, as the +classification is the most complete in this dataset +\autoref{fig:dataRepeatvisits}. We will use this dataset to model the high +vaccine response with as predictors the in total 3285 features measured in +assays done by the clinical studies. The data will be prepared in the same way +as in the original work, using the mulset algorithm +\citep{tomicSIMONAutomatedMachine2019}. + +\begin{figure} + \includegraphics[width=\textwidth]{data_selection} + \caption{caption}\label{fig:dataRepeatvisits} +\end{figure} + +In addition to repeating a similar procedure as in +\cite{tomicSIMONAutomatedMachine2019}, we will compare the values of features +selected by the models trained on the first visit data to those of second visit +data. Initially the plan was to train new models on the second visit data, +however the classes are extremely unbalanced in the data of repeat visits +\autoref{fig:dataRepeatvisits}. For example in the first visit there are 65 +high responders and 130 low responders, in the second visit there is only data +available for 6 high responders and 44 low responders. Therefore we will only +train model on first visit data, and use the knowledge gained to explore second +visit data. + +\section{Clean data} + +The goal is to obtain one or more tables from the simon data suitable to train +models, thus we are looking to change the data from the long format as in the +database in a wide format where each column is a features measured in an assay. +When attempting to do this it was discovered that some assay data contained +duplicate readouts \autoref{tbl:exampleDuplicate}. Since the values were all +similar it was decided to aggregate the values to unique features using the +mean value. + +\begin{table} +\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed +\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed +\begin{tabular}{rrrrrlrllrrl} +\toprule{} +donor\_id & study & age & outcome & year & type & hai\_response & name & data\_name & assay & data & dup\\ +\midrule{} +285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.8 & TRUE\\ +285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.1 & TRUE\\ +285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.3 & TRUE\\ +285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.0 & TRUE\\ +\bottomrule{} +\end{tabular} + \caption{}\label{tbl:exampleDuplicate} +\end{table} + +The obtained first visit wide data had dimensions of 195x3284 (donors by +measured features), with 596736 missing value cells (93\% sparsity). The second +visit data had dimensions of 50x3251, with a lower sparsity (58\%) since the donor +population is smaller and they are from the same clinical study. + +\subsection{Mulset alogrithm} + +Following the procedure in \cite{tomicSIMONAutomatedMachine2019} we deal with +this data sparsity by applying the mulset algorithm created by +\cite{tomicSIMONAutomatedMachine2019}. This is necessary since data is missing +in every column and the lack of prior knowledge doesn't allow for imputation of +missing values, precluding conventional measures of missing data cleaning. + +\begin{figure}[ht] + \includegraphics[width=\textwidth]{F2.large} + \caption{\textbf{taken from original work}}\label{fig:mulsetAlg} +\end{figure} + +The mulset algorithm uses the intersection of features sets of donors to +calculate pairwise shared feature sets. For every shared feature set it then +retrieves all donors that have values for these features \autoref{fig:mulsetAlg}. + +\begin{lstlisting}[caption=Applying the mulset algorithm, label={lst:mulsetStep}] +% Step 1: generate re-sampled intersection datasets suitable for analysis +for {each subject in data} do: + Calculate intersection between subject and all other subjects using mulset algorithm + Skip sets that have less than 5 features and less than 15 donors in common +end for; +# Save all shared intersections to corresponding datasets +\end{lstlisting} + +Applying the algorithm resulted in 47 different datasets without missing values +which contained a subset of donors and features, and the vaccine response +classification. Further, the same criteria as in the original work were +applied, the number of datasets was filtered down to 36 by excluding datasets +that had less than 15 donors or less than 5 features \autoref{lst:mulsetStep}. + +To prepare the datasets for modelling they were partitioned into trianing and +test sets consisting of 75\% and 25\% of the data respectively. To ensure that +out of sample point estimates were not based on nonsense, datasets with a test +set containing less than 10 donors/rows were discarded. The resulting number of +cleaned datasets for modelling purposes was 20 \autoref{tbl:mulsetDatasets}. A +significant number of datasets contained more predictors than samples, however +we consider this as an inevitible phenomenon and not an absolute obstacle since +the purpose of the models is not to discriminate vaccine responders with the +highest accuracy, but to select features from that correlate with a vaccine +response from the great number of features. + +\begin{table} + \begin{tabularx}{\textwidth}{XXXXX} +\toprule{} +dataset & Rows x Cols & total (low / high (low \%)) & train (low / high) & test (low / high)\\ +\midrule{} +1 & 61 x 78 & 43 / 18 ( 0.7 ) & 33 / 14 & 10 / 4\\ +2 & 105 x 101 & 62 / 43 ( 0.59 ) & 47 / 33 & 15 / 10\\ +3 & 140 x 50 & 94 / 46 ( 0.67 ) & 71 / 35 & 23 / 11\\ +4 & 63 x 269 & 38 / 25 ( 0.6 ) & 29 / 19 & 9 / 6\\ +5 & 62 x 293 & 38 / 24 ( 0.61 ) & 29 / 18 & 9 / 6\\ +\addlinespace +6 & 68 x 237 & 42 / 26 ( 0.62 ) & 32 / 20 & 10 / 6\\ +7 & 67 x 44 & 47 / 20 ( 0.7 ) & 36 / 15 & 11 / 5\\ +8 & 111 x 93 & 66 / 45 ( 0.59 ) & 50 / 34 & 16 / 11\\ +9 & 73 x 54 & 58 / 15 ( 0.79 ) & 44 / 12 & 14 / 3\\ +10 & 40 x 105 & 28 / 12 ( 0.7 ) & 21 / 9 & 7 / 3\\ +\addlinespace +11 & 46 x 97 & 32 / 14 ( 0.7 ) & 24 / 11 & 8 / 3\\ +12 & 137 x 53 & 78 / 59 ( 0.57 ) & 59 / 45 & 19 / 14\\ +13 & 48 x 42 & 35 / 13 ( 0.73 ) & 27 / 10 & 8 / 3\\ +14 & 91 x 38 & 62 / 29 ( 0.68 ) & 47 / 22 & 15 / 7\\ +15 & 42 x 37 & 36 / 6 ( 0.86 ) & 27 / 5 & 9 / 1\\ +\addlinespace +16 & 92 x 26 & 62 / 30 ( 0.67 ) & 47 / 23 & 15 / 7\\ +17 & 88 x 6 & 68 / 20 ( 0.77 ) & 51 / 15 & 17 / 5\\ +18 & 82 x 87 & 56 / 26 ( 0.68 ) & 42 / 20 & 14 / 6\\ +19 & 151 x 51 & 92 / 59 ( 0.61 ) & 69 / 45 & 23 / 14\\ +20 & 83 x 75 & 56 / 27 ( 0.67 ) & 42 / 21 & 14 / 6\\ +\bottomrule{} +\end{tabularx} + \caption{caption}\label{tbl:mulsetDatasets} +\end{table} + + +\section{Modelling} + +The modelling techniques of choice were to be resistent to the "too many +features" problem and suitable for selecting features in an embedded based +approach \citep{hiraReviewFeatureSelection2015}. Technically, the approach used +here is a wrapper approach since we are using the mulset algorithm to generate +different subsets of features and training machine learning models on those +features. However, in this work we train three models that have an embedded +mechanism for obtaining the most important predictors of vaccine response. This +is done for every feature set, and manually we chose the best and most +interesting trained models and their obtained features. The end goal was to +identify important features and investigating the change in those features for the +second visit/influenza season of donors. + +\begin{table} +\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed +\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed +\begin{tabular}{llrrrrrrrrrrrrr} +\toprule{} +dataset &model & SENS & SPEC & MCC & PREC & NPV & FPR & F1 & TP & FP & TN & FN & train AUC & test AUC\\ +\midrule{} +14 & rrlda & 0.091 & 0.915 & 0.010 & 0.333 & 0.683 & 0.085 & 0.143 & 2 & 4 & 43 & 20 & 0.50 & 0.62\\ + & nb & 0.636 & 0.702 & 0.321 & 0.500 & 0.805 & 0.298 & 0.560 & 14 & 14 & 33 & 8 & 0.67 & 0.59\\ + & rf & 0.364 & 0.851 & 0.243 & 0.533 & 0.741 & 0.149 & 0.432 & 8 & 7 & 40 & 14 & 0.65 & 0.61\\ + & reglog & 0.227 & 0.766 & -0.007 & 0.312 & 0.679 & 0.234 & 0.263 & 5 & 11 & 36 & 17 & 0.49 & 0.48\\ +\addlinespace +16 & rrlda & 0.000 & 1.000 & NaN & NaN & 0.671 & 0.000 & 0.000 & 0 & 0 & 47 & 23 & 0.48 & 0.61\\ + & nb & 0.652 & 0.617 & 0.253 & 0.455 & 0.784 & 0.383 & 0.536 & 15 & 18 & 29 & 8 & 0.68 & 0.55\\ + & rf & 0.261 & 0.851 & 0.135 & 0.462 & 0.702 & 0.149 & 0.333 & 6 & 7 & 40 & 17 & 0.65 & 0.69\\ + & reglog & 0.391 & 0.723 & 0.116 & 0.409 & 0.708 & 0.277 & 0.400 & 9 & 13 & 34 & 14 & 0.64 & 0.47\\ +\addlinespace +19 & rrlda & 0.533 & 0.391 & -0.075 & 0.364 & 0.562 & 0.609 & 0.432 & 24 & 42 & 27 & 21 & 0.47 & 0.41\\ + & nb & 0.489 & 0.565 & 0.053 & 0.423 & 0.629 & 0.435 & 0.454 & 22 & 30 & 39 & 23 & 0.54 & 0.48\\ + & rf & 0.244 & 0.739 & -0.018 & 0.379 & 0.600 & 0.261 & 0.297 & 11 & 18 & 51 & 34 & 0.54 & 0.52\\ + & reglog & 0.267 & 0.754 & 0.023 & 0.414 & 0.612 & 0.246 & 0.324 & 12 & 17 & 52 & 33 & 0.51 & 0.32\\ +\bottomrule{} +\end{tabular} + +\end{table} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{dataset1_nb_feature_exploration} + \caption{dataset1-nb-feature-exploration} + \label{fig:dataset1-nb-feature-exploration} +\end{figure} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{dataset2_nb_feature_exploration} + \caption{dataset2-nb-feature-exploration} + \label{fig:dataset2-nb-feature-exploration} +\end{figure} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{second_visit_change1} + \caption{second-visit-change1} + \label{fig:second-visit-change1} +\end{figure} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{second_visit_change2} + \caption{second-visit-change1} + \label{fig:second-visit-change1} +\end{figure} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{cor_dataset1} + \caption{cor-dataset1} + \label{fig:cor-dataset1} +\end{figure} + +\begin{figure}[htpb] + \centering + \includegraphics[width=\textwidth]{cor_dataset2} + \caption{cor-dataset2} + \label{fig:cor-dataset2} +\end{figure} + + +\end{document} |
