changed symlinks

author: Mike Vink <mike1994vink@gmail.com> 2021-05-02 17:46:19 +0200
committer: Mike Vink <mike1994vink@gmail.com> 2021-05-02 17:46:19 +0200
commit: db35b461da9db8c1b4cc6a09245a9771fd956eeb (patch)
tree: 858171bcd369bb9189b3857478ba7ba6b93d32bb /data_preparation_modeling/main.tex
parent: bf1adece8aeb48e136085233d2f5ff2f9600eaf5 (diff)
1 files changed, 236 insertions, 0 deletions
diff --git a/data_preparation_modeling/main.tex b/data_preparation_modeling/main.tex
new file mode 100644
index 0000000..4b72ed4
--- /dev/null
+++ b/data_preparation_modeling/main.tex
@@ -0,0 +1,236 @@
+% hello
+\input{../preamble.tex}
+
+\makeglossaries
+\input{../bussiness_glossary.tex}
+\input{../data_mining_glossary.tex}
+\input{../acronyms.tex}
+
+\begin{document}
+\MyTitle{Data Preparation and Modelling Report}
+\tableofcontents
+\printglossary[type=bus]
+\printglossary[type=dm]
+\printglossary[type=\acronymtype]
+
+\section{Data selection}
+
+The data that we use in this work is based on the data used in the simon
+manuscript \citep{tomicSIMONAutomatedMachine2019}. This subset of the fluprint
+database comprises data from 5 clinical studies, most importantly the
+longitudal study SLVP015. It only uses the first visits of donors, as the
+classification is the most complete in this dataset
+\autoref{fig:dataRepeatvisits}. We will use this dataset to model the high
+vaccine response with as predictors the in total 3285 features measured in
+assays done by the clinical studies. The data will be prepared in the same way
+as in the original work, using the mulset algorithm
+\citep{tomicSIMONAutomatedMachine2019}.
+
+\begin{figure}
+    \includegraphics[width=\textwidth]{data_selection}
+    \caption{caption}\label{fig:dataRepeatvisits}
+\end{figure}
+
+In addition to repeating a similar procedure as in
+\cite{tomicSIMONAutomatedMachine2019}, we will compare the values of features
+selected by the models trained on the first visit data to those of second visit
+data. Initially the plan was to train new models on the second visit data,
+however the classes are extremely unbalanced in the data of repeat visits
+\autoref{fig:dataRepeatvisits}. For example in the first visit there are 65
+high responders and 130 low responders, in the second visit there is only data
+available for 6 high responders and 44 low responders. Therefore we will only
+train model on first visit data, and use the knowledge gained to explore second
+visit data.
+
+\section{Clean data}
+
+The goal is to obtain one or more tables from the simon data suitable to train
+models, thus we are looking to change the data from the long format as in the
+database in a wide format where each column is a features measured in an assay.
+When attempting to do this it was discovered that some assay data contained
+duplicate readouts \autoref{tbl:exampleDuplicate}. Since the values were all
+similar it was decided to aggregate the values to unique features using the
+mean value.
+
+\begin{table}
+\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed
+\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed
+\begin{tabular}{rrrrrlrllrrl}
+\toprule{}
+donor\_id & study & age & outcome & year & type & hai\_response & name & data\_name & assay & data & dup\\
+\midrule{}
+285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.8 & TRUE\\
+285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.1 & TRUE\\
+285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.3 & TRUE\\
+285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.0 & TRUE\\
+\bottomrule{}
+\end{tabular}
+    \caption{}\label{tbl:exampleDuplicate}
+\end{table}
+
+The obtained first visit wide data had dimensions of 195x3284 (donors by
+measured features), with 596736 missing value cells (93\% sparsity). The second
+visit data had dimensions of 50x3251, with a lower sparsity (58\%) since the donor
+population is smaller and they are from the same clinical study.
+
+\subsection{Mulset alogrithm}
+
+Following the procedure in \cite{tomicSIMONAutomatedMachine2019} we deal with
+this data sparsity by applying the mulset algorithm created by
+\cite{tomicSIMONAutomatedMachine2019}. This is necessary since data is missing
+in every column and the lack of prior knowledge doesn't allow for imputation of
+missing values, precluding conventional measures of missing data cleaning.
+
+\begin{figure}[ht]
+    \includegraphics[width=\textwidth]{F2.large}
+    \caption{\textbf{taken from original work}}\label{fig:mulsetAlg}
+\end{figure}
+
+The mulset algorithm uses the intersection of features sets of donors to
+calculate pairwise shared feature sets. For every shared feature set it then
+retrieves all donors that have values for these features \autoref{fig:mulsetAlg}.
+
+\begin{lstlisting}[caption=Applying the mulset algorithm, label={lst:mulsetStep}]
+% Step 1: generate re-sampled intersection datasets suitable for analysis
+for {each subject in data} do:
+	Calculate intersection between subject and all other subjects using mulset algorithm
+	Skip sets that have less than 5 features and less than 15 donors in common
+end for;
+# Save all shared intersections to corresponding datasets
+\end{lstlisting}
+
+Applying the algorithm resulted in 47 different datasets without missing values
+which contained a subset of donors and features, and the vaccine response
+classification. Further, the same criteria as in the original work were
+applied, the number of datasets was filtered down to 36 by excluding datasets
+that had less than 15 donors or less than 5 features \autoref{lst:mulsetStep}.
+
+To prepare the datasets for modelling they were partitioned into trianing and
+test sets consisting of 75\% and 25\% of the data respectively. To ensure that
+out of sample point estimates were not based on nonsense, datasets with a test
+set containing less than 10 donors/rows were discarded. The resulting number of
+cleaned datasets for modelling purposes was 20 \autoref{tbl:mulsetDatasets}. A
+significant number of datasets contained more predictors than samples, however
+we consider this as an inevitible phenomenon and not an absolute obstacle since
+the purpose of the models is not to discriminate vaccine responders with the
+highest accuracy, but to select features from that correlate with a vaccine
+response from the great number of features.
+
+\begin{table}
+    \begin{tabularx}{\textwidth}{XXXXX}
+\toprule{}
+dataset & Rows x Cols & total (low / high (low \%)) & train (low / high) & test (low / high)\\
+\midrule{}
+1 & 61 x 78 & 43 / 18 ( 0.7 ) & 33 / 14 & 10 / 4\\
+2 & 105 x 101 & 62 / 43 ( 0.59 ) & 47 / 33 & 15 / 10\\
+3 & 140 x 50 & 94 / 46 ( 0.67 ) & 71 / 35 & 23 / 11\\
+4 & 63 x 269 & 38 / 25 ( 0.6 ) & 29 / 19 & 9 / 6\\
+5 & 62 x 293 & 38 / 24 ( 0.61 ) & 29 / 18 & 9 / 6\\
+\addlinespace
+6 & 68 x 237 & 42 / 26 ( 0.62 ) & 32 / 20 & 10 / 6\\
+7 & 67 x 44 & 47 / 20 ( 0.7 ) & 36 / 15 & 11 / 5\\
+8 & 111 x 93 & 66 / 45 ( 0.59 ) & 50 / 34 & 16 / 11\\
+9 & 73 x 54 & 58 / 15 ( 0.79 ) & 44 / 12 & 14 / 3\\
+10 & 40 x 105 & 28 / 12 ( 0.7 ) & 21 / 9 & 7 / 3\\
+\addlinespace
+11 & 46 x 97 & 32 / 14 ( 0.7 ) & 24 / 11 & 8 / 3\\
+12 & 137 x 53 & 78 / 59 ( 0.57 ) & 59 / 45 & 19 / 14\\
+13 & 48 x 42 & 35 / 13 ( 0.73 ) & 27 / 10 & 8 / 3\\
+14 & 91 x 38 & 62 / 29 ( 0.68 ) & 47 / 22 & 15 / 7\\
+15 & 42 x 37 & 36 / 6 ( 0.86 ) & 27 / 5 & 9 / 1\\
+\addlinespace
+16 & 92 x 26 & 62 / 30 ( 0.67 ) & 47 / 23 & 15 / 7\\
+17 & 88 x 6 & 68 / 20 ( 0.77 ) & 51 / 15 & 17 / 5\\
+18 & 82 x 87 & 56 / 26 ( 0.68 ) & 42 / 20 & 14 / 6\\
+19 & 151 x 51 & 92 / 59 ( 0.61 ) & 69 / 45 & 23 / 14\\
+20 & 83 x 75 & 56 / 27 ( 0.67 ) & 42 / 21 & 14 / 6\\
+\bottomrule{}
+\end{tabularx}
+    \caption{caption}\label{tbl:mulsetDatasets}
+\end{table}
+
+
+\section{Modelling}
+
+The modelling techniques of choice were to be resistent to the "too many
+features" problem and suitable for selecting features in an embedded based
+approach \citep{hiraReviewFeatureSelection2015}. Technically, the approach used
+here is a wrapper approach since we are using the mulset algorithm to generate
+different subsets of features and training machine learning models on those
+features. However, in this work we train three models that have an embedded
+mechanism for obtaining the most important predictors of vaccine response. This
+is done for every feature set, and manually we chose the best and most
+interesting trained models and their obtained features. The end goal was to
+identify important features and investigating the change in those features for the
+second visit/influenza season of donors.
+
+\begin{table}
+\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed
+\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed
+\begin{tabular}{llrrrrrrrrrrrrr}
+\toprule{}
+dataset &model & SENS & SPEC & MCC & PREC & NPV & FPR & F1 & TP & FP & TN & FN & train AUC & test AUC\\
+\midrule{}
+14    & rrlda & 0.091 & 0.915 & 0.010 & 0.333 & 0.683 & 0.085 & 0.143 & 2 & 4 & 43 & 20 & 0.50 & 0.62\\
+    & nb & 0.636 & 0.702 & 0.321 & 0.500 & 0.805 & 0.298 & 0.560 & 14 & 14 & 33 & 8 & 0.67 & 0.59\\
+    & rf & 0.364 & 0.851 & 0.243 & 0.533 & 0.741 & 0.149 & 0.432 & 8 & 7 & 40 & 14 & 0.65 & 0.61\\
+    & reglog & 0.227 & 0.766 & -0.007 & 0.312 & 0.679 & 0.234 & 0.263 & 5 & 11 & 36 & 17 & 0.49 & 0.48\\
+\addlinespace
+16    & rrlda & 0.000 & 1.000 & NaN & NaN & 0.671 & 0.000 & 0.000 & 0 & 0 & 47 & 23 & 0.48 & 0.61\\
+    & nb & 0.652 & 0.617 & 0.253 & 0.455 & 0.784 & 0.383 & 0.536 & 15 & 18 & 29 & 8 & 0.68 & 0.55\\
+    & rf & 0.261 & 0.851 & 0.135 & 0.462 & 0.702 & 0.149 & 0.333 & 6 & 7 & 40 & 17 & 0.65 & 0.69\\
+    & reglog & 0.391 & 0.723 & 0.116 & 0.409 & 0.708 & 0.277 & 0.400 & 9 & 13 & 34 & 14 & 0.64 & 0.47\\
+\addlinespace
+19    & rrlda & 0.533 & 0.391 & -0.075 & 0.364 & 0.562 & 0.609 & 0.432 & 24 & 42 & 27 & 21 & 0.47 & 0.41\\
+    & nb & 0.489 & 0.565 & 0.053 & 0.423 & 0.629 & 0.435 & 0.454 & 22 & 30 & 39 & 23 & 0.54 & 0.48\\
+    & rf & 0.244 & 0.739 & -0.018 & 0.379 & 0.600 & 0.261 & 0.297 & 11 & 18 & 51 & 34 & 0.54 & 0.52\\
+    & reglog & 0.267 & 0.754 & 0.023 & 0.414 & 0.612 & 0.246 & 0.324 & 12 & 17 & 52 & 33 & 0.51 & 0.32\\
+\bottomrule{}
+\end{tabular}
+
+\end{table}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{dataset1_nb_feature_exploration}
+    \caption{dataset1-nb-feature-exploration}
+    \label{fig:dataset1-nb-feature-exploration}
+\end{figure}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{dataset2_nb_feature_exploration}
+    \caption{dataset2-nb-feature-exploration}
+    \label{fig:dataset2-nb-feature-exploration}
+\end{figure}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{second_visit_change1}
+    \caption{second-visit-change1}
+    \label{fig:second-visit-change1}
+\end{figure}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{second_visit_change2}
+    \caption{second-visit-change1}
+    \label{fig:second-visit-change1}
+\end{figure}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{cor_dataset1}
+    \caption{cor-dataset1}
+    \label{fig:cor-dataset1}
+\end{figure}
+
+\begin{figure}[htpb]
+    \centering
+    \includegraphics[width=\textwidth]{cor_dataset2}
+    \caption{cor-dataset2}
+    \label{fig:cor-dataset2}
+\end{figure}
+
+
+\end{document}
author	Mike Vink <mike1994vink@gmail.com>	2021-05-02 17:46:19 +0200
committer	Mike Vink <mike1994vink@gmail.com>	2021-05-02 17:46:19 +0200
commit	db35b461da9db8c1b4cc6a09245a9771fd956eeb (patch)
tree	858171bcd369bb9189b3857478ba7ba6b93d32bb /data_preparation_modeling/main.tex
parent	bf1adece8aeb48e136085233d2f5ff2f9600eaf5 (diff)