summaryrefslogtreecommitdiff
path: root/data_preparation_modeling/main.tex
blob: 4b72ed4eded431034a9022b760407a5fa8b168fa (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
% hello
\input{../preamble.tex}

\makeglossaries
\input{../bussiness_glossary.tex}
\input{../data_mining_glossary.tex}
\input{../acronyms.tex}

\begin{document}
\MyTitle{Data Preparation and Modelling Report}
\tableofcontents
\printglossary[type=bus]
\printglossary[type=dm]
\printglossary[type=\acronymtype]

\section{Data selection}

The data that we use in this work is based on the data used in the simon
manuscript \citep{tomicSIMONAutomatedMachine2019}. This subset of the fluprint
database comprises data from 5 clinical studies, most importantly the
longitudal study SLVP015. It only uses the first visits of donors, as the
classification is the most complete in this dataset
\autoref{fig:dataRepeatvisits}. We will use this dataset to model the high
vaccine response with as predictors the in total 3285 features measured in
assays done by the clinical studies. The data will be prepared in the same way
as in the original work, using the mulset algorithm
\citep{tomicSIMONAutomatedMachine2019}.

\begin{figure}
    \includegraphics[width=\textwidth]{data_selection}
    \caption{caption}\label{fig:dataRepeatvisits}
\end{figure}

In addition to repeating a similar procedure as in
\cite{tomicSIMONAutomatedMachine2019}, we will compare the values of features
selected by the models trained on the first visit data to those of second visit
data. Initially the plan was to train new models on the second visit data,
however the classes are extremely unbalanced in the data of repeat visits
\autoref{fig:dataRepeatvisits}. For example in the first visit there are 65
high responders and 130 low responders, in the second visit there is only data
available for 6 high responders and 44 low responders. Therefore we will only
train model on first visit data, and use the knowledge gained to explore second
visit data.

\section{Clean data}

The goal is to obtain one or more tables from the simon data suitable to train
models, thus we are looking to change the data from the long format as in the
database in a wide format where each column is a features measured in an assay.
When attempting to do this it was discovered that some assay data contained
duplicate readouts \autoref{tbl:exampleDuplicate}. Since the values were all
similar it was decided to aggregate the values to unique features using the
mean value.

\begin{table}
\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed
\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed
\begin{tabular}{rrrrrlrllrrl}
\toprule{}
donor\_id & study & age & outcome & year & type & hai\_response & name & data\_name & assay & data & dup\\
\midrule{}
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.8 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.1 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.3 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.0 & TRUE\\
\bottomrule{}
\end{tabular}
    \caption{}\label{tbl:exampleDuplicate}
\end{table}

The obtained first visit wide data had dimensions of 195x3284 (donors by
measured features), with 596736 missing value cells (93\% sparsity). The second
visit data had dimensions of 50x3251, with a lower sparsity (58\%) since the donor
population is smaller and they are from the same clinical study.

\subsection{Mulset alogrithm}

Following the procedure in \cite{tomicSIMONAutomatedMachine2019} we deal with
this data sparsity by applying the mulset algorithm created by
\cite{tomicSIMONAutomatedMachine2019}. This is necessary since data is missing
in every column and the lack of prior knowledge doesn't allow for imputation of
missing values, precluding conventional measures of missing data cleaning.

\begin{figure}[ht]
    \includegraphics[width=\textwidth]{F2.large}
    \caption{\textbf{taken from original work}}\label{fig:mulsetAlg}
\end{figure}

The mulset algorithm uses the intersection of features sets of donors to
calculate pairwise shared feature sets. For every shared feature set it then
retrieves all donors that have values for these features \autoref{fig:mulsetAlg}.

\begin{lstlisting}[caption=Applying the mulset algorithm, label={lst:mulsetStep}]
% Step 1: generate re-sampled intersection datasets suitable for analysis
for {each subject in data} do:
	Calculate intersection between subject and all other subjects using mulset algorithm
	Skip sets that have less than 5 features and less than 15 donors in common
end for;
# Save all shared intersections to corresponding datasets
\end{lstlisting}

Applying the algorithm resulted in 47 different datasets without missing values
which contained a subset of donors and features, and the vaccine response
classification. Further, the same criteria as in the original work were
applied, the number of datasets was filtered down to 36 by excluding datasets
that had less than 15 donors or less than 5 features \autoref{lst:mulsetStep}.

To prepare the datasets for modelling they were partitioned into trianing and
test sets consisting of 75\% and 25\% of the data respectively. To ensure that
out of sample point estimates were not based on nonsense, datasets with a test
set containing less than 10 donors/rows were discarded. The resulting number of
cleaned datasets for modelling purposes was 20 \autoref{tbl:mulsetDatasets}. A
significant number of datasets contained more predictors than samples, however
we consider this as an inevitible phenomenon and not an absolute obstacle since
the purpose of the models is not to discriminate vaccine responders with the
highest accuracy, but to select features from that correlate with a vaccine
response from the great number of features.

\begin{table}
    \begin{tabularx}{\textwidth}{XXXXX}
\toprule{}
dataset & Rows x Cols & total (low / high (low \%)) & train (low / high) & test (low / high)\\
\midrule{}
1 & 61 x 78 & 43 / 18 ( 0.7 ) & 33 / 14 & 10 / 4\\
2 & 105 x 101 & 62 / 43 ( 0.59 ) & 47 / 33 & 15 / 10\\
3 & 140 x 50 & 94 / 46 ( 0.67 ) & 71 / 35 & 23 / 11\\
4 & 63 x 269 & 38 / 25 ( 0.6 ) & 29 / 19 & 9 / 6\\
5 & 62 x 293 & 38 / 24 ( 0.61 ) & 29 / 18 & 9 / 6\\
\addlinespace
6 & 68 x 237 & 42 / 26 ( 0.62 ) & 32 / 20 & 10 / 6\\
7 & 67 x 44 & 47 / 20 ( 0.7 ) & 36 / 15 & 11 / 5\\
8 & 111 x 93 & 66 / 45 ( 0.59 ) & 50 / 34 & 16 / 11\\
9 & 73 x 54 & 58 / 15 ( 0.79 ) & 44 / 12 & 14 / 3\\
10 & 40 x 105 & 28 / 12 ( 0.7 ) & 21 / 9 & 7 / 3\\
\addlinespace
11 & 46 x 97 & 32 / 14 ( 0.7 ) & 24 / 11 & 8 / 3\\
12 & 137 x 53 & 78 / 59 ( 0.57 ) & 59 / 45 & 19 / 14\\
13 & 48 x 42 & 35 / 13 ( 0.73 ) & 27 / 10 & 8 / 3\\
14 & 91 x 38 & 62 / 29 ( 0.68 ) & 47 / 22 & 15 / 7\\
15 & 42 x 37 & 36 / 6 ( 0.86 ) & 27 / 5 & 9 / 1\\
\addlinespace
16 & 92 x 26 & 62 / 30 ( 0.67 ) & 47 / 23 & 15 / 7\\
17 & 88 x 6 & 68 / 20 ( 0.77 ) & 51 / 15 & 17 / 5\\
18 & 82 x 87 & 56 / 26 ( 0.68 ) & 42 / 20 & 14 / 6\\
19 & 151 x 51 & 92 / 59 ( 0.61 ) & 69 / 45 & 23 / 14\\
20 & 83 x 75 & 56 / 27 ( 0.67 ) & 42 / 21 & 14 / 6\\
\bottomrule{}
\end{tabularx}
    \caption{caption}\label{tbl:mulsetDatasets}
\end{table}


\section{Modelling}

The modelling techniques of choice were to be resistent to the "too many
features" problem and suitable for selecting features in an embedded based
approach \citep{hiraReviewFeatureSelection2015}. Technically, the approach used
here is a wrapper approach since we are using the mulset algorithm to generate
different subsets of features and training machine learning models on those
features. However, in this work we train three models that have an embedded
mechanism for obtaining the most important predictors of vaccine response. This
is done for every feature set, and manually we chose the best and most
interesting trained models and their obtained features. The end goal was to
identify important features and investigating the change in those features for the
second visit/influenza season of donors.

\begin{table}
\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed
\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed
\begin{tabular}{llrrrrrrrrrrrrr}
\toprule{}
dataset &model & SENS & SPEC & MCC & PREC & NPV & FPR & F1 & TP & FP & TN & FN & train AUC & test AUC\\
\midrule{}
14    & rrlda & 0.091 & 0.915 & 0.010 & 0.333 & 0.683 & 0.085 & 0.143 & 2 & 4 & 43 & 20 & 0.50 & 0.62\\
    & nb & 0.636 & 0.702 & 0.321 & 0.500 & 0.805 & 0.298 & 0.560 & 14 & 14 & 33 & 8 & 0.67 & 0.59\\
    & rf & 0.364 & 0.851 & 0.243 & 0.533 & 0.741 & 0.149 & 0.432 & 8 & 7 & 40 & 14 & 0.65 & 0.61\\
    & reglog & 0.227 & 0.766 & -0.007 & 0.312 & 0.679 & 0.234 & 0.263 & 5 & 11 & 36 & 17 & 0.49 & 0.48\\
\addlinespace
16    & rrlda & 0.000 & 1.000 & NaN & NaN & 0.671 & 0.000 & 0.000 & 0 & 0 & 47 & 23 & 0.48 & 0.61\\
    & nb & 0.652 & 0.617 & 0.253 & 0.455 & 0.784 & 0.383 & 0.536 & 15 & 18 & 29 & 8 & 0.68 & 0.55\\
    & rf & 0.261 & 0.851 & 0.135 & 0.462 & 0.702 & 0.149 & 0.333 & 6 & 7 & 40 & 17 & 0.65 & 0.69\\
    & reglog & 0.391 & 0.723 & 0.116 & 0.409 & 0.708 & 0.277 & 0.400 & 9 & 13 & 34 & 14 & 0.64 & 0.47\\
\addlinespace
19    & rrlda & 0.533 & 0.391 & -0.075 & 0.364 & 0.562 & 0.609 & 0.432 & 24 & 42 & 27 & 21 & 0.47 & 0.41\\
    & nb & 0.489 & 0.565 & 0.053 & 0.423 & 0.629 & 0.435 & 0.454 & 22 & 30 & 39 & 23 & 0.54 & 0.48\\
    & rf & 0.244 & 0.739 & -0.018 & 0.379 & 0.600 & 0.261 & 0.297 & 11 & 18 & 51 & 34 & 0.54 & 0.52\\
    & reglog & 0.267 & 0.754 & 0.023 & 0.414 & 0.612 & 0.246 & 0.324 & 12 & 17 & 52 & 33 & 0.51 & 0.32\\
\bottomrule{}
\end{tabular}

\end{table}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{dataset1_nb_feature_exploration}
    \caption{dataset1-nb-feature-exploration}
    \label{fig:dataset1-nb-feature-exploration}
\end{figure}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{dataset2_nb_feature_exploration}
    \caption{dataset2-nb-feature-exploration}
    \label{fig:dataset2-nb-feature-exploration}
\end{figure}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{second_visit_change1}
    \caption{second-visit-change1}
    \label{fig:second-visit-change1}
\end{figure}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{second_visit_change2}
    \caption{second-visit-change1}
    \label{fig:second-visit-change1}
\end{figure}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{cor_dataset1}
    \caption{cor-dataset1}
    \label{fig:cor-dataset1}
\end{figure}

\begin{figure}[htpb]
    \centering
    \includegraphics[width=\textwidth]{cor_dataset2}
    \caption{cor-dataset2}
    \label{fig:cor-dataset2}
\end{figure}


\end{document}