% hello
\input{../preamble.tex}

\makeglossaries
\input{../bussiness_glossary.tex}
\input{../data_mining_glossary.tex}
\input{../acronyms.tex}

\begin{document}
\MyTitle{Data Preparation Report}
\tableofcontents
\printglossary[type=bus]
\printglossary[type=dm]
\printglossary[type=\acronymtype]

\section{Data selection}

The data used in this work is based on the data used in the SIMON
manuscript \citep{tomicSIMONAutomatedMachine2019}. This subset of the FluPRINT
database comprises data from 5 clinical studies, most importantly the
longitudinal study SLVP015. It only uses the first visits of donors, as the
classification is most complete in this subset
(\autoref{fig:dataRepeatvisits}). We will use this dataset to model high
vaccine response, using as predictors the 3285 features in total that were
measured in the assays performed by the clinical studies. The data will be
prepared in the same way as in the original work, using the mulset algorithm
\citep{tomicSIMONAutomatedMachine2019}.

\begin{figure}
    \includegraphics[width=\textwidth]{data_selection}
    \caption{caption}\label{fig:dataRepeatvisits}
\end{figure}

In addition to repeating a procedure similar to that of
\cite{tomicSIMONAutomatedMachine2019}, we will compare the values of the
features selected by the models trained on first visit data to their values in
the second visit data. Initially the plan was to train new models on the second
visit data; however, the classes are extremely unbalanced in the repeat visit
data (\autoref{fig:dataRepeatvisits}). For example, in the first visit there
are 65 high responders and 130 low responders, whereas in the second visit data
is available for only 6 high responders and 44 low responders. Therefore we
will only train models on first visit data, and use the knowledge gained to
explore the second visit data.
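
As a rough illustration of this imbalance, and assuming the counts quoted above
are the complete class totals for each visit, the fraction of high responders
is
\[
  \frac{65}{65 + 130} = \frac{65}{195} \approx 33.3\%
  \qquad \mbox{versus} \qquad
  \frac{6}{6 + 44} = \frac{6}{50} = 12\%
\]
for the first and second visit respectively. This corresponds to one high
responder per 2 low responders in the first visit, against roughly one per 7.3
in the second.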

\section{Clean data}

The goal is to obtain one or more tables from the SIMON data that are suitable
for training models. We therefore want to convert the data from the long format
used in the database to a wide format in which each column is a feature
measured in an assay. When attempting this, we discovered that some assay data
contained duplicate readouts (\autoref{tbl:exampleDuplicate}). Since the
duplicate values were all similar, we decided to aggregate them into unique
features using the mean value.

\begin{table}
\addtolength{\leftskip} {-2cm} % increase (absolute) value if needed
\addtolength{\rightskip} {-2cm} % increase (absolute) value if needed
\begin{tabular}{rrrrrlrllrrl}
\toprule{}
donor\_id & study & age & outcome & year & type & hai\_response & name & data\_name & assay & data & dup\\
\midrule{}
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.8 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.1 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 34.3 & TRUE\\
285 & 18 & 9.47 & 0 & 2009 & pre & 1 & CD4+ T cells & CD4\_pos\_T\_cells & 13 & 33.0 & TRUE\\
\bottomrule{}
\end{tabular}
    \caption{Example of duplicate readouts for a single feature of one donor.}\label{tbl:exampleDuplicate}
\end{table}
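
As a worked example of this aggregation, take the four duplicate readouts shown
in \autoref{tbl:exampleDuplicate} and assume they are the only readouts of this
feature for this donor and visit. Writing $\bar{x}$ as shorthand for their mean
(a symbol introduced here only for illustration), the aggregated value is
\[
  \bar{x} = \frac{33.8 + 34.1 + 34.3 + 33.0}{4} = \frac{135.2}{4} = 33.8 ,
\]
so the four rows collapse into a single value of 33.8 for the feature
CD4\_pos\_T\_cells.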

\end{document}