authorMike Vink <mike1994vink@gmail.com>2021-04-28 18:21:50 +0200
committerMike Vink <mike1994vink@gmail.com>2021-04-28 18:21:50 +0200
commitde4565fe9290ec1f1031eed6f7d067794df53166 (patch)
tree1211ea302fc77e4ce51a7bd88c6eff8abee3eb85 /data_understanding/main.tex
parent2676115f77f0052902e1dcc0632420341464373d (diff)
begin data prep
Diffstat (limited to 'data_understanding/main.tex')
-rw-r--r--data_understanding/main.tex40
1 files changed, 27 insertions, 13 deletions
diff --git a/data_understanding/main.tex b/data_understanding/main.tex
index 4f0b965..e9bfcfa 100644
--- a/data_understanding/main.tex
+++ b/data_understanding/main.tex
@@ -374,7 +374,12 @@ The example of donor 166 contains an inconsistency in the classification, in
as a low responder \autoref{tbl:visit166}. Because of this the seasonal
classification of donors was investigated using the seroprotection and
seroconversion criteria \ref{fig:seasonalClasses}, records of incorrectly
-labelled donors are also saved as a spreadsheet.
+labelled donors are also saved as a spreadsheet. This classification appears
+inconsistent in the database, but the most likely explanation is that the
+antibody titer for one virus strain did not meet the seroprotection criterion.
+It appears inconsistent because per-strain titer data are not given, so it is
+not necessarily incorrect. Hence the classification will be used in this work
+without further selection.
\subsubsection{Experimental data table}
@@ -443,11 +448,18 @@ The number of features collected for each visit is large and varies greatly
every assay is done in every clinical study \autoref{fig:featureNumbers} and
over the years the data generated by assays has changed, so a table with all
features as columns and all donors as rows would be extremely sparse (and
-crashes R due to RAM limitations). Describing the 3285 different
-features in this sparse table would be impossible, but assay value
-distributions across studies are shown to follow normal or power distributions
-\autoref{fig:assayDistr}. Global correlation analysis is complicated by the
-great number of features and sparseness in the data.
+crashes R due to RAM limitations). Describing the 3285 different features in
+this sparse table would be impossible, but assay value distributions across
+studies are shown to follow normal or power distributions
+\autoref{fig:assayDistr}. The features included 102 blood-derived immune cell
+subsets analyzed by mass cytometry. They also included the signaling capacity
+of over 30 immune cell subsets stimulated with seven conditions, which was
+evaluated by measuring the phosphorylation of nine proteins. Additionally, up
+to 50 serum analytes were evaluated using Luminex bead arrays
+\citep{tomicSIMONAutomatedMachine2019}.
+
+No correlation analysis was done, since this is complicated by the large number
+of features and the sparseness of the data.
\begin{figure}
\includegraphics[width=\textwidth]{repeat_visits_per_study}
@@ -477,13 +489,15 @@ the vaccine response to null if there is not enough assay data measured.
The database has issues that are inherent to combining multiple studies and the
classification is inconsistent in some cases \autoref{fig:classInconsistent},
or often missing completely because no HAI antibody assay data was available or
-the classification was set to a null value by the database authors
-\autoref{fig:repeatVisits}. The main value of the database is the assay data
-that is fully represented in all studies and across all years, but this
-information is hard to access since all studies do not use overlapping assays
-\autoref{fig:featureNumbers}, resulting in high sparsity data. Further, the
-sample size that can be used for further studies is limitted, since the high
-versus low vaccine response is only available for a small subset of the data.
+the classification was set to a null value by the database authors, possibly
+because the antibody titer for a single virus strain in the vaccine was too
+low (this data is not in the database) \autoref{fig:repeatVisits}. The main
+value of the database is the assay data that is fully represented in all
+studies and across all years, but this information is hard to access since not
+all studies use overlapping assays \autoref{fig:featureNumbers}, resulting in
+highly sparse data. Further, the sample size that can be used for further
+studies is limited, since the high versus low vaccine response is only
+available for a small subset of the data.
Specific attributes that have great amounts of missing values are the
virological and HAI assay data, the last is used for the vaccine response