author     Mike Vink <mike1994vink@gmail.com>   2021-04-20 23:29:04 +0200
committer  Mike Vink <mike1994vink@gmail.com>   2021-04-20 23:29:04 +0200
commit     4c3bbd54b8cfb45cd59666cade8b6ee5b18075ac (patch)
tree       8dcf5425cb9ea95deee7e3ba5da6b4ed6dba9299 /data_understanding
parent     3944d5be52441755eabf357d6b3fcbc1d6779211 (diff)
check
Diffstat (limited to 'data_understanding')
-rw-r--r--  data_understanding/main.log      |  81
-rw-r--r--  data_understanding/main.pdf      |  bin 139495 -> 121689 bytes
-rw-r--r--  data_understanding/main.run.xml  |   4
-rw-r--r--  data_understanding/main.tex      | 100
4 files changed, 145 insertions, 40 deletions
diff --git a/data_understanding/main.log b/data_understanding/main.log
index 45360cb..4237119 100644
--- a/data_understanding/main.log
+++ b/data_understanding/main.log
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020/Arch Linux) (preloaded format=pdflatex 2021.4.18) 19 APR 2021 15:21
+This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020/Arch Linux) (preloaded format=pdflatex 2021.4.18) 20 APR 2021 15:57
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -820,32 +820,67 @@ Package hyperref Info: bookmark level for unknown lstlisting defaults to 0 on i
nput line 43.
LaTeX Font Info: Font shape `OT1/cmtt/bx/n' in size <10> not available
(Font) Font shape `OT1/cmtt/m/n' tried instead on input line 44.
- [2] [3] [4] (./main.aux)
+
+
+LaTeX Warning: Citation 'tomicFluPRINTDatasetMultidimensional2019' on page 2 un
+defined on input line 81.
+
+
+LaTeX Warning: Citation 'tomicSIMONAutomatedMachine2019' on page 2 undefined on
+ input line 83.
+
+[2]
+
+LaTeX Warning: Citation 'chattopadhyaySinglecellTechnologiesMonitoring2014' on
+page 3 undefined on input line 130.
+
+
+LaTeX Warning: Citation 'galliEndOmicsHigh2019' on page 3 undefined on input li
+ne 144.
+
+[3]
+
+LaTeX Warning: Citation 'simoniMassCytometryPowerful2018' on page 4 undefined o
+n input line 166.
+
+
+LaTeX Warning: Empty bibliography on input line 187.
+
+[4] (./main.aux)
Package rerunfilecheck Info: File `main.out' has not changed.
-(rerunfilecheck) Checksum: 704777A26054F7379D922AFB0865A143;1000.
+(rerunfilecheck) Checksum: 5FCAD57201DE24963A12A273942BBD89;375.
+
+
+LaTeX Warning: There were undefined references.
+
+
+Package biblatex Warning: Please (re)run Biber on the file:
+(biblatex) main
+(biblatex) and rerun LaTeX afterwards.
+
Package logreq Info: Writing requests to 'main.run.xml'.
)
Here is how much of TeX's memory you used:
- 19947 strings out of 479383
- 331336 string characters out of 5875830
- 1227405 words of memory out of 5000000
- 36719 multiletter control sequences out of 15000+600000
- 409628 words of font info for 49 fonts, out of 8000000 for 9000
+ 19896 strings out of 479383
+ 329957 string characters out of 5875830
+ 1208475 words of memory out of 5000000
+ 36678 multiletter control sequences out of 15000+600000
+ 409316 words of font info for 48 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
- 98i,6n,100p,1658b,1572s stack positions out of 5000i,500n,10000p,200000b,80000s
-</usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb></usr/share/
-texmf-dist/fonts/type1/public/amsfonts/cm/cmbx12.pfb></usr/share/texmf-dist/fon
-ts/type1/public/amsfonts/cm/cmcsc10.pfb></usr/share/texmf-dist/fonts/type1/publ
-ic/amsfonts/cm/cmmi10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm
-/cmr10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr12.pfb></us
-r/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr17.pfb></usr/share/texmf-d
-ist/fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-dist/fonts/type
-1/public/amsfonts/cm/cmti10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfo
-nts/cm/cmtt10.pfb>
-Output written on main.pdf (4 pages, 139495 bytes).
+ 98i,6n,100p,1068b,1446s stack positions out of 5000i,500n,10000p,200000b,80000s
+{/usr/share/texmf-dist/fonts/enc/dvips/cm-super/cm-super-ts1.enc}</usr/share/
+texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb></usr/share/texmf-dist/fon
+ts/type1/public/amsfonts/cm/cmbx12.pfb></usr/share/texmf-dist/fonts/type1/publi
+c/amsfonts/cm/cmmi10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/
+cmr10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr12.pfb></usr
+/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr17.pfb></usr/share/texmf-di
+st/fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-dist/fonts/type1
+/public/amsfonts/cm/cmtt10.pfb></usr/share/texmf-dist/fonts/type1/public/cm-sup
+er/sfrm1000.pfb>
+Output written on main.pdf (4 pages, 121689 bytes).
PDF statistics:
- 148 PDF objects out of 1000 (max. 8388607)
- 130 compressed objects within 2 object streams
- 25 named destinations out of 1000 (max. 500000)
- 81 words of extra memory for PDF output out of 10000 (max. 10000000)
+ 104 PDF objects out of 1000 (max. 8388607)
+ 88 compressed objects within 1 object stream
+ 16 named destinations out of 1000 (max. 500000)
+ 41 words of extra memory for PDF output out of 10000 (max. 10000000)
diff --git a/data_understanding/main.pdf b/data_understanding/main.pdf
index a971395..61cf069 100644
--- a/data_understanding/main.pdf
+++ b/data_understanding/main.pdf
Binary files differ
diff --git a/data_understanding/main.run.xml b/data_understanding/main.run.xml
index 9bab742..0d76913 100644
--- a/data_understanding/main.run.xml
+++ b/data_understanding/main.run.xml
@@ -41,7 +41,7 @@
>
]>
<requests version="1.0">
- <internal package="biblatex" priority="9" active="0">
+ <internal package="biblatex" priority="9" active="1">
<generic>latex</generic>
<provides type="dynamic">
<file>main.bcf</file>
@@ -62,7 +62,7 @@
<file>english.lbx</file>
</requires>
</internal>
- <external package="biblatex" priority="5" active="0">
+ <external package="biblatex" priority="5" active="1">
<generic>biber</generic>
<cmdline>
<binary>biber</binary>
diff --git a/data_understanding/main.tex b/data_understanding/main.tex
index 9879cb5..5e30de2 100644
--- a/data_understanding/main.tex
+++ b/data_understanding/main.tex
@@ -15,7 +15,7 @@
\section{Initial data collection}
-\subsection{Technicalities}
+\subsection{Technical description of data collection}
\subsubsection{MySQL database set up and data import}
@@ -55,20 +55,90 @@ using \lstinline{php bin/import.php}.
\subsection{Data Requirements}
-The following subsections will list the attributes required from the data per
-data mining goal.
-
-\subsubsection{Explore and describe SQL queries and corresponding csv tables.}
-
-\subsubsection{Model and visualise the different clinical study populations.}
-
-\subsubsection{Model and visualise the difference between vaccination types.}
-
-\subsubsection{Model and visualise repeat vaccination effects.}
-
-\subsubsection{Apply standard feature selection methods to the most interesting dataset.}
-
-\subsubsection{Fit classification models to the most interesting dataset.}
+The following subsections list the information required from the data for each
+data mining goal; together, these goals answer the following business
+questions:
+
+\begin{itemize}
+ \item Which datasets in the FluPrint database are most interesting?
+ \item How do different clinical studies compare?
+ \item What are the differences in efficacy between vaccination types?
+ \item What is the effect of repeat vaccination on vaccine response?
+ \item What immunological factors correlate to a high vaccine response?
+\end{itemize}
+
+\subsubsection{Requirements per data mining goal}
+
+\begin{displayquote}
+"Explore and describe SQL queries and corresponding csv tables."
+\end{displayquote}
+
+This data mining objective covers the outputs and tasks related to data
+collection and description: a report on the initial collection of the data,
+the selection of data, and a description of its general properties. Because
+the data is stored in a database, we describe its tables, keys, and
+attributes, and include descriptive statistics about the data. The goal is
+also to replicate the description given in
+\cite{tomicFluPRINTDatasetMultidimensional2019}. These descriptions provide
+insight into which datasets in the database are most interesting and why one
+dataset in particular was chosen in \cite{tomicSIMONAutomatedMachine2019}.
+
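+A minimal sketch of how this schema description and the descriptive statistics
+could be produced, assuming a pandas/SQLAlchemy connection and a hypothetical
+\lstinline{donors} table rather than the verified FluPrint schema:
+
+\begin{lstlisting}[language=Python]
+# Sketch: inspect the schema and summarise one table; the connection
+# string and the donors table name are placeholders, not verified names.
+import pandas as pd
+from sqlalchemy import create_engine, inspect
+
+engine = create_engine("mysql+pymysql://user:password@localhost/fluprint")
+
+# List the tables and their columns.
+inspector = inspect(engine)
+for table in inspector.get_table_names():
+    print(table, [column["name"] for column in inspector.get_columns(table)])
+
+# Descriptive statistics for a hypothetical donors table.
+donors = pd.read_sql("SELECT * FROM donors", engine)
+print(donors.describe(include="all"))
+\end{lstlisting}
+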
+\begin{displayquote}
+"Model and visualise the different clinical study populations."
+\end{displayquote}
+\begin{displayquote}
+"Model and visualise the difference between vaccination types."
+\end{displayquote}
+\begin{displayquote}
+"Model and visualise repeat vaccination effects."
+\end{displayquote}
+
+To answer the business question "How do different clinical studies compare?",
+subpopulations and groups of attributes need to be visualised and compared
+across clinical studies. The required data must have rows corresponding to
+donors in a particular clinical study and columns corresponding to attributes
+of tables in the database; these can be biological assay results or
+information about the donors. We therefore aimed to export one csv file per
+clinical study by querying the database for each study.
+
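+A minimal sketch of this export, assuming a hypothetical \lstinline{donors}
+table with a \lstinline{study} column rather than the actual schema:
+
+\begin{lstlisting}[language=Python]
+# Sketch: export one csv per clinical study; the donors table and its
+# study column are assumptions about the schema, not verified names.
+import pandas as pd
+from sqlalchemy import create_engine
+
+engine = create_engine("mysql+pymysql://user:password@localhost/fluprint")
+donors = pd.read_sql("SELECT * FROM donors", engine)
+
+# Write one csv file per clinical study.
+for study, frame in donors.groupby("study"):
+    frame.to_csv(f"study_{study}.csv", index=False)
+\end{lstlisting}
+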
+To answer the business question "What are the differences in efficacy between
+vaccination types?", we aimed to generate one csv file of donors per received
+vaccine type. A simple way to indicate the difference between vaccines would
+be to report the proportion of high responders across all donors for each
+vaccine, or to use a simple model to find the best predictor of a high
+response. These comparisons require one table per vaccine type, with rows
+corresponding to donors and columns that include the vaccine response
+classification in addition to the other immune assay and donor attributes.
+
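+A minimal sketch of the proportion comparison, assuming one csv file per
+vaccine type with a hypothetical binary \lstinline{vaccine_response} column:
+
+\begin{lstlisting}[language=Python]
+# Sketch: proportion of high responders per vaccine type, assuming one csv
+# per vaccine with a binary vaccine_response column (hypothetical name).
+import glob
+import pandas as pd
+
+for path in sorted(glob.glob("vaccine_*.csv")):
+    frame = pd.read_csv(path)
+    share = frame["vaccine_response"].mean()
+    print(f"{path}: {share:.2%} high responders")
+\end{lstlisting}
+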
+The business question "What is the effect of repeat vaccination on vaccine
+response?" requires data from long-running clinical studies. One dataset that
+the database authors used to investigate this question was already available;
+here we aimed to describe and visualise any patterns in this dataset and in
+the other long-running clinical study datasets. This requires data from the
+subset of clinical studies that span multiple years; since the per-study data
+should be available at this point in the project, we only had to select those
+studies.
+
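+A minimal sketch of selecting the multi-year studies, assuming a hypothetical
+visits export with \lstinline{study} and \lstinline{visit_year} columns:
+
+\begin{lstlisting}[language=Python]
+# Sketch: keep only clinical studies spanning multiple years, assuming a
+# visits export with study and visit_year columns (hypothetical names).
+import pandas as pd
+
+visits = pd.read_csv("donor_visits.csv")
+years_per_study = visits.groupby("study")["visit_year"].nunique()
+long_running = years_per_study[years_per_study > 1].index.tolist()
+print(long_running)
+\end{lstlisting}
+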
+\begin{displayquote}
+"Apply standard feature selection methods to the most interesting dataset."
+\end{displayquote}
+
+\begin{displayquote}
+"Fit classification models to the most interesting dataset."
+\end{displayquote}
+
+These last two data mining objectives comprise the data preparation and
+modelling phases of this project. The authors of FluPrint set up an automated
+machine learning pipeline to investigate the longest-running clinical study
+dataset in the database; in this work we use a conventional data mining
+modelling process to replicate their results. This dataset contains the
+immunological assay results for all donors in that clinical study on their
+first visit, together with their classification as a high or low vaccine
+responder. We used this dataset to fulfill the above two data mining goals.
+
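+A minimal sketch of such a feature selection and classification step with
+scikit-learn, assuming hypothetical file and column names and not the
+authors' automated SIMON pipeline:
+
+\begin{lstlisting}[language=Python]
+# Sketch: standard feature selection plus a classifier, assuming a csv with
+# numeric assay columns and a binary vaccine_response label (hypothetical
+# names); this is not the pipeline used by the database authors.
+import pandas as pd
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.feature_selection import SelectKBest, f_classif
+from sklearn.model_selection import cross_val_score
+from sklearn.pipeline import make_pipeline
+
+data = pd.read_csv("most_interesting_dataset.csv")
+X = data.drop(columns=["vaccine_response"])
+y = data["vaccine_response"]
+
+model = make_pipeline(
+    SelectKBest(f_classif, k=10),  # assumes at least ten assay columns
+    RandomForestClassifier(random_state=0),
+)
+print(cross_val_score(model, X, y, cv=5).mean())
+\end{lstlisting}
+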
+\section{Data description}
+
+\subsection{Volumetric analysis}
\citep{chattopadhyaySinglecellTechnologiesMonitoring2014}
The complex heterogeneity of cells, and their interconnectedness with each