| author | Mike Vink <mike1994vink@gmail.com> | 2021-04-20 23:29:04 +0200 |
|---|---|---|
| committer | Mike Vink <mike1994vink@gmail.com> | 2021-04-20 23:29:04 +0200 |
| commit | 4c3bbd54b8cfb45cd59666cade8b6ee5b18075ac (patch) | |
| tree | 8dcf5425cb9ea95deee7e3ba5da6b4ed6dba9299 | |
| parent | 3944d5be52441755eabf357d6b3fcbc1d6779211 (diff) | |
check
| mode | file | changes |
|---|---|---|
| -rw-r--r-- | data_understanding/main.log | 81 |
| -rw-r--r-- | data_understanding/main.pdf | bin, 139495 -> 121689 bytes |
| -rw-r--r-- | data_understanding/main.run.xml | 4 |
| -rw-r--r-- | data_understanding/main.tex | 100 |
4 files changed, 145 insertions, 40 deletions
diff --git a/data_understanding/main.log b/data_understanding/main.log
index 45360cb..4237119 100644
--- a/data_understanding/main.log
+++ b/data_understanding/main.log
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020/Arch Linux) (preloaded format=pdflatex 2021.4.18) 19 APR 2021 15:21
+This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020/Arch Linux) (preloaded format=pdflatex 2021.4.18) 20 APR 2021 15:57
 entering extended mode
 restricted \write18 enabled.
 %&-line parsing enabled.
@@ -820,32 +820,67 @@ Package hyperref Info: bookmark level for unknown lstlisting defaults to 0 on i
 nput line 43.
 LaTeX Font Info: Font shape `OT1/cmtt/bx/n' in size <10> not available
 (Font) Font shape `OT1/cmtt/m/n' tried instead on input line 44.
- [2] [3] [4] (./main.aux)
+
+
+LaTeX Warning: Citation 'tomicFluPRINTDatasetMultidimensional2019' on page 2 un
+defined on input line 81.
+
+
+LaTeX Warning: Citation 'tomicSIMONAutomatedMachine2019' on page 2 undefined on
+ input line 83.
+
+[2]
+
+LaTeX Warning: Citation 'chattopadhyaySinglecellTechnologiesMonitoring2014' on
+page 3 undefined on input line 130.
+
+
+LaTeX Warning: Citation 'galliEndOmicsHigh2019' on page 3 undefined on input li
+ne 144.
+
+[3]
+
+LaTeX Warning: Citation 'simoniMassCytometryPowerful2018' on page 4 undefined o
+n input line 166.
+
+
+LaTeX Warning: Empty bibliography on input line 187.
+
+[4] (./main.aux)
 Package rerunfilecheck Info: File `main.out' has not changed.
-(rerunfilecheck) Checksum: 704777A26054F7379D922AFB0865A143;1000.
+(rerunfilecheck) Checksum: 5FCAD57201DE24963A12A273942BBD89;375.
+
+
+LaTeX Warning: There were undefined references.
+
+
+Package biblatex Warning: Please (re)run Biber on the file:
+(biblatex) main
+(biblatex) and rerun LaTeX afterwards.
+
 Package logreq Info: Writing requests to 'main.run.xml'.
 )
 Here is how much of TeX's memory you used:
- 19947 strings out of 479383
- 331336 string characters out of 5875830
- 1227405 words of memory out of 5000000
- 36719 multiletter control sequences out of 15000+600000
- 409628 words of font info for 49 fonts, out of 8000000 for 9000
+ 19896 strings out of 479383
+ 329957 string characters out of 5875830
+ 1208475 words of memory out of 5000000
+ 36678 multiletter control sequences out of 15000+600000
+ 409316 words of font info for 48 fonts, out of 8000000 for 9000
 1141 hyphenation exceptions out of 8191
- 98i,6n,100p,1658b,1572s stack positions out of 5000i,500n,10000p,200000b,80000s
-</usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb></usr/share/
-texmf-dist/fonts/type1/public/amsfonts/cm/cmbx12.pfb></usr/share/texmf-dist/fon
-ts/type1/public/amsfonts/cm/cmcsc10.pfb></usr/share/texmf-dist/fonts/type1/publ
-ic/amsfonts/cm/cmmi10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm
-/cmr10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr12.pfb></us
-r/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr17.pfb></usr/share/texmf-d
-ist/fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-dist/fonts/type
-1/public/amsfonts/cm/cmti10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfo
-nts/cm/cmtt10.pfb>
-Output written on main.pdf (4 pages, 139495 bytes).
+ 98i,6n,100p,1068b,1446s stack positions out of 5000i,500n,10000p,200000b,80000s
+{/usr/share/texmf-dist/fonts/enc/dvips/cm-super/cm-super-ts1.enc}</usr/share/
+texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb></usr/share/texmf-dist/fon
+ts/type1/public/amsfonts/cm/cmbx12.pfb></usr/share/texmf-dist/fonts/type1/publi
+c/amsfonts/cm/cmmi10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/
+cmr10.pfb></usr/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr12.pfb></usr
+/share/texmf-dist/fonts/type1/public/amsfonts/cm/cmr17.pfb></usr/share/texmf-di
+st/fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-dist/fonts/type1
+/public/amsfonts/cm/cmtt10.pfb></usr/share/texmf-dist/fonts/type1/public/cm-sup
+er/sfrm1000.pfb>
+Output written on main.pdf (4 pages, 121689 bytes).
 PDF statistics:
- 148 PDF objects out of 1000 (max. 8388607)
- 130 compressed objects within 2 object streams
- 25 named destinations out of 1000 (max. 500000)
- 81 words of extra memory for PDF output out of 10000 (max. 10000000)
+ 104 PDF objects out of 1000 (max. 8388607)
+ 88 compressed objects within 1 object stream
+ 16 named destinations out of 1000 (max. 500000)
+ 41 words of extra memory for PDF output out of 10000 (max. 10000000)
diff --git a/data_understanding/main.pdf b/data_understanding/main.pdf
Binary files differ
index a971395..61cf069 100644
--- a/data_understanding/main.pdf
+++ b/data_understanding/main.pdf
diff --git a/data_understanding/main.run.xml b/data_understanding/main.run.xml
index 9bab742..0d76913 100644
--- a/data_understanding/main.run.xml
+++ b/data_understanding/main.run.xml
@@ -41,7 +41,7 @@
 >
 ]>
 <requests version="1.0">
- <internal package="biblatex" priority="9" active="0">
+ <internal package="biblatex" priority="9" active="1">
 <generic>latex</generic>
 <provides type="dynamic">
 <file>main.bcf</file>
@@ -62,7 +62,7 @@
 <file>english.lbx</file>
 </requires>
 </internal>
- <external package="biblatex" priority="5" active="0">
+ <external package="biblatex" priority="5" active="1">
 <generic>biber</generic>
 <cmdline>
 <binary>biber</binary>
diff --git a/data_understanding/main.tex b/data_understanding/main.tex
index 9879cb5..5e30de2 100644
--- a/data_understanding/main.tex
+++ b/data_understanding/main.tex
@@ -15,7 +15,7 @@
 \section{Initial data collection}
-\subsection{Technicalities}
+\subsection{Technical description data collection}
 \subsubsection{MySQL database set up and data import}
@@ -55,20 +55,90 @@ using \lstinline{php bin/import.php}.
 \subsection{Data Requirements}
-The following subsections will list the attributes required from the data per
-data mining goal.
-
-\subsubsection{Explore and describe SQL queries and corresponding csv tables.}
-
-\subsubsection{Model and visualise the different clinical study populations.}
-
-\subsubsection{Model and visualise the difference between vaccination types.}
-
-\subsubsection{Model and visualise repeat vaccination effects.}
-
-\subsubsection{Apply standard feature selection methods to the most interesting dataset.}
-
-\subsubsection{Fit classification models to the most interesting dataset.}
+The following subsections will list the information required from the data per
+data mining goals that are needed to answer the following business questions:
+
+\begin{itemize}
+ \item Which datasets in the FluPrint database are most interesting?
+ \item How do different clinical studies compare?
+ \item What are the differences in efficacy between vaccination types?
+ \item What is the effect of repeat vaccination on vaccine response?
+ \item What immunological factors correlate to a high vaccine response?
+\end{itemize}
+
+\subsubsection{Requirements per data mining goal}
+
+\begin{displayquote}
+"Explore and describe SQL queries and corresponding csv tables."
+\end{displayquote}
+
+Falling under this data mining objective are the outputs and tasks related to
+data collection and description. These comprise a report on the initial
+collection of the data, selection of data, and description of general
+properties of the data. The data in this case is in a database format, thus
+here we describe the tables, keys, and attributes in the database, and also
+include descriptive statistics about the data. The goal is to replicate the
+description done in \cite{tomicFluPRINTDatasetMultidimensional2019} as well.
+Using these descriptions we provide insight into which datasets in the database
+are most interesting, and why in \cite{tomicSIMONAutomatedMachine2019} one
+dataset in particular was chosen.
+
+\begin{displayquote}
+"Model and visualise the different clinical study populations."
+\end{displayquote}
+\begin{displayquote}
+"Model and visualise the difference between vaccination types."
+\end{displayquote}
+\begin{displayquote}
+"Model and visualise repeat vaccination effects."
+\end{displayquote}
+
+In order to answer the business question "How do clinical studies compare?"
+subpopulations and groups of attributes need to be visualised and compared
+across different clinical studies. The data required must have rows
+corresponding to donors in a particular clinical study and columns that are
+attributes of tables in the database, these could be biological assay results
+or information about the donors. Thus we aimed to export one csv from the
+database per clinical study by querying for different clinical studies.
+
+We aimed to generate csv files of donors corresponding to received vaccine
+types to answer the business question "What are the differences in efficacy
+between vaccination types?". One simple method to indicate the difference
+between vaccines would be to report the proportion of high-reponders across all
+donors, or to use a simple model to find the best predictor for a high
+response. These comparisons require one table per vaccine type, with rows
+corresponding to donors and columns that include the vaccine response
+classification, in addition to other immune assay and donor attributes.
+
+The objective in question "What is the effect of repeat vaccination on vaccine
+response?" requires data from long running clinical studies. One dataset that
+is used by the database authors and was investigated to answer this question
+was already available, here we aimed to describe and visualise any patterns we
+could find in this dataset and other long running clinical study datasets. This
+required data from a subset of clinical studies that spanned multiple years, at
+this point in the project the data for these clinical studies should have been
+available, and we just had to choose those that spanned multiple years.
+
+\begin{displayquote}
+"Apply standard feature selection methods to the most interesting dataset."
+\end{displayquote}
+
+\begin{displayquote}
+"Fit classification models to the most interesting dataset."
+\end{displayquote}
+
+These last two data mining objectives were chosen to comprise the data
+preparation and modelling phases of this project. The authors of fluprint set
+up an automated machine learning pipeline to investigate the longest running
+clinical study dataset in the database. In this work we use a conventional data
+mining modelling process to replicate these results. This dataset contains any
+immunological assay results for all donors in the clinical study on their first
+vist, and their classification as a high or low vaccine responder. To fulfill
+the above two data mining goals, we used this dataset.
+
+\section{Data description}
+
+\subsection{Volumetric analysis}
 \citep{chattopadhyaySinglecellTechnologiesMonitoring2014}
 The complex heterogeneity of cells, and their interconnectedness with each
