
B Software

In the years since work on this thesis began, several pieces of software have been developed, including some packages. They are listed here for easier retrieval, in two ways: first with a brief explanation of each, and then ordered by the project or publication for which each was used.

B.1 STAR

The parameters and options used with STAR are as follows:

STAR \
    --outSAMtype BAM SortedByCoordinate \
    --outFilterIntronMotifs RemoveNoncanonical \
    --outSAMattributes All \
    --outReadsUnmapped Fastx \
    --outSAMstrandField intronMotif \
    --outFilterScoreMinOverLread 0.5 \
    --outFilterMatchNminOverLread 0.5 \
    --outFilterType BySJout \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverLmax 0.04 \
    --genomeDir "$genome/STAR" \
    --limitBAMsortRAM 10000000000 \
    --runMode alignReads \
    --genomeLoad NoSharedMemory \
    --quantMode TranscriptomeSAM \
    --outFileNamePrefix "$output" \
    --runThreadN "$threads" \
    --readFilesCommand zcat \
    --readFilesIn "$file1" "$file2"

Here, $genome is the path to the directory where the genome is stored, $output is the prefix of the output files, $threads is the number of threads used, and $file1 and $file2 are the paired FASTQ files (gzip-compressed, since they are read through zcat).
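As an illustration of how these variables might be filled in for one sample, the sketch below derives $file2 and $output from $file1. The `_R1`/`_R2` naming scheme, the directory names, and the helper functions `mate_of` and `prefix_for` are assumptions made for this example; they are not taken from the analysis itself.

```shell
#!/usr/bin/env bash
# Sketch only: file naming (<sample>_R1.fastq.gz / <sample>_R2.fastq.gz) and
# directory layout are hypothetical, not the layout used in the thesis.

# Print the path of the second mate for a given first-mate file.
mate_of() {
    printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz"
}

# Print a STAR output prefix of the form <outdir>/<sample>_ for a first-mate file.
prefix_for() {
    sample="$(basename "$1" _R1.fastq.gz)"
    printf '%s/%s_\n' "$2" "$sample"
}

file1="fastq/sampleA_R1.fastq.gz"        # hypothetical input file
file2="$(mate_of "$file1")"
output="$(prefix_for "$file1" "aligned")"
echo "file1=$file1 file2=$file2 output=$output"
```

Because STAR appends its own file names directly to the prefix, ending the prefix with a separator such as `_` or `/` keeps the resulting file names readable.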

B.2 RSEM

RSEM was run with the code below, where $threads is the number of threads used, $rseminp is the input file in BAM format, $genome is the path to the directory where the genome is stored, and $rsem is the prefix of the output files.

rsem-calculate-expression \
    --quiet \
    --paired-end \
    -p "$threads" \
    --estimate-rspd \
    --append-names \
    --no-bam-output \
    --bam "$rseminp" "$genome/RSEM/RSEM" "$rsem"
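The two steps can be chained through file names. With `--quantMode TranscriptomeSAM`, STAR writes a transcriptome-coordinate BAM named `Aligned.toTranscriptome.out.bam` after the output prefix, and RSEM writes a `.genes.results` table after its sample name. The sketch below assumes that $rseminp is that STAR file; the helper functions and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the RSEM input is the transcriptome BAM from the STAR step.

# Path of the transcriptome BAM that STAR writes for a given --outFileNamePrefix.
star_transcriptome_bam() {
    printf '%sAligned.toTranscriptome.out.bam\n' "$1"
}

# Path of the gene-level results table RSEM writes for a given sample name.
rsem_gene_results() {
    printf '%s.genes.results\n' "$1"
}

output="aligned/sampleA_"                # hypothetical STAR output prefix
rsem="quant/sampleA"                     # hypothetical RSEM sample name
rseminp="$(star_transcriptome_bam "$output")"
echo "RSEM input:  $rseminp"
echo "RSEM output: $(rsem_gene_results "$rsem")"
```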

B.3 Listed

An improved and tested version of RGCCA, which includes some modifications to internal functions to simplify maintenance, as well as additional tests and better documentation. Moreover, it has been modified so that a vector of models can be used, whereby the model of the first dimension need not be the same as the model of the second dimension (mathematically speaking, we cannot attest to its coherence, but from a biological standpoint we believe such a version of RGCCA might prove very useful).

We coded the package inteRmodel to make the bootstrapping and model selection for RGCCA easier and more readily accessible.

A package to assist in batch design in order to avoid batch effects: see experDesign and its corresponding website on GitHub.

The sgcca_hyperparameters repository explores the effects of the hyperparameters of RGCCA on the gliomaData dataset (originally provided here).

We used a pouchitis cohort published in this article[150] to compare the effectiveness of our method on another group's dataset. The code used can be found at this repository.

Some functions used to explore the TRIM dataset were incorporated into the integration package. These include functions for correlation, network analysis, enrichment, and normalization of metadata, among other components.

We developed a package to analyze both sets and fuzzy sets; see BaseSet, which is based on what we learned from a previous iteration of the GSEAdv package. This package was intended to be used with those probabilities that arise from bootstrapping the models. However, due to the prolonged calculation times required, ultimately it was not used.

A separate repository was created to analyze the BARCELONA cohort (also designated antiTNF) using the previously developed packages.

B.4 By project/publication

All of the code underlying the analyses in our publications is available (in its messy state, with its complicated history), together with a brief description of each piece of code:

Multi-omic modelling of inflammatory bowel disease with regularized canonical correlation analysis:

TRIM: Data cleaning of the samples and the dataset, and exploration of several methods…
Puget’s: Explore the effects of the hyperparameters on RGCCA with the provided dataset.
inteRmodel: Package for easily reproducing the methodology developed with TRIM.
Morgan’s: Work with the pouchitis cohort used in this article.
Häsler’s: Work with the UC/CD dataset used in this article.
integration: Package that incorporates the functions we wrote or used for the different aspects of exploring the TRIM dataset.

BaseSet:

BaseSet: Fuzzy logic implementation, also available at rOpenSci; see also its corresponding documentation website.

experDesign:

experDesign: Can assist in the design of batch experiments; it also has a documentation website.

BARCELONA:

BARCELONA: Code for analyzing the BARCELONA dataset.

Validation:

Howell’s: Code to work with Howell’s 2018 dataset.
Cristian’s: Code to work with Cristian’s 2020 dataset.

References

150. Morgan XC, Kabakchiev B, Waldron L, Tyler AD, Tickle TL, Milgrom R, et al. Associations between host gene expression, the mucosal microbiome, and clinical outcome in the pelvic pouch of patients with inflammatory bowel disease. Genome Biology. 2015;16:67.