If you want to download or save this thesis you can use the pdf link or the epub format too. The thesis has many links to make it easier to reach external resources, if printed they will display as blue text.

# Chapter 4 Results

## 4.1 Packages/methods

### 4.1.1 experDesign

experDesign package built in R was released for the first time on CRAN on 2020-09-08 after nearly a year after the initial release made on github. After peer review it was published on a journal on 2021-11 [194].

The package uses functional programming to create and modify objects and the features used. The package bases its performance on the large body of work made by the R core team. It adds the information to the introduced data.frame or returns an vector with the appropriate information.

experDesign functions are divided into several categories:

• Helper functions to aid on deciding how many batches are or how many samples per batch. There are some also that report how good a given distribution of the samples felt for a given dataset.

• Functions generating indexes.

• Functions distributing the samples on indexes

Regarding time related variables experDesign will use them as factors, while issuing a warning to the user.

Since its development it has been used on a couple of RNA sequencing experiments that required a batch design, one of organoids bulk RNAseq (data not related to this thesis) and another one of biopsies bulk RNAseq from the Biologic Assesment of Response and relation to Clinical activity, Endoscopic and radiologic Lesions, in inflammatory bowel disease patients ON Anti TNF-therapies (BARCELONA) cohort (See appendix section D.1). It was also used to check if there is any observable batch effect on the datasets analyzed.

On the designed datasets experDesign avoided batch effects from the sequencing process. However, on the organoids dataset, a change on the matrigel used to produce them introduced a batch effect that made it impossible to compare samples before and after that change (there were not any shared sample before and after the change of matrigel). On the BARCELONA cohort there were other problems described on the appendices (section D.1).

Since its release on CRAN it has had a median of ~400 downloads each month from RStudio repository mirror, showing the interest the community have on solutions like this.

### 4.1.2 BaseSet

BaseSet package, built in R, was released for the first time on CRAN on 2020-11-11, nearly two years after the initial work started on github.

The package uses both functional programming and object oriented program to create and modify the TidySet S4 object defined7. Mixing it with S3 generic functions it provides a powerful interface compatible with the tidyverse principles, a group of packages following the same design. The package provides a new class to handle fuzzy sets and the associate information.

BaseSet methods are divided into several categories:

• General functions to create sets of the TidySet class or convert from it to a list or about the package.

• Set operations like adjacency cartesian product, cardinality, complement, incidence, independence, intersection, union, subtract, power set or size.

• Functions to work with TidySets to add relationships, sets, elements or some complimentary data about them. Remove the same or simply move around data or calculate the number of elements, relations and sets.

• Functions to read files from formats where sets are usually stored in the bioinformatician field: GAF, GMT and OBO formats.

• Last, some utility functions to use set name conventions and other auxiliary functions.

The package had a long development process with initial iterations based on GSEABase package which was later abandoned (GSEAdv) to also include some uncertainty on the relationship of a gene with a given gene set.

The package formed part of an exploration of the Bioconductor community (project to develop, support, and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assay) for more modern and faster handling of sets than GSEABase. There were three different packages created as part of this process, BaseSet, BiocSet released on Bioconductor and unisets, available on github. The three different approaches were presented at a birds of feather on BioC2019, the annual conference of Bioconductor on 2019.

The package passed the review on the rOpenSci organization (See review) and is now part of the packages hosted there too.

Since its release on CRAN it has had a median close to ~400 downloads each month from RStudio package manager.

### 4.1.3 inteRmodel

The package was build once the method used to find accurate models of the relationships of the data available of a dataset using RGCCA was established. Using the package on github simplifies the process and makes easier to redo the model optimization used on this thesis.

The package has functions that can be grouped in three categories:

• Look for models and evaluate them: To search for a model given some conditions, such as that all the blocks are connected, and check the models via bootstrapping or leave one out procedures.

• Reporting: To make better reports by improving handling of names or simplifying the objects or how to calculate scores.

• Building: To easier build correct models on RGCCA, simplifying the process to create a symmetric matrix.

Currently it is only available on github, so the number of downloads and usage is unknown but since its release a user has contacted to keep it up to date with development versions of RGCCA. Currently, it is compatible with the next release of RGCCA being prepared8.

The functions analyze helps to analyze code of a single integration, providing the results on a tidy format for further processing. To create the connections between blocks the function weight_design is available. It creates all the possible matrices given a number of blocks and a number of weights. Optionally it can create just a subset of those based on a numeric vector. However, it does not provide a way to have the design named.
If the user wants to create their own design matrices, they can use symm and modify the design of the model with subSymm. symm, takes an initial matrix to pick up the row and column names. It is recommended that the user checks the design matrix is fully connected, which the package facilitates with the function correct. This is also recommended even if the design matrices are created with weight_design.

To search models search_models starts with a initial connectivity of the blocks and creates all the combinations of connections given.

For the bootstrap procedure there is the function boot_index to create the bootstrapped index of samples to be used by boot_index_sgcca. boot_index randomly selects as much samples as specified by the arguments to create as much indices as the required by the second argument. If the bootstrapped samples used is not important one can use boot_samples_sgcca. If the users want to perform a leave-one-out procedure they can use looIndex.

For more information, you can access the manual online or once it is installed.

The package cannot choose which variables use from the block with information to split into several blocks. However, it provides the model_RGCCA function to make it easy to prepare such variables for RGCCA input. The inteRmodel procedure will only be useful if the user should decide which variables are independent from others and split them into different blocks. To asses this the user can use the methods we used, as described on the above section 3.5. It is important to keep in mind the possible causal relationships on user’s data [197].

## 4.2 Analysis

On the following sections the main results of analyzing each dataset are presented.

### 4.2.1 Puget’s dataset

On this dataset the different parameters and capabilities of RGCCA were tested.

The three different methods, centroid, factorial or horst were tested and compared. The main result of this comparison was that the differences of the selection of the variables mattered more than the number of variables selected with each method. The models were tested with different weights on all three schemes: horst, centroid and factorial. The horst and the centroid scheme were similar while the factorial resulted in the most different AVE values (see S1 Data of [156]). The centroid scheme was selected because it takes into account all the relationship regardless of the sign of the canonical correlation between the blocks. It is similarity to horst scheme.

The effect of the sparsity value was measured by its effect on the inner AVE scores and the combination of the different values for each block as can be seen in figure 4.3.

The first model of the family of models 1 can be seen in table 4.1. The gene expression (GE) is related to the metadata available. Similarly, CGH is also related to the location data available. All other relationships are not included on the model.

Table 4.1: Model 1 for Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. The matrix represent relations between the gene expression and CGH with the metadata.
GE 0 0 1
CGH 0 0 1

When looking for the model that adjust better following this structure we arrived to model 1.2, described in (Table 4.2). In this model the CGH relationship with the location variables is much weaker than GE with them.

Table 4.2: Model 1.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. CGH has a weaker relationship with the metadata than GE.
GE 0 0.0 1.0
CGH 0 0.0 0.1

On model 2 we split the metadata variables into invariable variables and those related to the location (Table 4.3). We relate the GE and CGH to these blocks and between them.

Table 4.3: Model 2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2 GE CGH Invariable Location
GE 0 1 1 1
CGH 1 0 1 1
Invariable 1 1 0 0
Location 1 1 0 0

Following this split, the model that has higher inner AVE for these blocks is the following (Table 4.4).

Table 4.4: Model 2.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2.2 GE CGH Invariable Location
GE 0 1/3 0 1
CGH 1/3 0 1/3 0
Invariable 0 1/3 0 0
Location 1 0 0 0

If we added a superblock with all the data of the different blocks from model 1 we started with the standard relationship between blocks (Table 4.5).

Table 4.5: Model with superblock for the Puget dataset. Relationships between the different blocks in the Puget dataset for model superblock. 0 indicates no relationship and 1 indicates a strong relationship. The superblock is related to all other blocks but they are only related to the superblock.
Model superblock GE CGH Superblock metadata
GE 0 0 1 0
CGH 0 0 1 0
Superblock 1 1 0 1

But when the best model with the superblock that had highest inner AVE is quite different (Table 4.6). It does not have a relationship between the superblock and the GE and the metadata is related to GE and CGH too.

Table 4.6: Model with superblock.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model superblock.2. 0 indicates no relationship and 1 indicates a strong relationship. In this optimized design the GE and the metadata is related to GE and CGH too. The relationship between GE and CGH is weaker than other relations.
Model superblock.2 GE CGH Superblock metadata
GE 1 1/3 0 1
CGH 1/3 0 1 0
Superblock 0 1 0 0

Exploratory analysis with the superblock model was done. The first two components of the superblock did not help to explain the biology or classify the tumors (See figure 4.4).

The same data was used to look for a good model from the data itself including a model with a superblock but looking at the first component of the CGH and transcriptome block. This allowed to visually inspect if each model’s components helped to classify the samples (Figure 4.5).

The first components of the CGH and the transcriptomics blocks of the superblock.2 model show better classification than that of the superblock. However, the other models show a better classification of the samples with much simpler models.

To find these models the three blocks with the best tau and the centroid scheme were analyzed by changing the weights between 0 and 1 by 0.1 intervals. According to the inner AVE, the best model was the one in which the weights (1) between the host transcriptome and location, (2) the host transcriptome and the CGH, and (3) the CGH block were linked to variables related to the location with weights of 1, 0.1 and 0.1, respectively.

When we added a superblock to the data, there was a slight increase of 0.01 on the inner AVE of the model (See Table 4.7). The model with the superblock that explained most of the variance was that in which the weights of the interaction within (1) the host transcriptome, (2) between the superblock and the CGH, (3) between the host transcriptome and the localization, and (4) between acr{CGH} and the host transcriptome were 1, 1, 1 and 1/3, respectively (See table 4.6). To see if the superblock could classify the sample by location, we plotted the first two components of the superblock.

We can clearly see that they do not classify the samples according to the location of the tumor, which is known to affect the tumor phenotype [149].

Adding one block containing the age of the patient and the severity of the tumor to the model, decreased the inner AVE. The best model with these blocks, according to the inner AVE, was that in which the interactions (1) within the host transcriptome, (2) between the host transcriptome and the localization, (3) between the host transcriptome and(4) the CGH and between the CGH and the other variables were 1, 1, 1/3 and 1/3, respectively. The first components of each model can be seen in the figure:

We can observe on the figure 4.5, the strong dependency between gene expression and location since the first model while the weaker relationship with the CGH assay [149]. On the other hand, the major difference is the dispersion on the CGH component on each model.

We explored the effect of the superblock and weights on different models to the inner AVE. There are significant differences between having the superblock and not having it. There is also some significant increase of the inner AVE when adding a superblock.

The different models resulted on the following AVE values as reported in table 4.7.

Table 4.7: AVE values of RGCCA models in the Puget dataset. Values for both inner and outer AVE of the first canonical component of models 1, 1.2, 2 2.2 and superblock and superblock.2 are shown. The model with the highest inner AVE is the model 1.2 and superblock.2 of those with superblock.
Model inner AVE outer AVE
1 0.6333592 0.0692097
1.2 0.8512360 0.0692319
2 0.2791546 0.0738695
2.2 0.6902329 0.0692707
superblock 0.7055847 0.0734578
superblock.2 0.8047477 0.0695821

The model 1.2, is the one with the highest inner AVE, followed by model superblock.2. All of them classify the samples according to the type of tumor by the first component of the transcriptome.

### 4.2.2 HSCT dataset

The PERMANOVA analysis was performed on this dataset to estimate which were the variables that are more relevant. From the many variables the location, sex, patient id and others were found to be related to the variability of the microbiome or the transcriptome on this dataset.

With the PERMANOVA analysis we found that more of the 50% of the variance of normalized RNAseq data and microbiome data respectively is explained by the variables of location, disease, sex, and the interaction between disease and sex. On the transcriptome the most important factor is location which is more than 15% of the variance, while on the microbiome data the most important factor is the patient id followed by location of the sample.

Table 4.8: PERMANOVA analysis of the transcriptome in the HSCT dataset. The variables and their interactions (shown with :) and the $$R^2$$ values and the associated p-value. The higher the $$R^2$$ the more variance is explained by that factor. Location and IBD are the biggest variables affecting the transcriptome, but overall they explain a small part of all the variability.
Factor R p-value
Location 0.18057 0.00020
IBD 0.03120 0.00020
Sex 0.01306 0.00120
IBD:Sex 0.01279 0.11798
Location:IBD 0.02427 0.11458
Location:Sex 0.00816 0.03519
Location:IBD:Sex 0.00486 0.52190
Residuals 0.72508
Table 4.9: PERMANOVA analysis of the microbiome in the HSCT dataset. The variables and their interactions (shown with :) and the $$R^2$$ values and the associated p-value. The higher the $$R^2$$ the more variance is explained by that factor. Location and IBD are the biggest variables affecting the microbiome, but overall they explain a small part of all the variability (smaller than with the gene expression).
Factor R p-value
Location 0.06061 0.0002000
IBD 0.04967 0.0002000
Sex 0.01712 0.0003999
IBD:Sex 0.01091 0.6604679
Location:IBD 0.02089 0.8476305
Location:Sex 0.01139 0.0075918
Location:IBD:Sex 0.00289 0.9994001
Residuals 0.82652

Both tables (4.8 and 4.9) show that location and IBD is a variable that affects the GE and microbiome, but there are some interactions that explain similar percentage of variance.

With globaltest the results were similar: sex, IBD, location, age and time since diagnosis were able to explain the simple endoscopic score for Crohn’s disease (p-value $$5.7 \cdot 10^{-21}$$ ). The resulting p-value was well below the 0,05 threshold defined for RNAseq data on the models including the segment of the sample, sex and treatment.

On the microbiome data the results were similar but the p-value was considerably higher but still below the threshold.

Diversity indices of the samples were explored and compared for several subsets. Splitting by location of the sample and disease provided the highest differences and the diversity index along time did not change much.

The PCA didn’t show any pattern on the microbiome according to the location, as can be seen in the first dimensions of the PCA in figure 4.8.

The PCA on the transcriptome shows a clear pattern splitting by location of the sample on the first dimension (see figure 4.9).

Weighted gene co-expression network analysis did not provide relevant links between bacteria and transcriptome as it failed to find an acceptable scale free degree. As can be seen in the figure 4.10, the scale free topology does not reach the recommended threshold of 0.9 and the mean connectivity is also very low even for the first power.

STATegRa was used between stool 16S data and intestinal 16S data under the assumption that there is a shared common factor without influence of other categorical variables. However, it did not find a good agreement between these two data sources and 16S data source was not longer used on the analysis. In addition, the model is fixed, so it did not allow to find new or other relationships that are not one to one.

With RGCCA we could select different models and use all the data available without much assumptions. The models with the highest inner AVE of the family 1 and the family 2 models were similar to those on the Häsler dataset.

The weights of these models can be observed here in table 4.10.

Table 4.10: Relationships between the different blocks in the HSCT dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship.
Model 0 Transcriptome Microbiome
Transcriptome 0 1
Microbiome 1 0

If we include the information about the samples all together in a block called metadata we can start from this model in table 4.11.

Table 4.11: elationships between the different blocks in the HSCT dataset for model 1.1. 0 indicates no relationship and 1 indicates a strong relationship.
Transcriptome 0 0 1
Microbiome 0 0 1

When looking for the model that adjust better following this blocks we arrived to model 1.2 thanks to inteRmodel, described in table 4.12.

Table 4.12: Relationships between the different blocks in the HSCT dataset for model 1.2. 0 indicates no relationship and 1 indicates a strong relationship.
Transcriptome 0 0.0 1.0
Microbiome 0 0.0 0.1

On model two we split the invariable variables from those related to the location:

Table 4.13: Relationships between the different blocks in the HSCT dataset for model 2. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2 Transcriptome Microbiome Demographic Location Time
Transcriptome 0 1 1 1 0
Microbiome 1 0 1 1 0
Demographic 1 1 0 0 1
Location 1 1 0 0 0
Time 0 0 1 0 0

Following this split, we used inteRmodel (See section above 4.1.3) to find the model that has higher inner AVE for these blocks is the one in table 4.14.

Table 4.14: Relationships between the different blocks in the HSCT dataset for model 2.2. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2.2 Transcriptome Microbiome Demographic Location Time
Transcriptome 0 0.0 0.0 1.0 0.0
Microbiome 0 0.0 0.2 0.1 0.0
Demographic 0 0.2 0.0 0.0 0.6
Location 1 0.1 0.0 0.0 0.0
Time 0 0.0 0.6 0.0 0.0

We also tested specifically a model from the family 2.3, which can be seen in table 4.15.

Table 4.15: Relationships between the different blocks in the HSCT dataset for model 2.3. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2.3 Transcriptome Microbiome Demographic Location Time
Transcriptome 0.0 0.1 0.0 1.0 0
Microbiome 0.1 0.0 0.1 0.1 0
Demographic 0.0 0.1 0.0 0.0 1
Location 1.0 0.1 0.0 0.0 0
Time 0.0 0.0 1.0 0.0 0

The best model of the family 2 confirmed a relationship between the host transcriptome and the location-related variables, while the microbiome was associated with the demographic and location-related variables (see figure 4.11 and S2 data of [156]). Overall, we see that the relationships in the model affected the distribution of samples on the components of both the host transcriptome and the microbiome.

Table 4.16: The models in the HSCT and their AVE values. For each model the inner AVE and the outer AVE is presented.
Model inner AVE outer AVE
0.0 0.3999234 0.1001689
1.0 0.6230190 0.0842333
1.1 0.5678189 0.0848714
1.2 0.7043881 0.0775766
2.0 0.2517363 0.0982050
2.1 0.6940253 0.0940266
2.2 0.8187640 0.0941628
2.3 0.7761846 0.0943938

The different models selected different variables, some of which are shared between models. The most similar models are those that have split the metadata into 3 blocks, followed by those that have the metadata in a single block.

In order to analyze the accuracy of the models, one thousand bootstraps were used to integrate the data from the HSCT CD dataset. Each bootstrap had its own dispersion on the variables according to the samples selected, the distribution of the bootstraps used are represented here:

Evaluating the same model on each bootstrap lead to a dispersion on the inner AVE of the model. The lower the dispersion, the more robust the model was to different conditions than in the initial testing.

With the bootstrapped models we used BaseSet to estimate the probability that each variable to be relevant for the association with a disease. However, due to big amount of small probabilities when using the BaseSet package to calculate which variables are more relevant it could not provide a good estimation on time.

MCIA was applied as a baseline of the integration, the first two components were represented similarly to those of the blocks when using RGCCA, see figure 4.14 of its outcome.

The AUC of classifying the transcriptome in colon or ileum segments was compared between the models (see figure 4.15 and with MCIA).

These models have the following AUC to classify the location of the sample according to the first component of the transcriptome block.

Table 4.17: AUC values of RGCCA models of the HSCT dataset classyfing the location of the sample according to the first component of the transcriptome block. From model 0 to model 2.3, the best classification is achieved with model 2.1. Note: this is removing two samples for which the location is unkown.
Model AUC
0 0.4537037
1 0.4309414
1.1 0.4639275
1.2 0.5958719
2 0.9450231
2.1 0.9988426
2.2 0.9980710
2.3 0.9969136

On MCIA the AUC for the classification of ileum or colon samples is of 0.9851 once those two samples with unknown location are excluded. This is on par with the models of family 2 as can be seen in the table 4.17.

The different models selected different variables as can be seen in figure 4.16.

Differences and similarities between the selected features of each model can be observed in figure 4.16. Genes are very similar between model 0 to 1.2 and between 2 to 2.3, while OTUs are very unique on model 0 and others shared between most models.

### 4.2.3 Häsler’s dataset

If we look at the dataset 16S, the PCA does not show a pattern as can be seen in figure 4.17.

The PCA on the transcriptome shows a distinction between colon and ileum according to the first dimension of the PCA that can be seen in figure 4.18.

In this dataset, the parameter tau behaved slightly differently than with the previous dataset but the value from the Schäfer’s method for tau was close to the best value.

Models for Häsler dataset are the following:

Table 4.18: Model 0 of the Häsler dataset. 0 indicates no relationship and 1 indicates a strong relationship.
Model 0 RNAseq micro
RNAseq 0 1
micro 1 0

The first model for family 1 is in table 4.19.

Table 4.19: Model 1.1 of the Häsler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNASeq and microbiota (micro) are related to the metavariables (meta).
Model 1.1 RNAseq micro meta
RNAseq 0 0 1
micro 0 0 1
meta 1 1 0

The after optimization of the model of family 1, the best one is in table 4.20.

Table 4.20: Model 1.2 of the Häsler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNASeq and microbiota (micro) are related to the metavariables (meta), the former much more than the microbiota.
Model 1.2 RNAseq micro meta
RNAseq 0 0.0 1.0
micro 0 0.0 0.1
meta 1 0.1 0.0

The first model for family 2 is in table 4.21.

Table 4.21: Model 2.1 of the Häsler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNAseq is related to the location, microbiota to location and demographic variables and demographic variables are also related to time.
Model 2.1 RNAseq micro Location Demographic Time
RNAseq 0 0.0 1.0 0 0
micro 0 0.0 0.5 1 0
Location 1 0.5 0.0 0 0
Demographic 0 1.0 0.0 0 1
Time 0 0.0 0.0 1 0

After optimization of models of family 1 the best model according to the inner AVE score is in table 4.22.

Table 4.22: Model 2.2 of the Häsler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNAseq is related to the microbiota and location (much stronger), the microbiota to the demographic and time (stronger) and demographic and time are linked too.
Model 2.2 RNAseq micro Location Demographic Time
RNAseq 0.0 0.1 1 0.0 0.0
micro 0.1 0.0 0 0.1 1.0
Location 1.0 0.0 0 0.0 0.0
Demographic 0.0 0.1 0 0.0 0.1
Time 0.0 1.0 0 0.1 0.0

In table 4.23, we can see here the AVE scores of each of the previous models.

Table 4.23: AVE values of RGCCA models in the Häsler dataset. The inner and the outer AVE scores of multiple models tested on the Häsler dataset are shown. The model with the hightest inner AVE is model 1.2 followed by model 2.2.
Model inner AVE outer AVE
0 0.8217371 0.0961236
1.1 0.7461423 0.1024148
1.2 0.8349410 0.1025486
2.1 0.4980681 0.1008395
2.2 0.7513065 0.1009915

In contrast to the HSCT’s dataset (table 4.16), the model with the highest inner AVE was model 1.2 but model 2.2 was close to it (see table 4.23). Model 2.2 has a relationship of 0.1 between microbiome and the host transcriptome and of 1 between the location and the host transcriptome. The microbiome block is also related by a factor of 0.1 with the demographic block and of 1 with the time block. Lastly, the time and the demographic block are related by a factor of a 0.1. In either case the family 1 and family 2 models can correctly separate by sample location (colon or ileum) but not by disease type or inflammation status as can be seen in figure 4.20.

There is no observable cluster of IBD samples and the other samples, showing that on this dataset the differences of the microbiome between the different type of samples are less stark. The classification of samples was very accurate in all the models, specially on model 2.2, see figure 4.21.

This accuracy resulted on high AUC values for all the models, as can be seen in table 4.24.

(ref:hasler-auc-s) AUC of the RGCCA models in Häsler’s dataset. (ref:hasler-auc) The AUC was calculated with the first dimension of the gene expression block ability to predict location of the sample. Models of family 2 reach 100% classification.

Table 4.24: (ref:hasler-auc-s) (ref:hasler-auc)
Model AUC
0 0.8011494
1.1 0.9781609
1.2 0.9977011
2.1 1.0000000
2.2 1.0000000

MCIA was applied as a baseline of the integration and compared to the different models to know which one separates best colon and ileum samples. The result on the first two dimension is shown in figure 4.22.

MCIA’s AUC results was as high as the model 2.2 to classify samples according to their location. It was even better to classify the samples according to the type of sample they are: 0.6248 vs 1 the best AUC from RGCCA that corresponds to model 1.2.

### 4.2.4 Morgan’s dataset

To explore this dataset that is different from the others with IBD.

The PCA didn’t show any pattern on the microbiome according to the location, see figure 4.23.

The PCA on the transcriptome did not show a clear distinction between the PPI and the pouch in figure 4.24 but there is a pattern.

We tested if results of inteRmodel were consistent on this dataset with the other datasets. The first model we tried is model 0 as in table 4.25.

Table 4.25: Relationships between the different blocks in the Morgan dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and microbiome are linked.
Model 0 Transcriptome Microbiome
Transcriptome 0 1
Microbiome 1 0

We then added the data about the samples as provided 3.1.4, on a simple model as in table 4.26.

Table 4.26: Relationships between the different blocks in the Morgan dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and microbiome are related to the metadata.
Transcriptome 0 0 1
Microbiome 0 0 1

When looking for the model that adjust better following this structure we arrived to model 1.2, described in table 4.27.

Table 4.27: Model 1.2 of the Morgan dataset. Relationships between the different blocks in the Morgan dataset for model 1.2. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome is related to the microbiome but weakly and the microbiome is also related to the metadata.
Transcriptome 0.0 0.1 0
Microbiome 0.1 0.0 1

On model two we split the invariable variables from those related to the location (see table 4.28).

Table 4.28: Relationships between the different blocks in the Morgan dataset for model 2. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and the microbiome are linked and related to demographic and location variables.
Model 2 Transcriptome Microbiome Demographic Location
Transcriptome 0 1 1 1
Microbiome 1 0 1 1
Demographic 1 1 0 0
Location 1 1 0 0

The model that has higher inner AVE for these blocks is the following:

Table 4.29: Relationships between the different blocks in the Morgan dataset for model 2.2. 0 indicates no relationship and 1 indicates a strong relationship. The transcriptome is weakly related to the microbiome and location but much stronger related to the demographic variables.
Model 2.2 Transcriptome Microbiome Demographic Location
Transcriptome 0.0 0.1 1 0.1
Microbiome 0.1 0.0 0 0.0
Demographic 1.0 0.0 0 0.0
Location 0.1 0.0 0 0.0

Each model is different from previous models. After model 2.2 we looked on the model similar to model 2.3 in the HSCT dataset showed but it is the same as in model 2.2. However, it is kept on the further analysis.

The different models were not able to separate the samples neither by location or sex.

Nevertheless, we compared the classification with the MCIA algorithm and still resulted that model 2.2 provide a better classification than MCIA.

Table 4.30: AVE values of RGCCA for the Morgan dataset. The inner and the outer AVE scores of multiple models tested on the Morgan dataset are shown. The model with the highest inner AVE is model 1.2 followed by 2.2.
Model inner AVE outer AVE
0.0 0.4735601 0.1098639
1.0 0.6333592 0.1152280
1.1 0.2448234 0.1104746
1.2 0.7868443 0.0422660
2.0 0.4404123 0.1088730
2.1 0.6052598 0.1074900
2.2 0.6895661 0.1081315
2.3 0.6895661 0.1081315

When exploring the bootstraps of the data we found that model 1.2 is highly variable:

In addition the model 2.2 (table 4.29) usually has a lower inner AVE compared to model 1.2 (table 4.28).

The area under the curve for these models are:

Table 4.31: AUC for the Morgan dataset classificating the localization of the sample according to the first component of the gene expression of the models generated with RGCCA.
Model AUC
0 0.4969734
1 0.7934971
1.1 0.8161536
1.2 0.5606192
2 0.8546351
2.1 0.8473712
2.2 0.8352646
2.3 0.8352646

With MCIA we can observe that the location does not separate the samples (see figure 4.28).

If we quantify this separation by the first dimension of MCIA, the AUC is 0.818, which is slightly worst than the models of family 2.

### 4.2.5 Howell’s dataset

The 16S of this dataset doesn’t show any clear pattern regarding the location of the samples according to the firsts dimensions of the PCA on figure 4.29.

The PCA on the transcriptome shows a distinction between colon and ileum, see figure 4.30, there are almost two distinct groups according to location.

This dataset was processed to confirm the results on the previous datasets. As always first we started with model 0, see table 4.32, connecting both the RNAseq and the 16S blocks.

Table 4.32: Relationships between the different blocks in the Howell dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship.
Model 0 RNAseq 16S
RNAseq 0 1
16S 1 0

Later we look for the best model of family 1 (without looking at any previous model of family 1). This resulted on the following model in table 4.33.

Table 4.33: Relationships between the different blocks in the Howell dataset for model 1.2. 0 indicates no relationship and 1 indicates a strong relationship.
RNAseq 0.0 0.1 1
16S 0.1 0.0 0

Model 1.2, in table 4.33, was the best according to the AVE score but perform worse when attempting to recreate known biological differences via classifying samples as we can see in table 4.31.

Model 2.2 was selected for further analysis as it describes more accurately the biology of the dataset it. Model 2.2 can be seen in table 4.34.

Table 4.34: Relationships between the different blocks in the Howell dataset for model 2.2. 0 indicates no relationship and 1 indicates a strong relationship.
Model 2.2 RNAseq 16S demographics location
RNAseq 0 0 0.0 1.0
16S 0 0 1.0 0.0
demographics 0 1 0.0 0.1
location 1 0 0.1 0.0

Model 1.2 has a 0.1 relationship between the ASV and the transcriptome and 1 between transcriptome and metadata. While model 2.2 has a relationship of 1 between location and transcriptome and demographics and ASV but only of 0.1 between demographics and location.

Table 4.35: AVE values of RGCCA for the Howell dataset. The inner and the outer AVE scores of multiple models tested in the Howell dataset are shown. The model with the hightest inner AVE is model 1.2.
Model inner AVE outer AVE
0.0 0.7180980 0.1112390
1.2 0.8972258 0.1660267
2.2 0.8433274 0.1659844

The bootstrapping showed that model 1.2 has indeed higher inner AVE values than model 2.2 and is more stable than model 1.2. While model 0 shows a high variation according to which samples are selected.

If we look at the classification of the models in figure 4.34, we can see that models 1.2, 2 and 2.2 classify perfectly the samples by the transcriptome into the location of the sample.

The AUC of each model can be seen in table 4.36.

Table 4.36: AUC of the RGCCA models in the Howell dataset. The classification of the location of the sample according to the first componenent of the models shown. Model 0, 2 and 2.2 have a perfect classification of the samples to their respective location.
model AUC
0 0.6255259
1 0.5974755
1.2 1.0000000
2 1.0000000
2.2 1.0000000

In this dataset we also focused on the most important ASV according to the model 2.2 that were present in more than 2 samples that in total were present in the whole dataset. These ASV were summarized to a single value and then used to calculate the AUC, which was 0.85. The dot product of the ASV and genes were also calculated and used to find out which ASV are related to which genes.

### 4.2.6 Between datasets

The HSCT genes were compared to the Howell’s genes from model 2.2. There are 3580 selected on model 2.2 in the HSCT dataset and 2189 genes on the Howell dataset. From them the 1228 genes in common were analyzed for which GO terms and pathways they are enriched. The results is represented on figure 4.35.

### References

149. Puget S, Philippe C, Bax DA, Job B, Varlet P, Junier M-P, et al. Mesenchymal Transition and PDGFRA Amplification/Mutation Are Key Distinct Oncogenic Events in Pediatric Diffuse Intrinsic Pontine Gliomas. PLOS ONE. 2012;7:e30313.
156. Revilla L, Mayorgas A, Corraliza AM, Masamunt MC, Metwaly A, Haller D, et al. Multi-omic modelling of inflammatory bowel disease with regularized canonical correlation analysis. PLOS ONE. 2021;16:e0246367.
194. Revilla Sancho L, Lozano J-J, Salas A. experDesign: stratifying samples into batches with minimal bias. Journal of Open Source Software. 2021;6:3358.
196. Wickham H. Advanced r. Second Edition. Boca Raton: Chapman & Hall; 2019.
197. Greenland S, Brumback B. An overview of relations among causal modelling methods. International Journal of Epidemiology. 2002;31:1030–7.

1. S4 is one of the object programming paradigms on R. For a more complete overview and differences see Advanced R, freely available online: https://adv-r.hadley.nz/oo.html [196].↩︎

2. We also contributed with some comments and feedback to the package to make it easier to read the source and check the inputs and improve the documentation so that it is coherent with the code and previous results of the functions.↩︎