The code at this page can be downloaded here

Before starting, please ensure you have installed the TriplotGUI package following the Setup instructions.

Data description

The data used in this tutorial section can be downloaded here in rda format.

This example utilizes HealthyNordicDiet_2, which is a modified version of the HealthyNordicDiet data provided by the Triplot package (Schillemans et al. 2019). The HealthyNordicDiet is a synthetic dataset that simulates data from a case-control study within the Swedish Västerbotten Intervention Programme cohort, exploring plasma metabolites reflecting healthy Nordic dietary patterns and their association with type 2 diabetes development (Shi et al. 2018). The simulated HealthyNordicDiet_2 data consists of three data frames:

Clinical measurements (ClinData): Contains 11 variables: "ID","T2D"(Cases and controls for type 2 diabetes), "PairedInfo"(Pairing/matching information for case and controls, a unique number is assigned to each case-control pairs),"Energy","Gender","Age","BMI","Smoking","PhysicalActivityIndex","Education","FastingGlucose".
Plasma metabolites predictive of BMI (MetaboliteData): Comprises 20 variables
Dietary intake as measured by food frequency questionnaires (FoodData): Includes 11 variables: "HFI"(Healthy Food Index),"BSDS"(the Baltic Sea Diet Score),"Wholegrains","Sausage","Pizza","Refined.bread", "Fruits","Liquor","Wine","Hamburger","Poultry","Fish","Margarine","Cabbage","Carrot","low_fat_Dairy","Vegetables"

The data frames are row-wise matched by observations and consist of 1000 synthetic observations.

Reference

Shi, L., Brunius, C., Johansson, I., Bergdahl, I. A., Lindahl, B., Hanhineva, K., & Landberg, R. (2018). Plasma metabolites associated with healthy Nordic dietary indexes and risk of type 2 diabetes—a nested case-control study in a Swedish population. The American Journal of Clinical Nutrition, 108(3), 564-575. 10.1093/ajcn/nqy145

Schillemans T, Shi L, Liu X, Åkesson A, Landberg R, Brunius C. Visualization and Interpretation of Multivariate Associations with Disease Risk Markers and Disease Risk-The Triplot. Metabolites. 2019 Jul 6;9(7):133. doi: 10.3390/metabo9070133

Research question

We want to assess the relationship between diet, metabolic profiles and Type 2 diabetes and BMI. We assume that metabolites serves as mediators, mediating the effect of dietary exposures on Type 2 diabetes and BMI.

Data exploration

HealthyNordicDiet_2 is loaded in the R environment upon running library(TriplotGUI). Let’s begin with some data exploration to understand the structure and contents of the dataset:

Check HealthyNordicDiet_2

Check the HealthyNordicDiet_2 list:

Show the code

class(HealthyNordicDiet_2)

[1] "list"

Show the code

names(HealthyNordicDiet_2)

[1] "ClinData"       "MetaboliteData" "FoodData"

Check datasets

Check the names of variables in each data:

Show the code

colnames(HealthyNordicDiet_2$ClinData)

 [1] "ID"                    "T2D"                   "PairedInfo"           
 [4] "Energy"                "Gender"                "Age"                  
 [7] "BMI"                   "Smoking"               "PhysicalActivityIndex"
[10] "Education"             "FastingGlucose"

Show the code

colnames(HealthyNordicDiet_2$MetaboliteData)

 [1] "Pipecolic.acid.betaine"         "lysoPC.18.0."                  
 [3] "HP143.1179.1.55"                "lysoPE.22.6."                  
 [5] "HN151.0065.4.97"                "RP225.1482.9.15"               
 [7] "X3.4.5.Trimethoxycinnamic.acid" "DHA"                           
 [9] "RP383.1671.10.53"               "gama.tocopherol"               
[11] "RP431.3516.11.96"               "RP490.3516.8.99"               
[13] "lysoPC.22.6."                   "Steroid.glucuronide"           
[15] "RP684.5545.13.41"               "RP197.0926.3.17"               
[17] "RP303.6468.9.98"                "RP816.5684.12.52"              
[19] "RP827.5356.11.88"               "RP826.5442.12.17"              
[21] "RN201.1498.8.23"                "RN203.0022.3.00"               
[23] "EPA"                            "PC.18.2.15.0."                 
[25] "RN814.5610.12.14"               "RN153.0190.3.5"                
[27] "RN427.1641.10.59"               "RN457.1742.10.50"              
[29] "RN830.5845.12.46"               "RN446.1541.10.59"              
[31] "RN472.1596.10.53"

Show the code

colnames(HealthyNordicDiet_2$FoodData)

 [1] "HFI"           "BSDS"          "Wholegrains"   "Sausage"      
 [5] "Pizza"         "Refined.bread" "Fruits"        "Liquor"       
 [9] "Wine"          "Hamburger"     "Poultry"       "Fish"         
[13] "Margarine"     "Cabbage"       "Carrot"        "low_fat_Dairy"
[17] "Vegetables"

Check variables’ class

We then transform the data to dataframe format and use TriplotGUI’s checkdata() function to examine the classes of variables.

Show the code

ClinData <- as.data.frame(HealthyNordicDiet_2$ClinData)
MetaboliteData <- as.data.frame(HealthyNordicDiet_2$MetaboliteData)
FoodData <- as.data.frame(HealthyNordicDiet_2$FoodData)

ClinData_check <- checkdata(ClinData)
MetaboliteData_check <- checkdata(MetaboliteData)
FoodData_check <- checkdata(FoodData)

Show the code

ClinData_check$class_summary_statistics

$check_class_vector
                   ID                   T2D            PairedInfo 
            "numeric"              "factor"              "factor" 
               Energy                Gender                   Age 
            "numeric"              "factor"             "numeric" 
                  BMI               Smoking PhysicalActivityIndex 
            "numeric"              "factor"              "factor" 
            Education        FastingGlucose 
             "factor"             "numeric" 

$check_class_table
check_class_vector
 factor numeric 
      6       5

Show the code

MetaboliteData_check$class_summary_statistics

$check_class_vector
        Pipecolic.acid.betaine                   lysoPC.18.0. 
                     "numeric"                      "numeric" 
               HP143.1179.1.55                   lysoPE.22.6. 
                     "numeric"                      "numeric" 
               HN151.0065.4.97                RP225.1482.9.15 
                     "numeric"                      "numeric" 
X3.4.5.Trimethoxycinnamic.acid                            DHA 
                     "numeric"                      "numeric" 
              RP383.1671.10.53                gama.tocopherol 
                     "numeric"                      "numeric" 
              RP431.3516.11.96                RP490.3516.8.99 
                     "numeric"                      "numeric" 
                  lysoPC.22.6.            Steroid.glucuronide 
                     "numeric"                      "numeric" 
              RP684.5545.13.41                RP197.0926.3.17 
                     "numeric"                      "numeric" 
               RP303.6468.9.98               RP816.5684.12.52 
                     "numeric"                      "numeric" 
              RP827.5356.11.88               RP826.5442.12.17 
                     "numeric"                      "numeric" 
               RN201.1498.8.23                RN203.0022.3.00 
                     "numeric"                      "numeric" 
                           EPA                  PC.18.2.15.0. 
                     "numeric"                      "numeric" 
              RN814.5610.12.14                 RN153.0190.3.5 
                     "numeric"                      "numeric" 
              RN427.1641.10.59               RN457.1742.10.50 
                     "numeric"                      "numeric" 
              RN830.5845.12.46               RN446.1541.10.59 
                     "numeric"                      "numeric" 
              RN472.1596.10.53 
                     "numeric" 

$check_class_table
check_class_vector
numeric 
     31

Show the code

FoodData_check$class_summary_statistics

$check_class_vector
          HFI          BSDS   Wholegrains       Sausage         Pizza 
     "factor"     "numeric"     "numeric"     "numeric"     "numeric" 
Refined.bread        Fruits        Liquor          Wine     Hamburger 
    "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
      Poultry          Fish     Margarine       Cabbage        Carrot 
    "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
low_fat_Dairy    Vegetables 
    "numeric"     "numeric" 

$check_class_table
check_class_vector
 factor numeric 
      1      16

Note

Note that in the given data, "HFI"(Healthy Food Index) is a factor variable and "BSDS"(the Baltic Sea Diet Score) is a numeric variable. You could use table(FoodData$HFI) and table(FoodData$BSDS) to check the difference.

Check sanities for variables

To check for missing (NA) or abnormal values (e.g., NaN, negative values, infinite values, blank values) in each variable, you can use the checkdata() function. This function generates a summary table for each data frame, showing the number of observations containing NA, NaN, zero, negative, infinite (Inf), and blank values, along with their percentages.

Show the code

if (!require(kableExtra)) {
  install.packages("kableExtra")  # Install the package if it's not available.

}
library(kableExtra)             # Load it after installation.

Show the code

kable(ClinData_check$everycolumn)

	ID	T2D	PairedInfo	Energy	Gender	Age	BMI	Smoking	PhysicalActivityIndex	Education	FastingGlucose
NA_number	0	0	0	0	0	0	0	0	0	0	0
NA_percentage	0	0	0	0	0	0	0	0	0	0	0
NaN_number	0	0	0	0	0	0	0	0	0	0	0
NaN_percentage	0	0	0	0	0	0	0	0	0	0	0
negative_number	0	0	0	0	0	0	0	0	0	0	0
negative_percentage	0	0	0	0	0	0	0	0	0	0	0
zero_number	0	0	0	0	0	0	0	0	0	0	0
zero_percentage	0	0	0	0	0	0	0	0	0	0	0
inf_number	0	0	0	0	0	0	0	0	0	0	0
inf_percentage	0	0	0	0	0	0	0	0	0	0	0
blank_number	0	0	0	0	0	0	0	0	0	0	0
blank_percentage	0	0	0	0	0	0	0	0	0	0	0

Show the code

kable(MetaboliteData_check$everycolumn)

	Pipecolic.acid.betaine	lysoPC.18.0.	HP143.1179.1.55	lysoPE.22.6.	HN151.0065.4.97	RP225.1482.9.15	X3.4.5.Trimethoxycinnamic.acid	DHA	RP383.1671.10.53	gama.tocopherol	RP431.3516.11.96	RP490.3516.8.99	lysoPC.22.6.	Steroid.glucuronide	RP684.5545.13.41	RP197.0926.3.17	RP303.6468.9.98	RP816.5684.12.52	RP827.5356.11.88	RP826.5442.12.17	RN201.1498.8.23	RN203.0022.3.00	EPA	PC.18.2.15.0.	RN814.5610.12.14	RN153.0190.3.5	RN427.1641.10.59	RN457.1742.10.50	RN830.5845.12.46	RN446.1541.10.59	RN472.1596.10.53
NA_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
NA_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
NaN_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
NaN_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
negative_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
negative_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
zero_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
zero_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
inf_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
inf_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
blank_number	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
blank_percentage	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Show the code

kable(FoodData_check$everycolumn)

	Wholegrains	Sausage	Pizza	Refined.bread	Fruits	Liquor	Wine	Hamburger	Poultry	Fish	Margarine	Cabbage	Carrot	low_fat_Dairy	Vegetables
NA_number	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
NA_percentage	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
NaN_number	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
NaN_percentage	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
negative_number	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
negative_percentage	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
zero_number	63.000	139.000	195.000	155.000	180.00	367.000	288.000	231.000	117.000	79.000	41.000	253.000	225.000	237.000	192.000
zero_percentage	0.063	0.139	0.195	0.155	0.18	0.367	0.288	0.231	0.117	0.079	0.041	0.253	0.225	0.237	0.192
inf_number	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
inf_percentage	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
blank_number	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
blank_percentage	0.000	0.000	0.000	0.000	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

Note

You shall only continue to use the data when the class of variables are correct and the missing or abnormal values in the variable are properly handled.

Build data for analysis

We treat food variables as our exposures and BMI and type 2 diabetes as outcomes. Note that BMI is a continuous variable. We can additionally construct BMI as a categorical outcome using the following criteria:

BMI <= 18.5: Underweight
18.5 <BMI <= 25: Normal
25 < BMI <= 30: Overweight
30 < BMI: Obese

Show the code

BMI <- ClinData[,"BMI"] 

BMI_cat <- cut(BMI,  
               breaks = c(-Inf, 18.5, 25, 30, Inf),  
               labels = c('underweight', 'normal', 'overweight', 'obese'),  
               right = TRUE) 

BMI_cat<-factor(BMI_cat,levels=c("normal","underweight", "overweight", "obese"))

We want to explore the exposure-outcome relationships through the metabolomics data, using metabolites as assumed mediators. Food data serves as exposures, while BMI (both continuous and categorical) and Type 2 Diabetes are treated as outcomes. Gender, age, education, physical activity, and smoking are used as potential confounders for the exposure-mediator and mediator-outcome associations. Additionally, auxiliary information regarding case-control matches, fasting glucose levels, education, and physical activity is provided to enhance the analysis.

Show the code

exposure2 <- FoodData
Omics2 <- MetaboliteData
outcome2 <- cbind.data.frame(ClinData[,c("T2D","BMI")], BMI_cat)
covariates2 <- ClinData[, c("Gender" ,"Age","Education","PhysicalActivityIndex","Smoking" )]
auxiliary2 <- ClinData[, c("PairedInfo","FastingGlucose","Education","PhysicalActivityIndex", "Energy")]

Code example of using TriplotGUI

We will use more complex settings in this example. Please go to Tutorial(simple).if you want to try simpler settings.

Step 1: Data reduction of Omics variables

Using TriplotGUI package, first we perform dimension reduction, using weighted correlation network analysis (WGCNA) on metabolomics data.

Show the code

# Using the complex settings
reduced_Omics2 <- OmicsReduction(dataframe = Omics2,
                                   plotType = c("scree","score","loading","scoreloading"),
                                   pcType = "principal", 
# Users can choose freely to use principal() instead of prcomp() to perform PCA.The result should be very similar, but principal() allows more functionalities
                                   firstPC = 1,
                                   secondPC= 2,
                                   option = "WGCNA",
# In this example we use WGCNA to perform data reduction, therefore specifying `option="WGCNA"`.
                                   scale = TRUE,
                                   center = TRUE,
                                   eigenLoading = "loading",
# If you want to show loading or scale it to eigevalues.
                                   rotate = "none",
# This argument is inherited from the principal() function
                                   sizeVariable = auxiliary2$FastingGlucose,
                                   sizeVariableName = "FastigGlucose",
                                   colorVariable = auxiliary2$Education,
                                   colorVariableName = "Education",
                                   shapeVariable = auxiliary2$PhysicalActivityIndex,
                                   shapeVariableName = "PhysicalActivityIndex",
# The size,shape,color variables are used to differentiate points on the score plot and biplot 
                                   loadingsName = TRUE,
# Show the names of loadings
                                   loadingsCutpercent = 0.2,
# The loadings bigger than 20% of all the loadings will show up
                                   minModuleSize = 3
# The mimimum number of variable in each cluster of WGCNA
)

Note

In WGCNA, Omics variables are grouped into clusters. PCA is performed on each cluster, and the first principal component of each cluster is used as the module for WGCNA.
You could try out the arguments of the OmicsReduction() function to visualize the results and view the whole function here
The minModuleSize argument in OmicsReduction() is crucial as it defines the minimum number of variables in each cluster. Setting this too high may result in all variables being grouped into a single cluster. For instance, with 20 variables and minModuleSize=15, you might end up with one cluster containing all 20 variables. In this case, only the first principal component of this single cluster would be used and displayed on both axes of the resulting triplots.

To avoid this, adjust minModuleSize based on your total variable count and desired cluster granularity. Experiment with different values to achieve a balance between meaningful clusters and interpretable results.

You can see scree plot, score plot, loadings plot and biplot at this stage.

Show the code

reduced_Omics2$scree_plot

plot

The scree plot in WGCNA visualizes the variance explained by each cluster. Clusters are color-coded, with column heights representing the proportion of total variance explained by each cluster. The darker color in each column shows the variance explained by the first principal component (module) of that cluster.

The grey cluster, showing only a darker color, indicates it contains just one variable, which serves as its own first principal component/module. This can occur even with minModuleSize=2, as some variables may be left unclustered.

Clusters are arranged from left to right based on the number of variables they contain, with the leftmost cluster having the most variables and the rightmost having the least. This arrangement helps in quickly identifying the relative sizes and importance of different clusters in the dataset.

Show the code

reduced_Omics2$score_plot

plot

In the score plot, observations are colored by education level, with different shapes representing varying physical activity indices and different sizes indicating fasting glucose levels. This visualization allows for an integrated view of how these characteristics relate to the modules derived from the WGCNA analysis, highlighting patterns among individuals based on their traits.

Show the code

reduced_Omics2$loading_plot

plot

The loading plot displays loadings aligned along the axes, reflecting the use of first principal components/modules from each cluster in WGCNA. This alignment occurs because each axis represents a distinct principal component analysis, making it impossible for loadings to point in the direction of another axis.

Show the code

reduced_Omics2$scoreloading_plot

plot

Note

Sometimes the following text will show up in your console and the figure cannot be directly run:

libpng warning: Image width is zero in IHDR

libpng error: Invalid IHDR data

Warning message:

In dev.off(which) : agg could not write to the given file\

Run gc() in your console to clear the memory. Then rerun the code again to view the figure.

Although informative, the biplot combining both score and loading can appear cluttered due to the high density of information it contains.

A TPObject is constructed based on the data reduction results from the WGCNA analysis. This object serves as a central repository for storing and trandfering information generated through the various steps during the analysis.

Show the code

scores <- reduced_Omics2$object$scores
loadings <- reduced_Omics2$object$loadings
variance <- reduced_Omics2$object$variance
TPObject1 <- makeTPO(scores = scores,
                     loadings = loadings,
                     variance = variance)


Making TriPlotObject (TPO)
--------------------------

Score matrix has 1000 observations and 5 components.

TPO has 0 attached correlations.
TPO has 0 attached risks.

Step 2-3: Exposure-Omics and Omics-Outcome associations

The associations between modules and food items and the associations between modules and risk markers can be then calculated, with confounders adjusted. Note that it is possible for users to skip code regarding exposure association or risk association, if their interests focus on only outcome associations or exposure associations with Omics variables.

The modules information saved in the TPObject can be directly utilized in the exposureAssociation() function. This function calculates exposure associations coefficients and p-values between the modules and exposures (food variables and diet scores), while adjusting for specified confounders.

Show the code

Correlations_object <- exposureAssociation(TPObject = TPObject1,
                                exposure = exposure2,
                                use = 'pairwise', 
                                # use inherit from cor() function 
                                method = 'pearson',
                                # method inherit from cor() function 
                                allowCategorical = FALSE,
                                # one-hot-encoding on categorical exposures
                                partial = TRUE,
                                # perform confonder adjustment
                                confounder = covariates2)

Note

The exposureAssociation() function offers options for handling categorical variables and adjusting for confounders. When setting allowcategorical=FALSE, categorical exposures with more than two classes undergo one-hot-encoding, transforming them into binary variables. This can be verified by examining Correlations_object$cor_estimate, which shows separate correlation coefficients for each binary representation of HFI levels. However, allowcategorical=FALSE is limited for meaningful use. The binary variables may be only meaningful when we want to compare one level to all the other levels together. We use this option for categorical variable "HFI" for demonstration purpose. Ideally, one should use allowcategorical=TRUE for HFI.

Conversely, allowcategorical=TRUE leaves categorical variables like "HFI" untransformed. In this case, linear models regressing modules on "HFI" are used for correlation analysis. The correlation coefficients are derived from the square root of the proportion of HFI’s sum of squares to the total sum of squares.
Setting partial=TRUE ensures that specified confounders are adjusted for in the correlation analysis.

The correlation results obtained from the exposureAssociation() function are then incorporated into the TPObject.

Show the code

TPObject2 <- addExposure(TPObject = TPObject1,
                     corrEstimate = Correlations_object$cor_estimate,
                     corrPvalue = Correlations_object$cor_pvalue)


Adding correlation to TPO
-------------------------

TPO has 23 attached correlations.
TPO has 0 attached risks.

The modules information saved in the TPObject can be directly utilized in the outcomeAssociation() function. This function calculates risk estimates and p-values between the modules and outcomes (BMI, type 2 diabetes and categorical BMI), while adjusting for specified confounders.

Show the code

Risks_object <- outcomeAssociation(TPObject = TPObject2,
                                outcome = outcome2,
                                confounder = covariates2,
                                partial = TRUE,
                                # performs confounder adjustment
                                multinomial = TRUE,
                                # multinomial regression is used
                                pair = auxiliary2$PairedInfo,
                                # use the pairing information of case and control
                                CI = 0.95
                                # confidence interval
                                )

Note

The outcomeAssociation() function offers various options for handling outcomes and adjusting for confounders, with behavior differing based on whether pair information is provided.

Without pair information:
- multinomial=FALSE transforms categorical outcomes with >2 classes into binary variables via one-hot-encoding. Logistic regression is used for binary outcomes, and linear regression for continuous outcomes.
- multinomial=TRUE applies multinomial regression to categorical outcomes with >2 classes, yielding n-1 estimates. Binary and continuous outcomes are handled as before.
With pair information:
- multinomial=FALSE transforms categorical outcomes as above, then uses conditional logistic regression for binary outcomes and linear mixed models for continuous outcomes, incorporating pairing information.
- multinomial=TRUE ignores pairing information, applying multinomial regression to categorical outcomes and standard logistic/linear regression to binary/continuous outcomes.
Setting partial=TRUE ensures adjustment for specified confounders in all analyses.

The results for risk estimates obtained from the outcomeAssociation() function are then incorporated into the TPObject.

Show the code

TPObject3 <- addOutcome(TPObject = TPObject2, 
                     Risk = Risks_object)


Adding risk to TPO
------------------

TPO has 23 attached correlations.
TPO has 5 attached risks.

Step 4: Meet-in-the-middle Triplot Visualization

The TPObject created from all steps can be used to generate a Triplot using the TriplotGUI() function. The function serves as a wrapper for PCA_TriplotGUI() and WGCNA_TriplotGUI(). You can check the functions respectively here: TriplotGUI(), PCA_TriplotGUI(), WGCNA_TriplotGUI().

Show the code

TPO_plots3 <- TriplotGUI(TPObject3,
                              firstPC = 1, ## The first PC to map
                              secondPC = 2, ## The first PC to map
                              plotLoads = TRUE, ##Whether to plot loadings (TRUE; default) or suppress them (FALSE)
                              plotScores = FALSE, ##Whether to plot scores (TRUE) or suppress them (FALSE; default)
                              plotCorr = TRUE, ##Whether to plot correlations (TRUE; default) or suppress them (FALSE)
                              plotRisk = TRUE, ##Whether to plot risk estimates (TRUE; default) or suppress them (FALSE)
                              ###############################################
                              ##For loadings
                              loadLabels = TRUE, ###Whether to plot variable loading labels (TRUE; default) or not (FALSE)
                              loadArrowLength = 0.02, ###Length of arrow tip , set it as 0 if you want to remove it
                              loadCut = 0, ###lower limit Loadings below the cut are plotted in light grey and without label
                              loadLim = NULL, ##higher limit,Plot range for loadings
                              ###############################################
                              ##For correlations
                              colCorr = "blue", ##Color vector for correlations
                              pchCorr = 16, ##Plotting character for correlations
                              whichCorr = NULL, ##Which correlations to plot (vector of numbers)
                              corLim = NULL, ##Plot range for correlations
                              corrLabels = TRUE, 
                              ###############################################
                              ##For risks
                              colRisk = "red", ##Color vector for risk estimates
                              pchRisk = 15, ##Plotting character for risk estimates
                              whichRisk = NULL, ##Which risk estimates to plot (vector of numbers)
                              riskLim = NULL, ##Plot range for risks
                              riskWhiskerPercentage = 0.1, ## whisker length is how many percentage of confidence interval
                              riskLabels = TRUE, size = 3, # the size of points ont he plot
                              riskOR = TRUE
                              # plot the risk estimates as odds ratio
                              )

We then visualize the triplot.

Show the code

TPO_plots3$triplot

plot

In the triplot visualization, black arrows represent loadings of Omics variables, forming patterns in different modules (axes). Blue circle points show exposure correlations with the modules, while red square points indicate outcome associations with the modules.

Note

Note that we use first and second module for our example here. In reality, you may use the checkTPO() function in TriplotGUI to view which modules may be more of your interest (that are associated with specific exposures and outcomes).

Based on the triplot visualization of the TriplotGUI interface, we can interpret the following key points:

The first (top right) and third (bottom left) dimension generally represents a spectrum of dietary healthiness. Healthier foods (e.g., "Fruits", "Vegetables") and higher healthy food indices ("HFI", "BSDS") are positioned on one side in the first dimension, while less healthy foods (e.g., "Hamburger", "Sausage", "Margarine") are on the opposite side in the third dimension.
The second (top left) and fourth (bottom right) dimensions are associated with health outcomes, including the numeric outcome "BMI", binary outcome "T2D", and one-hot-encoded categorical "BMI_cat" outcome.
Explanation for some variables in the plot:
- "HFI" stands for healthy food index. A higher "HFI" suggests healthier diet. "HFI" ranges from 0 to 6. You can see that there are HFI_0, HFI_1…HFI_6 in the figure and that is because "HFI" is used as a factor exposure variable in exposure’s correlation and is one-hot-encoded to the same number of binary variables as its number of levels. For example, HFI_6 is a binary variable where individuals with "HFI" as 6 (most healthy diet people) will be labeled as 1 and the rest of individuals will be labelled as 0. HFI_0 is a binary variable where individuals with "HFI" as 0 (most unhealthy diet people) will be labeled as 1 and the rest of individuals will be labelled as 0. HFI_1…HFI_5 has limited use since it only separates the group of people with certain "HFI" value from to the rest, which is a mixture of people with healthier and less healthier diet.
- "BSDS" stands for Baltic Sea Diet Score ranging. A higher "BSDS" suggests healthier diet. "BSDS" is used as a numeric variable ranging from 2 to 25 in the correlation.
- BMI_cat_obese, BMI_cat_overweight, BMI_cat_underweight are the odds ratio generated from the multinomial regression, since "BMI_cat" is a categorical outcome and multinomial regression is performed in outcome’s association in this example. Normal weight is uses as a reference and odds ratio of obese, overweight and underweight is produced. (Note that the first level of the factor variable is set as the default reference, in this case it is normal).
Adjusting for "Age", "Gender", "Smoking", "Education" and "FastingGlucose", the first module (on x-axis) correlates positively with generally healthy food (e.g. "fruits", "vegatables") and food index (e.g. "BSDS") and negatively with unhealthy food (e.g. "Hamburger", "pizza"). It also reversely associated with "T2D" and "BMI". From the result from "BMI_cat", this components also reversely associated with being obese.
Adjusting for "Age", "Gender", "Smoking", "Education" and "FastingGlucose", the second module (on y-axis) associated positively with generally healthy food (e.g. "fruits", "vegatables") and food index (e.g. "BSDS") and negatively with unhealthy food (e.g. "Hamburger", "pizza"). However, the module associated also positively with "T2D","BMI", obese, overweight and negatively associated with underweight.

Based on what we summarized above, the two modules may be associated with a overall healthy eating pattern. And in this healthy eating pattern, the first module is associated more with the beneficial effect on health and the second module represents the adverse health effect.

Step 5: Mediation analysis and visualization

In mediation analysis in TriplotGUI, we focus on a specific set of variables for mediation analysis: we select "BSDS" (Baltic Sea Diet Score) and "Hamburger" as our exposures, "BMI" and "T2D" as outcome to enter our next step. We choose first and second modules as our mediators.

This time, we perform the mediation analysis with the getMediationCounterfactual() function, using the counterfactual/potential outcome method implemented through the mediation package in R on our exposures (i.e. "BSDS", "Hamburger"), mediators (i.e. the first and second modules) and outcome (i.e. "T2D", "BMI") of interest, adjusting for "Age", "Gender", "Smoking", "Education" and "FastingGlucose" for both exposure-mediator and mediator-outcome relationship.

Show the code

mediation_object3 <- 
  getMediationCounterfactual (mediator = TPObject3$scores[,c(1:2),drop = FALSE], 
                             # Specfiying at least 2 components so that there can be a 2-dimensional plot
                               exposure=exposure2[,c("Hamburger","BSDS"),drop = FALSE],
                             outcome = outcome2[,c("BMI","T2D"),drop = FALSE],
                             exposureTreatment = c(25,0),
                             exposureControl = c(2,19),
                             confounderME = covariates2,
                             confounderOE = covariates2)

Note

In the counterfactual mediation analysis, contrast values of treatment and controls for each exposure needs to be specified. For each individual, the analysis considers what would happen if they received the treatment versus if they received the control. In brief, the algorithm uses these contrast values to calculate the potential outcomes under each scenario to estimate the direct and indirect effects. For a detailed explanation please refer to this paper of the mediation R package.

If an exposure variable is continuous (numeric variable), it is recommended that the treatment and control contrast values are chosen between the range of the exposure variable. If an exposure variable is categorical (factor variable), the treatment and control contrast values should be chosen from the levels of the exposure variable.

Then we can also Look at the barplot for mediations, which provides a clear visual representation of the direct, indirect, and total effects for each combination of exposure, mediator (principal component), and outcome. This visualization allows for a more intuitive understanding of the mediation analysis results.

Show the code

mediation_plots3 <- mediationBarplot(mediationObject = mediation_object3,
                                         cex = 2#,
                                         # size of the text
                                         #by_row = "one_column"
                                         )

Show the code

mediation_plots3

plot

Note

The barplot displays direct, indirect, and total effects for each exposure-mediator-outcome combination, offering a convenient tool to assess the direction and magnitude of mediation estimates.

Color coding enhances interpretation: red indicates significant positive effects (p<0.05), cyan shows significant negative effects (p<0.05), and grey represents insignificant effects.
Significance levels are denoted by stars: one star for p<0.05, two stars for p<0.01, and three stars for p<0.001.

We will make an example interpretation of the mediation barplot using BSDS as exposure, module1 as mediator (the module on the x axis in the triplot in step 4) and BMI as outcome.

In the shown mediation barplot, we can observe the direction and magnitude of the indirect effect and proportion mediated. We observed that with all effects significant, indirect effect is negative, direct and total effect is positive. The negative indirect effect is consistent with the its visualization in the triplot, i.e, the module1 on the x axis associative with BSDS positively and with BMI negatively.

However, the figure also shows that BSDS is positively associated with BMI through a larger direct effect. This seems counterintuitive but this could be due to a true direct effect or indirect effect mediated by other modules.

Based on the mediation barplot, we’ve also observed significant proportion mediated with for BSDS-module1-BMI mediations, suggesting a robust mediation effect.

Note

\[PM=\frac{IE}{IE+DE}\]

\[APM=\frac{IE}{|IE|+|DE|}\cdot\frac{|IE+DE|}{IE+DE}=\frac{IE}{IE+DE}\cdot\frac{|IE+DE|}{|IE|+|DE|}=\frac{IE}{|IE|+|DE|}\cdot\frac{IE+DE}{|IE+DE|}\]

Notably, the APM for BSDS (Baltic Sea Diet Score) as the exposure and T2D (Type 2 Diabetes) as the outcome is smaller than their corresponding PM (considering the abosolute value). This difference occurs because the indirect and direct effects for this mediation pathway are in opposite directions. This opposition leads to a smaller total effect and consequently a larger PM (considering the abosolute value). However, the APM is not affected by the directionality of the direct and indirect effects, providing a more stable measure of mediation.

The APM is always smaller than 1, even when direct and indirect effects are in opposite directions. This is due to the calculation method of APM, which uses the sum of the absolute values of indirect and direct effects as the denominator.

To further explore the indirect, direct, and total effects of different combinations of mediations, you can run more mediation analysis at Step 5 in the TriplotGUI interface .

Add the mediation result from getMediationCounterfactual() into the TPObject.

Show the code

TPObject4 <- addMediation(TPObject = TPObject3, 
                          mediationObject = mediation_object3)


Adding mediation to TPO
-------------------------

TPO mediation has used 2 mediators, 2 exposures, 2 outcomes
Mediators are module1, module2 
Exposures are Hamburger, BSDS 
Outcomes are BMI, T2D

Comparative Visualization

The checkTPO() function in TriplotGUI allows users to generate heatmaps from the TPObject to visualize correlations, risk estimations, and mediation results.

Show the code

checkTPObject4 <- checkTPO(TPObject4,
                                  mediatorEstimates = c("IE","TE","PM","APM"),
                                  ## What type of mediation estimate or proportion mediated to show on the heatmap
                                  mediators = c("module1", "module2"),
                                  ## What mediators (principal component) you want to show
                                  exposures = c("Hamburger","BSDS"),
                                  ## What exposure variables you want to show
                                  outcomes = c("T2D","BMI")
                                  ## What outcome variables you want to show
                                  )

Correlation coefficients and p values between modules and exposures:

Show the code

checkTPObject4$corr_coefficients

plot

Risk estimates(beta coefficients) and p values between modules and outcomes:

Show the code

checkTPObject4$risk_coefficients

plot

Mediation estimates and p values for indirect effect(IE) and total effect(TE), proportion mediated(PM) and adjusted proportion mediated(APM): Read Manual for more details:

Show the code

checkTPObject4$med_coefficients

plot

Note

Significance is indicated as follows. One star for p<0.05; Two stars for p<0.01; Three stars for p<0.001.
The first two plots, correlation coefficients with exposures and risk estimates of outcomes will show all the exposure and outcome variables that are not removed from visualization.
Note that in the plot of checkTPbject4$med_coefficients, nly the selected exposures, mediators and outcomes that we use to do mediation on will show up in the mediation results. The rows in the heatmaps are mediators and the column represents exposure-outcome pairs. For each exposure-outcome pair, 4 results are shown: IE (indirect effect), TE (total effect), PM(Proportion Mediated) and APM(Adjusted Proportion Mediated). Please read Manual for more information. The “CF” before the names of exposures-outcome pairs means that this mediation is using the Counterfactual method.

Data download

To save all your output, including data, results, and visualization output as an .rda file, you can use the save() function as an rda file.

Show the code

save(exposure2,Omics2,outcome2,covariates2,auxilariy2,
     reduced_Omics2,Correlations_object,Risks_object,
     mediation_object3,mediation_plots3,
     TPObject1,TPObject2,TPObject3,TPObject4,
     checkTPObject4,
     "Tutorial_complex_output.rda")

--- title: "Tutorial(complex) - Code" filters: - lightbox lightbox: auto subtitle: "Complex demo of using TriplotGUI's code" title-block-banner: true title-block-banner-color: white title-block-categories: false page-layout: full format: html: margin-top: 0em margin-bottom: 0em padding-top: 0em padding-bottom: 0em minimal: true smooth-scroll: true fig-responsive: true toc-location: right toc-depth: 5 toc-title: Tutorial #number-sections: true ## add number in the session or not #number-depth: 4 code-fold: show code-summary: "Show the code" code-overflow: wrap code-tools: true code-copy: true highlight: tango df-print: paged ## kable is another option How the dataframe looks like standalone: false ### specifies if all assets and libraries must be integrated into the output html file as a standalone document. fig-align: right theme: solar geometry: margin=lin --- ```{r, include=FALSE} ##This is a function that combines html and rmarkdown. It aims to generate a button that could hide output or plot (fold.output=TRUE or fold.plot=TRUE). hooks = knitr::knit_hooks$get() hook_foldable = function(type) { force(type) function(x, options) { res = hooks[[type]](x, options) if (!isTRUE(options[[paste0("fold.", type)]])) return(res) paste0( "<details><summary>", type, "</summary>\n\n", res, "\n\n</details>" ) } } knitr::knit_hooks$set( output = hook_foldable("output"), plot = hook_foldable("plot") ) ``` The code at this page can be downloaded [here](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/Tutorial_code/HealthyNordicDiet_2_applying_TriplotGUI_code.R){target="_blank"} Before starting, please ensure you have installed the TriplotGUI package following the [Setup](installation.qmd){target="_blank" style="color:blue"} instructions. # Data description The data used in this tutorial section can be downloaded [here](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/data/HealthyNordicDiet_2.rda){target="_blank"} in rda format. This example utilizes ***HealthyNordicDiet_2***, which is a modified version of the **HealthyNordicDiet** data provided by the [*Triplot*](https://gitlab.com/CarlBrunius/triplot/-/blob/master/vignettes/Triplot_Tutorial.md?ref_type=heads){target="_blank"} package (Schillemans et al. 2019). The **HealthyNordicDiet** is a synthetic dataset that simulates data from a case-control study within the Swedish Västerbotten Intervention Programme cohort, exploring plasma metabolites reflecting healthy Nordic dietary patterns and their association with type 2 diabetes development (Shi et al. 2018). The simulated ***HealthyNordicDiet_2*** data consists of three data frames: - Clinical measurements (ClinData): Contains 11 variables: `"ID"`,`"T2D"`(Cases and controls for type 2 diabetes), `"PairedInfo"`(Pairing/matching information for case and controls, a unique number is assigned to each case-control pairs),`"Energy"`,`"Gender"`,`"Age"`,`"BMI"`,`"Smoking"`,`"PhysicalActivityIndex"`,`"Education"`,`"FastingGlucose"`.\ - Plasma metabolites predictive of BMI (MetaboliteData): Comprises 20 variables - Dietary intake as measured by food frequency questionnaires (FoodData): Includes 11 variables: `"HFI"`(Healthy Food Index),`"BSDS"`(the Baltic Sea Diet Score),`"Wholegrains"`,`"Sausage"`,`"Pizza"`,`"Refined.bread"`, `"Fruits"`,`"Liquor"`,`"Wine"`,`"Hamburger"`,`"Poultry"`,`"Fish"`,`"Margarine"`,`"Cabbage"`,`"Carrot"`,`"low_fat_Dairy"`,`"Vegetables"` The data frames are row-wise matched by observations and consist of 1000 synthetic observations. **Reference** Shi, L., Brunius, C., Johansson, I., Bergdahl, I. A., Lindahl, B., Hanhineva, K., & Landberg, R. (2018). Plasma metabolites associated with healthy Nordic dietary indexes and risk of type 2 diabetes—a nested case-control study in a Swedish population. *The American Journal of Clinical Nutrition*, 108(3), 564-575. [10.1093/ajcn/nqy145](https://www.sciencedirect.com/science/article/pii/S0002916522029628?via%3Dihub){target="_blank"} Schillemans T, Shi L, Liu X, Åkesson A, Landberg R, Brunius C. Visualization and Interpretation of Multivariate Associations with Disease Risk Markers and Disease Risk-The Triplot. *Metabolites*. 2019 Jul 6;9(7):133. [doi: 10.3390/metabo9070133](https://www.mdpi.com/2218-1989/9/7/133){target="_blank"} # Research question We want to assess the relationship between diet, metabolic profiles and Type 2 diabetes and BMI. We assume that metabolites serves as mediators, mediating the effect of dietary exposures on Type 2 diabetes and BMI. # Data exploration ***HealthyNordicDiet_2*** is loaded in the R environment upon running `library(TriplotGUI)`. Let's begin with some data exploration to understand the structure and contents of the dataset: ```{r noshowcode_runcode_noshowoutput_1} #| eval: true #| echo: false #| output: false #| classes: styled-output library(TriplotGUI) ``` ## Check HealthyNordicDiet_2 Check the ***HealthyNordicDiet_2*** list: ```{r showcode_runcode_showoutput_1} #| eval: true #| echo: true #| output: true #| classes: styled-output class(HealthyNordicDiet_2) names(HealthyNordicDiet_2) ``` ## Check datasets Check the names of variables in each data: ```{r showcode_runcode_showoutput_2} #| eval: true #| echo: true #| output: true #| classes: styled-output colnames(HealthyNordicDiet_2$ClinData) colnames(HealthyNordicDiet_2$MetaboliteData) colnames(HealthyNordicDiet_2$FoodData) ``` ## Check variables' class We then transform the data to dataframe format and use TriplotGUI's `checkdata()` function to examine the classes of variables. ```{r showcode_runcode_noshowoutput_1} #| eval: true #| echo: true #| output: false #| classes: styled-output ClinData <- as.data.frame(HealthyNordicDiet_2$ClinData) MetaboliteData <- as.data.frame(HealthyNordicDiet_2$MetaboliteData) FoodData <- as.data.frame(HealthyNordicDiet_2$FoodData) ClinData_check <- checkdata(ClinData) MetaboliteData_check <- checkdata(MetaboliteData) FoodData_check <- checkdata(FoodData) ``` ```{r showcode_runcode_showoutput_3} #| eval: true #| echo: true #| output: true #| classes: styled-output ClinData_check$class_summary_statistics MetaboliteData_check$class_summary_statistics FoodData_check$class_summary_statistics ``` ::: callout-note Note that in the given data, `"HFI"`(Healthy Food Index) is a factor variable and `"BSDS"`(the Baltic Sea Diet Score) is a numeric variable. You could use `table(FoodData$HFI)` and `table(FoodData$BSDS)` to check the difference. ::: ## Check sanities for variables To check for missing (NA) or abnormal values (e.g., NaN, negative values, infinite values, blank values) in each variable, you can use the `checkdata()` function. This function generates a summary table for each data frame, showing the number of observations containing NA, NaN, zero, negative, infinite (Inf), and blank values, along with their percentages. ```{r showcode_runcode_noshowoutput_1_1} #| eval: true #| echo: true #| output: false #| classes: styled-output #| if (!require(kableExtra)) { install.packages("kableExtra") # Install the package if it's not available. } library(kableExtra) # Load it after installation. ``` ```{r showcode_runcode_showoutput_4} #| eval: true #| echo: true #| output: true #| classes: styled-output kable(ClinData_check$everycolumn) kable(MetaboliteData_check$everycolumn) kable(FoodData_check$everycolumn) ``` ::: callout-note You shall only continue to use the data when the class of variables are correct and the missing or abnormal values in the variable are properly handled. ::: ## Build data for analysis We treat food variables as our exposures and BMI and type 2 diabetes as outcomes. Note that BMI is a continuous variable. We can additionally construct BMI as a categorical outcome using the following criteria: - BMI \<= 18.5: Underweight - 18.5 \<BMI \<= 25: Normal - 25 \< BMI \<= 30: Overweight - 30 \< BMI: Obese ```{r showcode_runcode_showoutput_4_1} #| eval: true #| echo: true #| output: true #| classes: styled-output BMI <- ClinData[,"BMI"] BMI_cat <- cut(BMI, breaks = c(-Inf, 18.5, 25, 30, Inf), labels = c('underweight', 'normal', 'overweight', 'obese'), right = TRUE) BMI_cat<-factor(BMI_cat,levels=c("normal","underweight", "overweight", "obese")) ``` We want to explore the exposure-outcome relationships through the metabolomics data, using metabolites as assumed mediators. Food data serves as exposures, while BMI (both continuous and categorical) and Type 2 Diabetes are treated as outcomes. Gender, age, education, physical activity, and smoking are used as potential confounders for the exposure-mediator and mediator-outcome associations. Additionally, auxiliary information regarding case-control matches, fasting glucose levels, education, and physical activity is provided to enhance the analysis. ```{r showcode_runcode_showoutput_5} #| eval: true #| echo: true #| output: true #| classes: styled-output exposure2 <- FoodData Omics2 <- MetaboliteData outcome2 <- cbind.data.frame(ClinData[,c("T2D","BMI")], BMI_cat) covariates2 <- ClinData[, c("Gender" ,"Age","Education","PhysicalActivityIndex","Smoking" )] auxiliary2 <- ClinData[, c("PairedInfo","FastingGlucose","Education","PhysicalActivityIndex", "Energy")] ``` # Code example of using TriplotGUI We will use more complex settings in this example. Please go to [Tutorial(simple)](simple_code.qmd){target="_blank"}.if you want to try simpler settings. ## Step 1: Data reduction of Omics variables {#simple_link} Using `TriplotGUI` package, first we perform dimension reduction, using weighted correlation network analysis (WGCNA) on metabolomics data. ```{r showcode_runcode_noshowoutput_2} #| eval: true #| echo: true #| output: false #| classes: styled-output # Using the complex settings reduced_Omics2 <- OmicsReduction(dataframe = Omics2, plotType = c("scree","score","loading","scoreloading"), pcType = "principal", # Users can choose freely to use principal() instead of prcomp() to perform PCA.The result should be very similar, but principal() allows more functionalities firstPC = 1, secondPC= 2, option = "WGCNA", # In this example we use WGCNA to perform data reduction, therefore specifying `option="WGCNA"`. scale = TRUE, center = TRUE, eigenLoading = "loading", # If you want to show loading or scale it to eigevalues. rotate = "none", # This argument is inherited from the principal() function sizeVariable = auxiliary2$FastingGlucose, sizeVariableName = "FastigGlucose", colorVariable = auxiliary2$Education, colorVariableName = "Education", shapeVariable = auxiliary2$PhysicalActivityIndex, shapeVariableName = "PhysicalActivityIndex", # The size,shape,color variables are used to differentiate points on the score plot and biplot loadingsName = TRUE, # Show the names of loadings loadingsCutpercent = 0.2, # The loadings bigger than 20% of all the loadings will show up minModuleSize = 3 # The mimimum number of variable in each cluster of WGCNA ) ``` ::: callout-note - In WGCNA, Omics variables are grouped into clusters. PCA is performed on each cluster, and the first principal component of each cluster is used as the module for WGCNA. - You could try out the arguments of the `OmicsReduction()` function to visualize the results and view the whole function [here](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/R/PCAorWGCNA_plots.R?ref_type=heads){target="_blank"} - The `minModuleSize` argument in `OmicsReduction()` is crucial as it defines the minimum number of variables in each cluster. Setting this too high may result in all variables being grouped into a single cluster. For instance, with 20 variables and `minModuleSize=15`, you might end up with one cluster containing all 20 variables. In this case, only the first principal component of this single cluster would be used and displayed on both axes of the resulting triplots. To avoid this, adjust `minModuleSize` based on your total variable count and desired cluster granularity. Experiment with different values to achieve a balance between meaningful clusters and interpretable results. ::: You can see scree plot, score plot, loadings plot and biplot at this stage. ```{r showcode_runcode_showoutput_6A} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" reduced_Omics2$scree_plot ``` The scree plot in WGCNA visualizes the variance explained by each cluster. Clusters are color-coded, with column heights representing the proportion of total variance explained by each cluster. The darker color in each column shows the variance explained by the first principal component (module) of that cluster. The grey cluster, showing only a darker color, indicates it contains just one variable, which serves as its own first principal component/module. This can occur even with `minModuleSize=2`, as some variables may be left unclustered. Clusters are arranged from left to right based on the number of variables they contain, with the leftmost cluster having the most variables and the rightmost having the least. This arrangement helps in quickly identifying the relative sizes and importance of different clusters in the dataset. ```{r showcode_runcode_showoutput_6B} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" reduced_Omics2$score_plot ``` In the score plot, observations are colored by education level, with different shapes representing varying physical activity indices and different sizes indicating fasting glucose levels. This visualization allows for an integrated view of how these characteristics relate to the modules derived from the WGCNA analysis, highlighting patterns among individuals based on their traits. ```{r showcode_runcode_showoutput_6C} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" reduced_Omics2$loading_plot ``` The loading plot displays loadings aligned along the axes, reflecting the use of first principal components/modules from each cluster in WGCNA. This alignment occurs because each axis represents a distinct principal component analysis, making it impossible for loadings to point in the direction of another axis. ```{r showcode_runcode_showoutput_6D} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" reduced_Omics2$scoreloading_plot ``` ::: callout-note Sometimes the following text will show up in your console and the figure cannot be directly run: `libpng warning: Image width is zero in IHDR` `libpng error: Invalid IHDR data` `Warning message:` `In dev.off(which) : agg could not write to the given file\` Run `gc()` in your console to clear the memory. Then rerun the code again to view the figure. ::: Although informative, the biplot combining both score and loading can appear cluttered due to the high density of information it contains. A TPObject is constructed based on the data reduction results from the WGCNA analysis. This object serves as a central repository for storing and trandfering information generated through the various steps during the analysis. ```{r showcode_runcode_showoutput_7} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.output: false scores <- reduced_Omics2$object$scores loadings <- reduced_Omics2$object$loadings variance <- reduced_Omics2$object$variance TPObject1 <- makeTPO(scores = scores, loadings = loadings, variance = variance) ``` ## Step 2-3: Exposure-Omics and Omics-Outcome associations The associations between modules and food items and the associations between modules and risk markers can be then calculated, with confounders adjusted. Note that it is possible for users to skip code regarding exposure association or risk association, if their interests focus on only outcome associations or exposure associations with Omics variables. The modules information saved in the TPObject can be directly utilized in the `exposureAssociation()` function. This function calculates exposure associations coefficients and p-values between the modules and exposures (food variables and diet scores), while adjusting for specified confounders. ```{r showcode_runcode_noshowoutput_3} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.output: false Correlations_object <- exposureAssociation(TPObject = TPObject1, exposure = exposure2, use = 'pairwise', # use inherit from cor() function method = 'pearson', # method inherit from cor() function allowCategorical = FALSE, # one-hot-encoding on categorical exposures partial = TRUE, # perform confonder adjustment confounder = covariates2) ``` ::: callout-note - The `exposureAssociation()` function offers options for handling categorical variables and adjusting for confounders. When setting `allowcategorical=FALSE`, categorical exposures with more than two classes undergo one-hot-encoding, transforming them into binary variables. This can be verified by examining `Correlations_object$cor_estimate`, which shows separate correlation coefficients for each binary representation of HFI levels. However, `allowcategorical=FALSE` is limited for meaningful use. The binary variables may be only meaningful when we want to compare one level to all the other levels together. We use this option for categorical variable `"HFI"` for demonstration purpose. Ideally, one should use `allowcategorical=TRUE` for HFI. Conversely, `allowcategorical=TRUE` leaves categorical variables like `"HFI"` untransformed. In this case, linear models regressing modules on `"HFI"` are used for correlation analysis. The correlation coefficients are derived from the square root of the proportion of HFI's sum of squares to the total sum of squares. - Setting `partial=TRUE` ensures that specified confounders are adjusted for in the correlation analysis. ::: The correlation results obtained from the `exposureAssociation()` function are then incorporated into the TPObject. ```{r showcode_runcode_showoutput_8} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.output: false TPObject2 <- addExposure(TPObject = TPObject1, corrEstimate = Correlations_object$cor_estimate, corrPvalue = Correlations_object$cor_pvalue) ``` The modules information saved in the TPObject can be directly utilized in the `outcomeAssociation()` function. This function calculates risk estimates and p-values between the modules and outcomes (BMI, type 2 diabetes and categorical BMI), while adjusting for specified confounders. ```{r showcode_runcode_noshowoutput_4} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.output: false Risks_object <- outcomeAssociation(TPObject = TPObject2, outcome = outcome2, confounder = covariates2, partial = TRUE, # performs confounder adjustment multinomial = TRUE, # multinomial regression is used pair = auxiliary2$PairedInfo, # use the pairing information of case and control CI = 0.95 # confidence interval ) ``` ::: callout-note The `outcomeAssociation()` function offers various options for handling outcomes and adjusting for confounders, with behavior differing based on whether pair information is provided. - Without `pair` information: - `multinomial=FALSE` transforms categorical outcomes with \>2 classes into binary variables via one-hot-encoding. Logistic regression is used for binary outcomes, and linear regression for continuous outcomes. - `multinomial=TRUE` applies multinomial regression to categorical outcomes with \>2 classes, yielding n-1 estimates. Binary and continuous outcomes are handled as before. - With pair information: - `multinomial=FALSE` transforms categorical outcomes as above, then uses conditional logistic regression for binary outcomes and linear mixed models for continuous outcomes, incorporating pairing information. - `multinomial=TRUE` ignores pairing information, applying multinomial regression to categorical outcomes and standard logistic/linear regression to binary/continuous outcomes. - Setting `partial=TRUE` ensures adjustment for specified confounders in all analyses. ::: The results for risk estimates obtained from the `outcomeAssociation()` function are then incorporated into the TPObject. ```{r showcode_runcode_showoutput_9} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.output: false TPObject3 <- addOutcome(TPObject = TPObject2, Risk = Risks_object) ``` ## Step 4: Meet-in-the-middle Triplot Visualization The TPObject created from all steps can be used to generate a Triplot using the `TriplotGUI()` function. The function serves as a wrapper for `PCA_TriplotGUI()` and `WGCNA_TriplotGUI()`. You can check the functions respectively here: [`TriplotGUI()`](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/R/TriplotGUI.R?ref_type=heads){target="_blank"}, [`PCA_TriplotGUI()`](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/R/PCA_TriplotGUI.R?ref_type=heads){target="_blank"}, [`WGCNA_TriplotGUI()`](https://gitlab.com/YingxiaoYan/triplotgui/-/blob/main/R/WGCNA_TriplotGUI.R?ref_type=heads){target="_blank"}. ```{r showcode_runcode_noshowoutput_5} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.output: false TPO_plots3 <- TriplotGUI(TPObject3, firstPC = 1, ## The first PC to map secondPC = 2, ## The first PC to map plotLoads = TRUE, ##Whether to plot loadings (TRUE; default) or suppress them (FALSE) plotScores = FALSE, ##Whether to plot scores (TRUE) or suppress them (FALSE; default) plotCorr = TRUE, ##Whether to plot correlations (TRUE; default) or suppress them (FALSE) plotRisk = TRUE, ##Whether to plot risk estimates (TRUE; default) or suppress them (FALSE) ############################################### ##For loadings loadLabels = TRUE, ###Whether to plot variable loading labels (TRUE; default) or not (FALSE) loadArrowLength = 0.02, ###Length of arrow tip , set it as 0 if you want to remove it loadCut = 0, ###lower limit Loadings below the cut are plotted in light grey and without label loadLim = NULL, ##higher limit,Plot range for loadings ############################################### ##For correlations colCorr = "blue", ##Color vector for correlations pchCorr = 16, ##Plotting character for correlations whichCorr = NULL, ##Which correlations to plot (vector of numbers) corLim = NULL, ##Plot range for correlations corrLabels = TRUE, ############################################### ##For risks colRisk = "red", ##Color vector for risk estimates pchRisk = 15, ##Plotting character for risk estimates whichRisk = NULL, ##Which risk estimates to plot (vector of numbers) riskLim = NULL, ##Plot range for risks riskWhiskerPercentage = 0.1, ## whisker length is how many percentage of confidence interval riskLabels = TRUE, size = 3, # the size of points ont he plot riskOR = TRUE # plot the risk estimates as odds ratio ) ``` We then visualize the triplot. ```{r showcode_runcode_showoutput_10} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" TPO_plots3$triplot ``` In the triplot visualization, black arrows represent loadings of Omics variables, forming patterns in different modules (axes). Blue circle points show exposure correlations with the modules, while red square points indicate outcome associations with the modules. ::: callout-note Note that we use first and second module for our example here. In reality, you may use the `checkTPO()` function in TriplotGUI to view which modules may be more of your interest (that are associated with specific exposures and outcomes). ::: Based on the triplot visualization of the TriplotGUI interface, we can interpret the following key points: - The first (top right) and third (bottom left) dimension generally represents a spectrum of dietary healthiness. Healthier foods (e.g., `"Fruits"`, `"Vegetables"`) and higher healthy food indices (`"HFI"`, `"BSDS"`) are positioned on one side in the first dimension, while less healthy foods (e.g., `"Hamburger"`, `"Sausage"`, `"Margarine"`) are on the opposite side in the third dimension. - The second (top left) and fourth (bottom right) dimensions are associated with health outcomes, including the numeric outcome `"BMI"`, binary outcome `"T2D"`, and one-hot-encoded categorical `"BMI_cat"` outcome. - Explanation for some variables in the plot: - `"HFI"` stands for healthy food index. A higher `"HFI"` suggests healthier diet. `"HFI"` ranges from 0 to 6. You can see that there are *HFI_0*, *HFI_1*...*HFI_6* in the figure and that is because `"HFI"` is used as a factor exposure variable in exposure's correlation and is one-hot-encoded to the same number of binary variables as its number of levels. For example, *HFI_6* is a binary variable where individuals with `"HFI"` as 6 (most healthy diet people) will be labeled as 1 and the rest of individuals will be labelled as 0. *HFI_0* is a binary variable where individuals with `"HFI"` as 0 (most unhealthy diet people) will be labeled as 1 and the rest of individuals will be labelled as 0. *HFI_1*...*HFI_5* has limited use since it only separates the group of people with certain `"HFI"` value from to the rest, which is a mixture of people with healthier and less healthier diet. - `"BSDS"` stands for Baltic Sea Diet Score ranging. A higher `"BSDS"` suggests healthier diet. `"BSDS"` is used as a numeric variable ranging from 2 to 25 in the correlation. - *BMI_cat_obese*, *BMI_cat_overweight*, *BMI_cat_underweight* are the odds ratio generated from the multinomial regression, since `"BMI_cat"` is a categorical outcome and multinomial regression is performed in outcome's association in this example. Normal weight is uses as a reference and odds ratio of obese, overweight and underweight is produced. (Note that the first level of the factor variable is set as the default reference, in this case it is *normal*). - Adjusting for `"Age"`, `"Gender"`, `"Smoking"`, `"Education"` and `"FastingGlucose"`, the first module (on x-axis) correlates positively with generally healthy food (e.g. `"fruits"`, `"vegatables"`) and food index (e.g. `"BSDS"`) and negatively with unhealthy food (e.g. `"Hamburger"`, `"pizza"`). It also reversely associated with `"T2D"` and `"BMI"`. From the result from `"BMI_cat"`, this components also reversely associated with being obese. - Adjusting for `"Age"`, `"Gender"`, `"Smoking"`, `"Education"` and `"FastingGlucose"`, the second module (on y-axis) associated positively with generally healthy food (e.g. `"fruits"`, `"vegatables"`) and food index (e.g. `"BSDS"`) and negatively with unhealthy food (e.g. `"Hamburger"`, `"pizza"`). However, the module associated also positively with `"T2D"`,`"BMI"`, *obese*, *overweight* and negatively associated with *underweight*. Based on what we summarized above, the two modules may be associated with a overall healthy eating pattern. And in this healthy eating pattern, the first module is associated more with the beneficial effect on health and the second module represents the adverse health effect. ## Step 5: Mediation analysis and visualization In mediation analysis in TriplotGUI, we focus on a specific set of variables for mediation analysis: we select `"BSDS"` (Baltic Sea Diet Score) and `"Hamburger"` as our exposures, `"BMI"` and `"T2D"` as outcome to enter our next step. We choose first and second modules as our mediators. This time, we perform the mediation analysis with the `getMediationCounterfactual()` function, using the counterfactual/potential outcome method implemented through the `mediation` package in R on our exposures (i.e. `"BSDS"`, `"Hamburger"`), mediators (i.e. the first and second modules) and outcome (i.e. `"T2D"`, `"BMI"`) of interest, adjusting for `"Age"`, `"Gender"`, `"Smoking"`, `"Education"` and `"FastingGlucose"` for both exposure-mediator and mediator-outcome relationship. ```{r showcode_runcode_noshowoutput_6} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.output: false mediation_object3 <- getMediationCounterfactual (mediator = TPObject3$scores[,c(1:2),drop = FALSE], # Specfiying at least 2 components so that there can be a 2-dimensional plot exposure=exposure2[,c("Hamburger","BSDS"),drop = FALSE], outcome = outcome2[,c("BMI","T2D"),drop = FALSE], exposureTreatment = c(25,0), exposureControl = c(2,19), confounderME = covariates2, confounderOE = covariates2) ``` ::: callout-note In the counterfactual mediation analysis, contrast values of treatment and controls for each exposure needs to be specified. For each individual, the analysis considers what would happen if they received the treatment versus if they received the control. In brief, the algorithm uses these contrast values to calculate the potential outcomes under each scenario to estimate the direct and indirect effects. For a detailed explanation please refer to [this paper](https://www.jstatsoft.org/article/view/v059i05){target="_blank"} of the `mediation` R package. If an exposure variable is continuous (numeric variable), it is recommended that the treatment and control contrast values are chosen between the range of the exposure variable. If an exposure variable is categorical (factor variable), the treatment and control contrast values should be chosen from the levels of the exposure variable. ::: Then we can also Look at the barplot for mediations, which provides a clear visual representation of the direct, indirect, and total effects for each combination of exposure, mediator (principal component), and outcome. This visualization allows for a more intuitive understanding of the mediation analysis results. ```{r showcode_runcode_noshowoutput_7} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.plot: true mediation_plots3 <- mediationBarplot(mediationObject = mediation_object3, cex = 2#, # size of the text #by_row = "one_column" ) ``` ```{r showcode_runcode_showoutput_11} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" mediation_plots3 ``` ::: callout-note The barplot displays direct, indirect, and total effects for each exposure-mediator-outcome combination, offering a convenient tool to assess the direction and magnitude of mediation estimates. - Color coding enhances interpretation: red indicates significant positive effects (p\<0.05), cyan shows significant negative effects (p\<0.05), and grey represents insignificant effects. - Significance levels are denoted by stars: one star for p\<0.05, two stars for p\<0.01, and three stars for p\<0.001. ::: We will make an example interpretation of the mediation barplot using BSDS as exposure, module1 as mediator (the module on the x axis in the triplot in step 4) and BMI as outcome. In the shown mediation barplot, we can observe the direction and magnitude of the indirect effect and proportion mediated. We observed that with all effects significant, indirect effect is negative, direct and total effect is positive. The negative indirect effect is consistent with the its visualization in the triplot, i.e, the module1 on the x axis associative with BSDS positively and with BMI negatively. However, the figure also shows that BSDS is positively associated with BMI through a larger direct effect. This seems counterintuitive but this could be due to a true direct effect or indirect effect mediated by other modules. Based on the mediation barplot, we've also observed significant proportion mediated with for BSDS-module1-BMI mediations, suggesting a robust mediation effect.  ::: callout-note  $$PM=\frac{IE}{IE+DE}$$ $$APM=\frac{IE}{|IE|+|DE|}\cdot\frac{|IE+DE|}{IE+DE}=\frac{IE}{IE+DE}\cdot\frac{|IE+DE|}{|IE|+|DE|}=\frac{IE}{|IE|+|DE|}\cdot\frac{IE+DE}{|IE+DE|}$$ - Notably, the APM for BSDS (Baltic Sea Diet Score) as the exposure and T2D (Type 2 Diabetes) as the outcome is smaller than their corresponding PM (considering the abosolute value). This difference occurs because the indirect and direct effects for this mediation pathway are in opposite directions. This opposition leads to a smaller total effect and consequently a larger PM (considering the abosolute value). However, the APM is not affected by the directionality of the direct and indirect effects, providing a more stable measure of mediation.  - The APM is always smaller than 1, even when direct and indirect effects are in opposite directions. This is due to the calculation method of APM, which uses the sum of the absolute values of indirect and direct effects as the denominator. ::: To further explore the indirect, direct, and total effects of different combinations of mediations, you can run more mediation analysis at Step 5 in the TriplotGUI interface . Add the mediation result from `getMediationCounterfactual()` into the TPObject. ```{r showcode_runcode_showoutput_12} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true TPObject4 <- addMediation(TPObject = TPObject3, mediationObject = mediation_object3) ```                                                                                                                                                                                 ## Comparative Visualization The `checkTPO()` function in TriplotGUI allows users to generate heatmaps from the TPObject to visualize correlations, risk estimations, and mediation results. ```{r showcode_runcode_noshowoutput_9} #| eval: true #| echo: true #| output: false #| classes: styled-output #| fold.plot: true checkTPObject4 <- checkTPO(TPObject4, mediatorEstimates = c("IE","TE","PM","APM"), ## What type of mediation estimate or proportion mediated to show on the heatmap mediators = c("module1", "module2"), ## What mediators (principal component) you want to show exposures = c("Hamburger","BSDS"), ## What exposure variables you want to show outcomes = c("T2D","BMI") ## What outcome variables you want to show ) ``` Correlation coefficients and p values between modules and exposures: ```{r showcode_runcode_showoutput_15A} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" checkTPObject4$corr_coefficients ``` Risk estimates(beta coefficients) and p values between modules and outcomes: ```{r showcode_runcode_showoutput_15B} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" checkTPObject4$risk_coefficients ``` Mediation estimates and p values for indirect effect(IE) and total effect(TE), proportion mediated(PM) and adjusted proportion mediated(APM): Read [Manual](workflow.qmd){target="_blank"} for more details: ```{r showcode_runcode_showoutput_15C} #| eval: true #| echo: true #| output: true #| classes: styled-output #| fold.plot: true #| out-width: "100%" checkTPObject4$med_coefficients ```           ::: callout-note - Significance is indicated as follows. One star for p\<0.05; Two stars for p\<0.01; Three stars for p\<0.001. - The first two plots, correlation coefficients with exposures and risk estimates of outcomes will show all the exposure and outcome variables that are not removed from visualization. - Note that in the plot of `checkTPbject4$med_coefficients`, nly the selected exposures, mediators and outcomes that we use to do mediation on will show up in the mediation results. The rows in the heatmaps are mediators and the column represents exposure-outcome pairs. For each exposure-outcome pair, 4 results are shown: *IE* (indirect effect), *TE* (total effect), *PM*(Proportion Mediated) and *APM*(Adjusted Proportion Mediated). Please read [Manual](https://yingxiaoyan.gitlab.io/triplotgui_tutorial/workflow.html#5.2){target="_blank"} for more information. The "CF" before the names of exposures-outcome pairs means that this mediation is using the Counterfactual method. ::: ## Data download To save all your output, including data, results, and visualization output as an .rda file, you can use the `save()` function as an rda file. ```{r showcode_noruncode_noshowoutput_1} #| eval: false #| echo: true #| warning: false #| error: true #| classes: styled-output save(exposure2,Omics2,outcome2,covariates2,auxilariy2, reduced_Omics2,Correlations_object,Risks_object, mediation_object3,mediation_plots3, TPObject1,TPObject2,TPObject3,TPObject4, checkTPObject4, "Tutorial_complex_output.rda") ```