Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plotRDA, plot loadings #92

Closed
TuomasBorman opened this issue Sep 7, 2023 · 13 comments
Closed

plotRDA, plot loadings #92

TuomasBorman opened this issue Sep 7, 2023 · 13 comments
Assignees

Comments

@TuomasBorman
Copy link
Contributor

In OMA, there is an example on how to plot loadings of features in dbRDA. We could consider to generalize this to plotting function if that is a common way to analyze which features drive the difference

@TuomasBorman
Copy link
Contributor Author

#87

@ElySeraidarian
Copy link
Contributor

ElySeraidarian commented Jul 4, 2024

I would like to provide some examples on how to plot these feature loadings after talking with @antagomir about this.

There would be 2 ways of plotting :

This requires the latest github version of mia.

  1. With tree, this option means that the tree need to be agglomerated to the same features as the loadings matrix we get after performing the reduction method (this is now possible for LDA & NMF methods as agglomerateByPrevalence has been fixed to be possible to agglomerate tree). Here are quick examples for each of these reduction methods :
  • PCA
library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(
  length(
    unique(
      df$Phylum
    )
  )
)
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label

df2 <- data.frame(abs(loadings_matrix))
rownames(df2) <- phylo$tip.label

# Plot the tree with first inner circle (Classes)
p <- gheatmap(
  p = circ,
  data = df,
  offset = -.1,
  width = .1,
  colnames_angle = 95,
  colnames_offset_y = .5,
  font.size = 4,
  color = "black") +
  ggplot2::scale_fill_manual(
    values = color,
    name = "Class"
  )
# Plot the feature loadings in different circles
for(i in 1:2){
  if(i == 1){
    p <- p +
      ggnewscale::new_scale_fill()
  }
  df3 <- dplyr::select(
    df2, (all_of(i))
  )

  p <- gheatmap(
    p,
    df3,
    offset = i*.065,
    width = .1,
    colnames_angle = 90,
    font.size = 4,
    high = "darkslateblue",
    low = "gray98",
    color = "black",
    legend_title = expression(beta[k]))
}

p
  • NMF
library(NMF)
library(mia)
library(ggtree)
data(GlobalPatterns)
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Perform the NMF reduction method
x <- t(assay(tse, "counts"))
nmf2 <- nmf(x, 2)
H <- nmf2@fit@H
W <- nmf2@fit@W
# Get the feature loadings
feature_loadings <- t(H)
phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(
  length(
    unique(
      df$Phylum
    )
  )
)
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label

df2 <- data.frame(feature_loadings)
rownames(df2) <- phylo$tip.label
# Plot the tree and the first inner circle (Classes)
p <- gheatmap(
  p = circ,
  data = df,
  offset = -.1,
  width = .1,
  colnames_angle = 95,
  colnames_offset_y = .5,
  font.size = 4,
  color = "black") +
  ggplot2::scale_fill_manual(
    values = color,
    name = "Class"
  )
# Plot the feature loadings in different circles
for(i in 1:2){
  if(i == 1){
    p <- p +
      ggnewscale::new_scale_fill()
  }
  df3 <- dplyr::select(
    df2, (all_of(i))
  )
  p <- gheatmap(
    p,
    df3,
    offset = i*.065,
    width = .1,
    colnames_angle = 90,
    font.size = 4,
    high = "darkslateblue",
    low = "gray98",
    color = "black",
    legend_title = expression(beta[k]))
}
p
  • LDA
library(mia)
library(topicmodels)
library(ggtree)

data(GlobalPatterns)
tse <- GlobalPatterns

tse <- transformAssay(tse, method = "relabundance")
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank = "Class", update.tree = TRUE, prevalence = 0.99)
# Perform LDA reduction method
lda_model <- LDA(t(assay(tse, "counts")), k = 2)

df <- as.data.frame(t(assay(tse, "counts")))

posteriors <- topicmodels::posterior(lda_model, df)
# Get the feature loadings
feature_loadings <- t(as.data.frame(posteriors$terms))


phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(
  length(
    unique(
      df$Phylum
    )
  )
)
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label

df2 <- data.frame(feature_loadings)
rownames(df2) <- phylo$tip.label
# Plot the tree and first inner circles (Classes)
p <- gheatmap(
  p = circ,
  data = df,
  offset = -.1,
  width = .1,
  colnames_angle = 95,
  colnames_offset_y = .5,
  font.size = 4,
  color = "black") +
  ggplot2::scale_fill_manual(
    values = color,
    name = "Class"
  )
# Plot the feature loadings in different circles
for(i in 1:2){
  if(i == 1){
    p <- p +
      ggnewscale::new_scale_fill()
  }
  df3 <- dplyr::select(
    df2, (all_of(i))
  )
  p <- gheatmap(
    p,
    df3,
    offset = i*.065,
    width = .1,
    colnames_angle = 90,
    font.size = 4,
    high = "darkslateblue",
    low = "gray98",
    color = "black",
    legend_title = expression(beta[k]))
}
p

This is the exact same process for every reduction method, the only thing that changes is how the loadings matrix is retrieved and stored which is why wrappers discussed in microbiome/mia#596 could be helpful. As also discussed it might be possible to recognize the type of reduction method so that the user do not have to specify method when calling function.
Note that we agglomerate with a high prevalence in order to have only few features so that it can be understandable (and that the legend does not get out of the screen).

  1. Without tree, I have made an example for PCA, there are still things that can be changed (e.g. adding coefficients inside circles?, should we consider only absolute values?)
library(mia)
library(ggtree)
library(scater)
library(ggcorrplot)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")

# Plot feature loadings
ggcorrplot(abs(loadings_matrix),
           type = "lower",
           method="circle",
           title="Feature loadings",
           ggtheme=theme_bw) +
  scale_fill_gradient2(breaks=c(0, 1), limit=c(0, 1))

@ElySeraidarian ElySeraidarian self-assigned this Jul 4, 2024
@antagomir
Copy link
Member

Thanks @ElySeraidarian

Could we formulate a miaViz-style function for one of these, so that the plotting can be done on a single line of code, given that the user provides a TreeSE object that has a) feature loading matrix (in some position that we can change later, i.e. in rowData or in metadata or in attr) and b) (optionally) a rowTree.

In the first round at least we can assume that the user has themselves done the necessary agglomerations etc. There are so many considerations on those things that we cannot automated all possible scenarios on the data sizes anyway.

@TuomasBorman
Copy link
Contributor Author

Looks very nice and interesting. However, I am wondering if that is the most usual way to plot loadings ( I don't know). At least for eigenvalues the most usual way might be scree plot (bar plot). We could have function to plot both loadings and eigenvalues perhaps since both are needed to describe the ordination results fully.

library(mia)
library(ggplot2)
library(patchwork)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", assay.type = "clr")
# Get the feature loadings
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
# Add taxonomy labels to column
df[["feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
# Get top 50 loadings
df <- df[ order(abs(df[["PC1"]]), decreasing = TRUE)[1:50], ]
# Order loadings and add factor to keep the order
df <- df[ order(df[["PC1"]]), ]
df[["feature"]] <- factor(df[["feature"]], levels = df[["feature"]])

# Loadings plot
p1 <- ggplot(df, aes(x = .data[["PC1"]], .data[["feature"]])) +
    geom_bar(stat="identity")

eig <- attr(reducedDim(tse, "PCA"), "varExplained")
eig <- eig / sum(eig)
df <- data.frame(eig_vals = eig)
# Get percentage of explained
df[["PC"]] <- paste0("PC", seq_len(nrow(df)))
df[["PC"]] <- factor(df[["PC"]], levels = df[["PC"]])
# Scree plot
p2 <- ggplot(df, aes(x = .data[["PC"]], .data[["eig_vals"]])) +
    geom_bar(stat="identity")

p1 + p2

image

@antagomir
Copy link
Member

antagomir commented Jul 4, 2024

Only some ordination methods provide eigenvalues (e.g. PCA, MDS) and I think only in PCA (out of the commonly used ordinations) the eigenvalues have an actual clear + commonly used interpretation. Thus a general function for this might not be feasible.

It would be interesting to experiment with different ways to visualize loading matrices. The barplot/screeplot is indeed very common. If we have generic function to visualize loadings, then this could give different options, like bar/screeplot, heatmap, etc.

However, heatmap (options 2) seems to me like a good alternative. In there, one should show also the sign of the loading (e.g. with different colors for negative and positive values); also do not use two different indicators (like size and color) to denote the same thing. If you could use native ggplot2 instead of ggcorrplot that could be useful (reducing package dependencies is in general good). I would swap the figure for better readability of species names.

The tree + heatmap is very nice since usually the loadings do not show the tree but this can be useful. The idea has been borrowed from Susan Holmes' phyloseq work.

@ElySeraidarian
Copy link
Contributor

Thanks for your returns, indeed it will be possible to create a call for a one line function that will do the plotting.
We can add different options to visualize loadings so that the user will find the best for himself.

I have not been able to plot the exact same plot using ggplot2 but here is an example I have build based on your returns

library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
df[["Feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
colours = c("red","white", "blue")


dfgroup <- data.frame(PC = c(df$PC1,df$PC2), Feature = c(df$Feature, df$Feature), group = c(rep("PC1",length(df$PC1)), rep("PC2", length(df$PC2))))

# Plot feature loadings
ggplot(dfgroup, aes(x = .data[["PC"]], y = .data[["Feature"]], group = as.factor(group), col = as.factor(group)))  +
  geom_point(size=6) +
  labs(title="Feature Loadings Plot") +
  scale_colour_manual(name = "Principal Components",
   values = c("red","blue")) +
  xlim(-1, 1)

image

It is a bit different as the loadings value differs from y-axis and there is no colour change for positive/negative values.

So here is also the improved version using ggcorrplot in case we want to use it

library(ggcorrplot)
library(mia)
library(ggtree)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
# Plot feature loadings
ggcorrplot(loadings_matrix,
           type = "lower",
           method="circle",
           title="Feature loadings",
           colors = colours,
           ggtheme=theme_bw) + coord_flip()

image

@antagomir
Copy link
Member

I would not combine the two axes like in the first plot. PC1 and PC2 we wouldn't compare directly in most cases. The original ggcorr plot looked better in fact.

A related idea, how about plotting just one axis at a time (as heatmap), sorting the values in increasing order, and then using colors to illustrate the increasing values (instead of bars)?

Probably the standard barplot / screeplot would be useful to include as option as well.

Note: often the user is interested on just a few top features (e.g. 5-10). This might help to find additional complementary options, too.

To get fwd, can we create one function that has the more standard barplot, and one heatmap version as options?

@ElySeraidarian
Copy link
Contributor

I will open a PR tomorrow with the function that would include every plotting method we talked about and should be operational at least for PCA (waiting for other methods to have same way to retrieve loadings).

Here is the example I made based on your returns.

library(mia)
library(ggtree)
library(scater)
library(patchwork)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
df[["Feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
df <- df[ order(abs(df[["PC1"]]), decreasing = TRUE)[1:10], ]
df <- df[ order(df[["PC1"]]), ]
df[["Feature"]] <- factor(df[["Feature"]], levels = df[["Feature"]])

df1 <- data.frame(PC1 = round(df$PC1,2), Feature = df$Feature)
df2 <- data.frame(PC2 = round(df$PC2,2), Feature = df$Feature)


# Plot feature loadings
p1 <- ggplot(df1, aes(x = "PC1", y = Feature, label = PC1))  +
  geom_point(aes(col = PC1), size=12) +
  scale_color_continuous(type = "gradient", limits = c(-1,1)) +
  geom_text(color="white", size=4) + 
  labs(title="Feature Loadings Plot") 
  
p2 <- ggplot(df2, aes(x = "PC2", y = Feature, label = PC2))  +
    geom_point(aes(col = PC2), size=12) +
    scale_color_continuous(type = "gradient", limits=c(-1,1)) +
    geom_text(color="white", size=4) +
  labs(title="Feature Loadings Plot")

p1 + p2

image

@antagomir
Copy link
Member

This could

  1. support loading matrices directly (numeric matrix of features x components); if it also supports TreeSE that is a bonus

  2. color scale should be two-color scale (negative values blue, positive red); to me it could look better if the colors were shown by squares rather than circles but this is a minor detail and if the plotting function works well otherwise, it should be relatively straightfwd to test alternative visualizations.

@ElySeraidarian
Copy link
Contributor

This one should be finally good

library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
df[["Feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
df <- df[ order(abs(df[["PC1"]]), decreasing = TRUE)[1:10], ]
df <- df[ order(df[["PC1"]]), ]
df[["Feature"]] <- factor(df[["Feature"]], levels = df[["Feature"]])

df1 <- data.frame(PC1 = round(df$PC1,2), Feature = df$Feature)
df2 <- data.frame(PC2 = round(df$PC2,2), Feature = df$Feature)


# Plot feature loadings
p1 <- ggplot(df1, aes(x = "PC1", y = Feature, label = PC1))  +
  geom_point(aes(fill = PC1), size=15, shape = 22) +
  scale_fill_gradient2(limits = c(-1,1)) +
  geom_text(color="black", size=4) + 
  labs(title="Feature Loadings Plot") 
  
p2 <- ggplot(df2, aes(x = "PC2", y = Feature, label = PC2))  +
    geom_point(aes(fill = PC2), size=15, shape = 22) +
    scale_fill_gradient2(limits=c(-1,1)) +
    geom_text(color="black", size=4) +
  labs(title="Feature Loadings Plot")

p1 + p2

image

@antagomir
Copy link
Member

Great!

Can you prepare an actual miaViz function that takes care of this?

@TuomasBorman
Copy link
Contributor Author

microbiome/mia#604

@antagomir
Copy link
Member

@ElySeraidarian can you close this issue as it becomes ready?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants