E14_kidney_analysis.Rmd

---
title: "E14_mouseKidney"
output:
  html_document:
    css: ~/Documents/haiyin.css
    toc: yes
    toc_depth: 4
  pdf_document:
    toc: yes
    toc_depth: '4'
  word_document: default
---

# Single-cell RNA-seq analysis of E14.5 mouse kidney
Dr. Potter's group published an article comparing scRNA-seq results on E14.5 mouse kidney across three platforms: Drop-seq, 10x, and Fluidigm. They identified and confirmed 16 distinct cell populations at this stage. I would like to use this information to generate a signature gene profile for estimating cell subtype proportions.

Bliss Magella; Mike Adam; Andrew S Potter; Meenakshi Venkatasubramanian; Kashish Chetal; Stuart B Hay; Nathan Salomonis; Steven S Potter. Cross-platform single cell analysis of kidney development shows stromal cells express Gdnf. Developmental Biology. 2017. DOI: 10.1016/j.ydbio.2017.11.006, PMID: 29183737

These data are deposited in the Gene Expression Omnibus (Accession GSE104396).
All processed datasets, ICGS results and classification outputs have been further deposited in the online Synapse data portal (https://www.synapse.org/#!Synapse:syn11001759) and are available through an interactive web browser (http://altanalyze.org/ICGS/Kidney.php).
I downloaded the data from the GEO website.

## Data description
The authors describe their processing of each dataset as follows.

### Drop-seq data
Read2 was aligned with bowtie2-2.2.7 using the -k 1 option. The aligned reads were tagged with their corresponding UMI and barcode from read1. Each aligned read was tagged with its gene name. An expression matrix was generated by counting the number of unique UMIs per gene per cell. Reads were normalized by dividing the read count of each gene by the total number of reads per cell and multiplying by 10000.
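
As a minimal sketch of the normalization described above (not the authors' code), the per-cell scaling can be written in base R. The toy matrix `counts` and its values are hypothetical.

```r
# Toy UMI count matrix (hypothetical values): genes in rows, cells in columns
counts <- matrix(c(5, 0, 15,
                   10, 10, 0),
                 nrow = 3,
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 c("cell1", "cell2")))

# Divide each gene's count by the total count of its cell, then multiply by 10000
normalized_counts <- sweep(counts, 2, colSums(counts), "/") * 10000

# After this scaling every cell sums to 10000
colSums(normalized_counts)
```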

### 10x data
Raw sequencing data was processed through 10X Genomics CellRanger v1.3.0 using default parameters to obtain the bam files. CellRanger produced two expression matrices: raw reads and counts-per-million normalized.


```{r setup, echo=FALSE,message=FALSE,warning=FALSE}
Sys.setenv(TZ = "America/New_York")
library(Seurat)
library(cowplot)
```

## Loading single-cell RNA-seq data

#### Drop-seq data

```{r loading data drop-seq, cache=TRUE,echo=TRUE,eval=FALSE}
droseq.data <- read.table(file = "~/Documents//E14_MouseKidney/GSM2796988_run1605_2000_normalized.txt",header=TRUE)
data <-droseq.data[,2:ncol(droseq.data)]
SYMBOL<-droseq.data[,1]
aggdata_genename <-aggregate(data, by=list(SYMBOL),FUN=sum, na.rm=TRUE)  
row.names(aggdata_genename) <-aggdata_genename[,1]
aggdata_genename<-aggdata_genename[,-1]

cat("A small proportion of cells have fewer than 1000 genes, so I removed them from the analysis")
## The mitochondrial genes have been eliminated from the data.

droseq <- CreateSeuratObject(raw.data = aggdata_genename, min.cells = 3, min.genes = 1000)
VlnPlot(object = droseq, features.plot = c("nGene", "nUMI"), nCol = 3)
droseq <- NormalizeData(object = droseq) 
droseq <- ScaleData(object = droseq,vars.to.regress = c("nUMI"))
droseq <- FindVariableGenes(object = droseq, do.plot = FALSE)
```

#### 10x data

```{r loading data 10x, cache=TRUE,echo=TRUE,eval=FALSE}
tenx.data <- read.table(file = "~/Documents//E14_MouseKidney/GSM2796989_E14_5_WT_10X_matrix_CPM.txt",header=TRUE)
data <-tenx.data[,2:ncol(tenx.data)]
SYMBOL<-tenx.data[,1]
aggdata_genename <-aggregate(data, by=list(SYMBOL),FUN=sum, na.rm=TRUE)  
## sum the expression values of gene (by gene name)
row.names(aggdata_genename) <-aggdata_genename[,1]
aggdata_genename<-aggdata_genename[,-1]
cat("As in the Drop-seq data, a small proportion of cells have fewer than 1000 genes.")
tenx <- CreateSeuratObject(raw.data = aggdata_genename,min.cells = 3, min.genes = 1000)
VlnPlot(object = tenx, features.plot = c("nGene", "nUMI"), nCol = 3)

## The mitochondrial genes have been eliminated from the data. Therefore I only regressed for nUMI.
tenx <- NormalizeData(object = tenx)
tenx <- ScaleData(object = tenx,vars.to.regress = c("nUMI"))
tenx <- FindVariableGenes(object = tenx, do.plot = FALSE)
```

### Selecting variable genes from the two datasets

I selected the top 2000 highly variable genes from each dataset and took their union.

```{r finding overlapped genes in both dataset, echo=TRUE, cache=TRUE,eval=FALSE}
hvg.droseq <- rownames(x = head(x = droseq@hvg.info, n = 2000))
hvg.tenx <- rownames(x = head(x = tenx@hvg.info, n = 2000))
hvg.union <- union(x = hvg.droseq, y = hvg.tenx)

## also added up protocol information
tenx@meta.data[, "protocol"] <- "10X"
droseq@meta.data[, "protocol"] <- "droseq"
```
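
The chunk above combines the two gene lists with `union()`, so a gene only has to be highly variable in one dataset to be kept. A minimal sketch of the difference from `intersect()`, using hypothetical gene names and dispersion values:

```r
# Hypothetical dispersion tables, already sorted from most to least variable
hvg_info_a <- data.frame(dispersion = c(3.1, 2.4, 1.9, 0.2),
                         row.names = c("GeneA", "GeneB", "GeneC", "GeneD"))
hvg_info_b <- data.frame(dispersion = c(2.8, 2.2, 1.5, 0.3),
                         row.names = c("GeneA", "GeneE", "GeneB", "GeneF"))

top_n <- 2  # the analysis above uses 2000
top_a <- rownames(head(hvg_info_a, n = top_n))
top_b <- rownames(head(hvg_info_b, n = top_n))

genes_union  <- union(top_a, top_b)      # variable in at least one dataset
genes_common <- intersect(top_a, top_b)  # variable in both datasets
```

With real data the union is at least as large as either list, which keeps platform-specific variable genes available to the CCA.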

### Running CCA to find the dimensions for adjustment

```{r Running cca, echo=TRUE,cache=TRUE,eval=FALSE}
kidney <- RunCCA(object = tenx, object2 = droseq, genes.use = hvg.union)
save(kidney, file="~/Documents/E14_MouseKidney/kidney.Robj") ## I stored intact as robject
```

After running CCA, I saved the unaligned ("intact") object.

### Testing differences between datasets

```{r plot cca, echo=TRUE,fig.align="center",fig.width=6,fig.height=6}
load("~/Documents/E14_MouseKidney/kidney.Robj")
p1 <- DimPlot(object = kidney, reduction.use = "cca", group.by = "protocol", pt.size = 0.2, 
    do.return = TRUE)
p2 <- VlnPlot(object = kidney, features.plot = "CC1", group.by = "protocol", do.return = TRUE, point.size = 0.1)
plot_grid(p1, p2)
```

### Calculating variation between the datasets

```{r plot dim_heatmap, echo=TRUE,cache=TRUE}
kidney <- CalcVarExpRatio(object = kidney, reduction.type = "pca", grouping.var = "protocol", 
    dims.use = 1:15)
```

### Eliminating cells with high variance

I discarded cells for which the ratio of variance explained by CCA to variance explained by PCA was below 0.5, that is, cells where CCA explains less than half as much variance as PCA.
This is the default setting used in the Seurat CCA guideline.

```{r eliminating high variance cells,cache=TRUE, echo=TRUE}
kidney.all.save <- kidney
kidney <- SubsetData(object = kidney, subset.name = "var.ratio.pca", accept.low = 0.5)
kidney.discard <- SubsetData(object = kidney.all.save, subset.name = "var.ratio.pca", 
    accept.high = 0.5)
median(x = kidney@meta.data[, "nGene"])
median(x = kidney.discard@meta.data[, "nGene"])
```
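
The filtering rule above can be sketched in base R, assuming the ratio is stored per cell in a metadata column named `var.ratio.pca` as in the chunk above. The ratio values are hypothetical, and how Seurat treats a cell exactly at the threshold is not assumed here (no toy value equals 0.5).

```r
# Hypothetical per-cell ratios of variance explained (CCA relative to PCA)
meta <- data.frame(var.ratio.pca = c(0.90, 0.45, 1.20, 0.30, 0.75),
                   row.names = paste0("cell", 1:5))
threshold <- 0.5

# Cells above the threshold are kept; cells below it are set aside for inspection
kept_cells      <- rownames(meta)[meta$var.ratio.pca > threshold]
discarded_cells <- rownames(meta)[meta$var.ratio.pca < threshold]
```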

I plotted one representative gene whose expression in the discarded cells differed between the datasets.

```{r discarded data view,echo=TRUE,fig.align="center",fig.width=6,fig.height=6}
VlnPlot(object = kidney.discard, features.plot = "Icam2", group.by = "protocol")
```

### Heatmaps to decide the dimensions of CCs for the adjustment
I plotted heatmaps to show the genes that contributed to the variation of each CC.

```{r Heatmap dimention,echo=TRUE,fig.align="center",fig.width=6,fig.height=6}
DimHeatmap(object = kidney, reduction.type = "cca", cells.use = 500, dim.use = 1:9, 
    do.balanced = TRUE)
DimHeatmap(object = kidney, reduction.type = "cca", cells.use = 500, dim.use = 10:18, 
    do.balanced = TRUE)
```

I selected CC1 to CC15 for the adjustment.

### Adjusting for CCs

The datasets were adjusted for CCs (CC1 to CC15) to remove platform-specific effects.

```{r Adjusting data using cca, echo=TRUE,cache=TRUE}
kidney <- AlignSubspace(object = kidney, reduction.type = "cca", grouping.var = "protocol", 
    dims.align = 1:15)
save(kidney, file="~/Documents/E14_MouseKidney/kidney2.Robj")
```

I saved the object as "~/Documents/E14_MouseKidney/kidney2.Robj".

I plotted the distributions of the aligned CCs (ACC1 and ACC2) for Drop-seq and 10x.

```{r Plotting adjusted data,echo=TRUE,fig.align="center",fig.width=6,fig.height=6}
p1 <- VlnPlot(object = kidney, features.plot = "ACC1", group.by = "protocol", 
    do.return = TRUE, point.size = 0.1)
p2 <- VlnPlot(object = kidney, features.plot = "ACC2", group.by = "protocol", 
    do.return = TRUE, point.size = 0.1)
plot_grid(p1, p2)
```

After alignment, both datasets show similar distributions.

### Obtaining cell clusters

I used the aligned CC1 to CC15 and a resolution of 1.2 to identify the cell clusters.

```{r Clustering cells, cache=TRUE,echo=TRUE,fig.align="center",fig.width=6,fig.height=6}
kidney <- RunTSNE(object = kidney, reduction.use = "cca.aligned", dims.use = 1:15, 
    do.fast = TRUE)
kidney <- FindClusters(object = kidney, reduction.type = "cca.aligned", dims.use = 1:15, resolution = 1.2,
    save.SNN = TRUE)

p1 <- TSNEPlot(object = kidney, group.by = "protocol", do.return = TRUE, pt.size = 0.2)
p2 <- TSNEPlot(object = kidney, do.return = TRUE, pt.size = 0.2,do.label = T)
plot_grid(p1, p2)

save(kidney,file="~/Documents/E14_MouseKidney/kidney_mimic.Robj")
```

The object was saved as "~/Documents/E14_MouseKidney/kidney_mimic.Robj".

This is the final object.

I also examined the relationships among the cell clusters with a cluster tree.

```{r Plotting cluter tree,cache=TRUE,echo=TRUE,fig.align="center",fig.width=10,fig.height=4}
kidney_2 <- BuildClusterTree(kidney, do.plot = TRUE)
```

### Identifying clusters
I plotted several known marker genes to identify each cluster.

```{r plotting known gene,echo=FALSE,warning=FALSE,message=FALSE}
vp1<-VlnPlot(object = kidney, features.plot = c("Krt8"),point.size.use = 0)
vp2<-VlnPlot(object = kidney, features.plot = c("Ret"),point.size.use = 0)
vp3<-VlnPlot(object = kidney, features.plot = c("Calb1"),point.size.use = 0)
vp4<-VlnPlot(object = kidney, features.plot = c("Slc12a1"),point.size.use = 0)
vp5<-VlnPlot(object = kidney, features.plot = c("Lhx1"),point.size.use = 0)
vp6<-VlnPlot(object = kidney, features.plot = c("Hes5"),point.size.use = 0)
vp7<-VlnPlot(object = kidney, features.plot = c("Mafb"),point.size.use = 0)
vp8<-VlnPlot(object = kidney, features.plot = c("Hnf4a"),point.size.use = 0)
vp9<-VlnPlot(object = kidney, features.plot = c("Wnt4"),point.size.use = 0)
vp10<-VlnPlot(object = kidney, features.plot = c("Crym"),point.size.use = 0)
vp11<-VlnPlot(object = kidney, features.plot = c("Icam2"),point.size.use = 0)
vp12<-VlnPlot(object = kidney, features.plot = c("Crym"),point.size.use = 0)
vp13<-VlnPlot(object = kidney, features.plot = c("Dlk1"),point.size.use = 0)
vp14<-VlnPlot(object = kidney, features.plot = c("Foxd1"),point.size.use = 0)
vp15<-VlnPlot(object = kidney, features.plot = c("Snai2"),point.size.use = 0)
vp16<-VlnPlot(object = kidney, features.plot = c("Rprm"),point.size.use = 0)
vp17<-VlnPlot(object = kidney, features.plot = c("Pcp4"),point.size.use = 0)
vp18<-VlnPlot(object = kidney, features.plot = c("Snai2"),point.size.use = 0)
vp19<-VlnPlot(object = kidney, features.plot = c("Penk"),point.size.use = 0)
vp20<-VlnPlot(object = kidney, features.plot = c("Alx1"),point.size.use = 0)
vp21<-VlnPlot(object = kidney, features.plot = c("Wnt7b"),point.size.use = 0)
vp22<-VlnPlot(object = kidney, features.plot = c("Wnt11"),point.size.use = 0)
vp23<-VlnPlot(object = kidney, features.plot = c("Dkk2"),point.size.use = 0)
vp24<-VlnPlot(object = kidney, features.plot = c("Wisp1"),point.size.use = 0)
```


```{r plotting violin plots of known gene expression,fig.align="center", fig.width=12, fig.height=12}
plot_grid(vp1, vp2,vp3,vp4,vp5,vp6, ncol = 3, align = 'v')
plot_grid(vp7, vp8,vp9,vp10,vp11,vp12, ncol = 3, align = 'v')
plot_grid(vp13, vp14,vp15,vp16,vp17,vp18, ncol = 3, align = 'v')
plot_grid(vp19, vp20,vp21,vp22,vp23,vp24, ncol = 3, align = 'v')
```

### Finding marker genes to identify each cell cluster

I selected positive marker genes that were expressed in at least 30% of the cells in a cluster and whose log fold difference between that cluster and the other clusters exceeded 0.59 (approximately log2(1.5), i.e. a 1.5-fold difference). The "bimod" test was used.

```{r Finding marker genes, cache=TRUE,echo=TRUE, fig.align="center", fig.width=6, fig.height=6}
kidney.markers <- FindAllMarkers(kidney, only.pos = TRUE, min.pct = 0.3,test.use="bimod",logfc.threshold =0.59)
save(kidney.markers,file = "~/Documents/E14_MouseKidney//kidney_markers.Robj")
write.table(kidney.markers,file="~/Documents/E14_MouseKidney/kidney.markers.csv",sep=",")
head(kidney.markers)
cat("Number of the marker gene = ",nrow(kidney.markers))
```

I saved the results as "~/Documents/E14_MouseKidney//kidney_markers.Robj" and "~/Documents/E14_MouseKidney/kidney.markers.csv".
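
As a small worked check of the two thresholds passed to `FindAllMarkers` above (`logfc.threshold = 0.59`, `min.pct = 0.3`), the arithmetic can be sketched in base R. The expression vector is hypothetical. Whether the installed Seurat version interprets the threshold on a log2 or natural-log scale is an assumption to verify against its documentation.

```r
fold_change <- 1.5
log2_fc <- log2(fold_change)  # about 0.585, close to the 0.59 used above
ln_fc   <- log(fold_change)   # about 0.405 if the scale were natural log

# Hypothetical expression of one gene across the 10 cells of a cluster
expr_in_cluster <- c(0, 2.1, 0, 1.3, 0.7, 0, 0, 3.0, 0, 0.4)

# Fraction of cells in which the gene is detected, compared with min.pct
pct_expressed  <- mean(expr_in_cluster > 0)
passes_min_pct <- pct_expressed >= 0.3
```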

### Generating signature gene profile

For each signature gene, I calculated the median of the scaled expression values in each cell cluster and transformed it as `10^median`.

```{r Generating signature gene list, cache=TRUE,echo=TRUE, fig.align="center", fig.width=6, fig.height=6}
normalized <- data.frame(kidney@scale.data)
celltypes <- data.frame(kidney@ident)
geneLIST <- c(kidney.markers$gene)

## keep only the marker (signature) genes
sig_gene <- normalized[which(row.names(normalized) %in% geneLIST), ]

## For each cluster (0 to 16), take the per-gene median of the scaled expression
## across the cells of that cluster and transform it as 10^median.
clusters <- 0:16
signature_data <- sapply(clusters, function(k) {
  cluster_cells <- sig_gene[, which(celltypes == k), drop = FALSE]
  10^apply(as.matrix(cluster_cells), 1, median)
})

colnames(signature_data) <- paste0("cluster_", clusters)
row.names(signature_data) <- row.names(sig_gene)

write.table(signature_data,"~/Documents/E14_MouseKidney//singarture_scRNA_kidney_c16.txt",sep="\t",row.names = T,col.names = T,quote=F)
```

The obtained signature gene profile was stored as "~/Documents/E14_MouseKidney//singarture_scRNA_kidney_c16.txt"

# Testing known datasets

Then, I tested whether the signature genes accurately estimate cell subtype proportions, using two GEO datasets.

Lim1 deletion results in nephron-deficient kidneys, and Sall1 deletion alters differentiation of the mesenchyme.

```{r setting 2,echo=FALSE,message=FALSE, warning=FALSE}
library(lrcde)
library(GEOquery)
library(sva)
library(openxlsx)
library(Biobase)
library(ggplot2) ## ggplot() is called directly in the plots below
library(dplyr)
library(annotate)
library(org.Mm.eg.db)
library(mouse4302.db)
library(affy)
library(reshape)
library(dendextend)
library(limma)
library(RColorBrewer)
library(gplots)
```


### GSE45845
Kidneys at E14.5 were obtained from nephron progenitor-specific Sall1 deletion (2 sets) and inducible Sall1 deletion 48 hrs after tamoxifen treatment (1 set).

Kanda S, Tanigawa S, Ohmori T, Taguchi A et al. Sall1 maintains nephron progenitors and nascent nephrons by acting as both an activator and a repressor. J Am Soc Nephrol 2014 Nov;25(11):2584-95. PMID: 24744442

The dataset has three WT and three mutant (conditional KO) samples.
Strain: hybrid of 129 and C57BL/6

I loaded the GEO data, obtained the gene names from GPL7202-9760, and aggregated the expression values by gene name (mean of the entries sharing a gene name).

```{r GSE45845 loading data, cache=TRUE,fig.align="center", fig.width=6,fig.height=6}
gse.es <- getGEO(filename="~/Documents/E14_MouseKidney/GSE45845/GSE45845-GPL7202_series_matrix.txt.gz", GSEMatrix = TRUE, getGPL = FALSE)
e <- exprs(gse.es)
show(pData(phenoData(gse.es))[,1:2])
pData(phenoData(gse.es))$group <-c("WT","MUT","WT","MUT","WT","MUT")
data <- data.frame(exprs(gse.es))

Annot <- read.table(file="~/Documents/E14_MouseKidney/GSE45845/GPL7202-9760.txt",sep="\t",header=T,fill = TRUE)
Annot2 <-Annot[which(Annot$GENE_SYMBOL!=""),]
row.names(Annot2) <-Annot2[,1]
all <- merge(Annot2, data, by="row.names")

head(all)
SYMBOL<-all$GENE_SYMBOL
aggdata_all <-aggregate(all[,c(5:10)], by=list(SYMBOL),FUN=mean, na.rm=TRUE)  

write.table(aggdata_all,file="~/Documents//E14_MouseKidney/GSE45845/aggdata_expression.txt",sep="\t",row.names=FALSE,col.names=TRUE,quote=FALSE)

j <-hclust(dist(t(aggdata_all[,2:7])), "ward.D2")
plot(j)
```
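
The aggregation step used above (collapsing several probes of one gene into a single row by their mean) can be sketched on a toy table. The probe values and gene names are hypothetical.

```r
# Hypothetical probe-level table: two probes map to GeneA, one to GeneB
probes <- data.frame(GENE_SYMBOL = c("GeneA", "GeneA", "GeneB"),
                     sample1 = c(2, 4, 10),
                     sample2 = c(6, 8, 20))

# One row per gene symbol; each sample column holds the mean over that gene's probes
agg <- aggregate(probes[, c("sample1", "sample2")],
                 by = list(SYMBOL = probes$GENE_SYMBOL),
                 FUN = mean, na.rm = TRUE)
agg
```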

I estimated cell subtype proportions using CIBERSORT, with 1000 permutations for significance testing.

```{r GSE45845 loading CIBERSORT result,cache=TRUE,fig.align="center", fig.width=6,fig.height=6}

Est_CellType <-read.csv(file="~/Documents/E14_MouseKidney/GSE45845/CIBERSORT.Output_Job1.csv",header = T)
row.names(Est_CellType) <-Est_CellType[,1]
Est_CellType<-Est_CellType[,-1]
group <-c("WT","MUT","WT","MUT","WT","MUT")
pdata <-cbind(Est_CellType[,1:17],group)
pdata$sample <- row.names(pdata)
dat.m<-melt(pdata[,c(1:17,19)],id.vars = 'sample')

a <-c("pink","lightblue","pink","lightblue","pink","lightblue")
p6 <- ggplot(dat.m, aes(sample, value, fill = variable)) +
  geom_bar(position = "fill", stat = "identity") +
labs(x="", y="Percentage")+
  theme(axis.text=element_text(angle = 45, size=6))+
  theme(axis.text.x=element_text(colour = a))+
ggtitle("GSE45845: WT-pink,MUT-lightblue, Sall1CreERfSall1 E14.5 WT vs KO ") +
  theme(plot.title = element_text(hjust = 0, size=6))
p6
```

Cell subtype proportion differences between WT and mutant. I tested the difference in each cell subtype proportion with a t-test (R's `t.test`, which defaults to Welch's unequal-variance test).

```{r GSE45845 cell subtype proportion stats,cache=TRUE,fig.align="center", fig.width=6,fig.height=6}

classes1=as.factor(pdata$group)
test<-pdata[,1:17]
stats_val <-matrix(ncol=4,nrow=17)
for(i in 1:17){
  stats_val[i,1] <-mean((as.double(test[which(classes1=="WT"),i])))
  stats_val[i,2] <-mean((as.double(test[which(classes1=="MUT"),i])))
  stats_val[i,3] <-mean((as.double(test[which(classes1=="WT"),i])))-mean((as.double(test[which(classes1=="MUT"),i])))
  stats_val[i,4] <-t.test(as.double(test[which(classes1=="WT"),i]),as.double(test[which(classes1=="MUT"),i]))$p.value
}
row.names(stats_val)<-colnames(pdata)[1:17]
colnames(stats_val)<-c("Mean_WT","Mean_MUT","Diff(mean, WT-MUT)","pvalue(t.test)")
stats_val
```
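
The per-cell-type comparison above can also be written without an explicit loop. This is a minimal sketch on hypothetical proportions for two cell types, not a replacement for the chunk above. It reports means and a Welch t-test p-value, matching what `mean()` and `t.test()` compute.

```r
# Hypothetical estimated proportions for two cell types in six samples
props <- data.frame(cluster_0 = c(0.30, 0.10, 0.28, 0.12, 0.33, 0.09),
                    cluster_1 = c(0.20, 0.21, 0.19, 0.22, 0.18, 0.20))
group <- factor(c("WT", "MUT", "WT", "MUT", "WT", "MUT"))

# One row per cell type: group means, their difference, and the Welch t-test p-value
stats <- t(sapply(props, function(x) {
  wt  <- x[group == "WT"]
  mut <- x[group == "MUT"]
  c(mean_WT  = mean(wt),
    mean_MUT = mean(mut),
    diff     = mean(wt) - mean(mut),
    p_welch  = t.test(wt, mut)$p.value)
}))
stats
```

With only three samples per group, such p-values are best read as a rough guide.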


## Finding differentially expressed genes

```{r DE GSE45845,echo=FALSE,message=TRUE}
# design matrix
library(limma)
Est_CellType <-read.csv(file="~/Documents/E14_MouseKidney/GSE45845/CIBERSORT.Output_Job1.csv",header = T)
row.names(Est_CellType) <-Est_CellType[,1]
Est_CellType<-Est_CellType[,-1]
group <-c("WT","MUT","WT","MUT","WT","MUT")
pdata <-cbind(Est_CellType[,1:17],group)

design_NA = model.matrix(~0 + group,data=pdata)
colnames(design_NA)[1:2] = c("MUT","WT")

exp= read.table("~/Documents//E14_MouseKidney/GSE45845/aggdata_expression.txt",sep="\t",header = T)
row.names(exp)<-exp[,1]
exp <-exp[,-1]
# fit the linear model
all.equal(colnames(exp),rownames(pdata))
fit_NA = lmFit(exp,design_NA)

# create contrast matrix
contMatrix_NA = makeContrasts(WT-MUT,
                               levels=design_NA)
fit2_NA = contrasts.fit(fit_NA,contMatrix_NA)
fit2_NA = eBayes(fit2_NA)
# toptable
DE_NA = topTable(fit2_NA, coef=1, num=Inf)

# get the significant DEs
## P.Value < 0.001 & |logFC| > log2(2)
sig_NA = DE_NA[DE_NA$P.Value<0.001&abs(DE_NA$logFC)>log2(2),]
##

```
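
The chunk above renames the design columns to `c("MUT","WT")`. A quick base-R check shows why that order is correct: `model.matrix` orders the columns by factor level, and the levels of a character vector sort alphabetically. The `contrast` vector is only an illustration of what `WT-MUT` means over those columns.

```r
group <- c("WT", "MUT", "WT", "MUT", "WT", "MUT")

# No-intercept design: one indicator column per group
design <- model.matrix(~ 0 + group)
colnames(design)  # "groupMUT" "groupWT"

# Strip the "group" prefix instead of assigning names by position
colnames(design) <- sub("^group", "", colnames(design))

# WT - MUT expressed as weights over the design columns (illustration only)
contrast <- c(MUT = -1, WT = 1)
```

Deriving the names from the existing column names avoids mislabeling the groups if the level order ever changes.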

Testing PCs for adjustment

```{r Celltype vs. PCs of Exp,echo=FALSE,fig.align="center",fig.width=6,fig.height=6}
library(RColorBrewer)
exp_pca = prcomp(exp)
exp_pca_res = exp_pca$rotation

Cell_prop = pdata[,c(1:17)]
Cell_pca = prcomp(t(Cell_prop),scale=T)
Cell_pca_res = Cell_pca$rotation[,1:6]
all.equal(rownames(exp_pca_res),rownames(Cell_pca_res))

all.equal(rownames(pdata),rownames(Cell_prop))
Cell_exp_pca_res = data.frame(exp_pca_res,pdata)

lm22_pca_p = NULL
lm22_pca_r_sq = NULL
for (i in 1:6){
    sub_pca_p = NULL
    sub_pca_r_sq = NULL
    sub_LM22_pc = Cell_pca_res[,i]
    for (pc in 1:6){
        sub_exp_pc = exp_pca_res[,pc]
        
        sub_lm = lm(sub_exp_pc~sub_LM22_pc)
        
        sub_r_sq = summary(sub_lm)$adj.r.squared
        sub_p = summary(sub_lm)$coefficients[2,4]
        
        sub_pca_p = c(sub_pca_p,sub_p)
        sub_pca_r_sq = c(sub_pca_r_sq,sub_r_sq)
    }
    lm22_pca_p = cbind(lm22_pca_p,sub_pca_p)
    lm22_pca_r_sq = cbind(lm22_pca_r_sq,sub_pca_r_sq)
}
colnames(lm22_pca_p) = paste0("CellTypePro_pc",seq(1,6))
rownames(lm22_pca_p) = paste0("Exp_pc",seq(1,6))
colnames(lm22_pca_r_sq) = paste0("CellTypePro_pc",seq(1,6))
rownames(lm22_pca_r_sq) = paste0("Exp_pc",seq(1,6))

lm22_pca_log_p = -log10(t(lm22_pca_p))

breaks = unique(c(seq(0,3,0.025)))
hmcol2<-colorRampPalette(brewer.pal(9,"BuGn"))(80)
heatmap.2(lm22_pca_log_p, Rowv=FALSE, Colv=FALSE, dendrogram='none', trace='none', margins=c(10,10), col=hmcol2[1:which(breaks==round(max(lm22_pca_log_p)))-1],
          colsep=c(1:10), rowsep=c(1:20), sepwidth=c(0.05, 0.025),sub="Associations between PCs of Expression and PCs of cell subtype proportions",
          symm=F,symkey=F,symbreaks=F,breaks=breaks[1:which(breaks==round(max(lm22_pca_log_p)))],cexRow = 0.8,cexCol = 0.8)
```
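
Each cell of the heatmap above comes from one simple linear regression of an expression PC on a cell-proportion PC. A minimal sketch of a single such test on simulated vectors (the data and the names `pc_cell` and `pc_exp` are hypothetical):

```r
set.seed(1)
n <- 20
pc_cell <- rnorm(n)                        # stands in for a cell-proportion PC
pc_exp  <- 2 * pc_cell + rnorm(n, sd = 0.5)  # stands in for an expression PC

fit <- summary(lm(pc_exp ~ pc_cell))

adj_r_sq   <- fit$adj.r.squared        # strength of the association
p_value    <- fit$coefficients[2, 4]   # p-value of the slope
neg_log10p <- -log10(p_value)          # the quantity shown in the heatmap
```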


```{r differentially expressed genes, echo=TRUE}

LM22_pc_asso_max = apply(lm22_pca_log_p,1,max)

LM22_pc_asso_max
#which(LM22_pc_asso_max>2&as.matrix(summary(Cell_pca)$importance)[2,1:6]>0.01)
# No PC was selected with this criterion. The design below adjusts for PC1 and PC2 only.

design_LM22_pc = model.matrix(~0 + group  
                               + PC1 + PC2
                               #+ PC6 
                              #+ PC7 + PC8
                               ,data=Cell_exp_pca_res)
colnames(design_LM22_pc)[1:2] =c("MUT","WT")

# fit the linear model
all.equal(colnames(exp),rownames(design_LM22_pc))
fit_LM22_pc = lmFit(exp,design_LM22_pc) ## with PC6 in the design, limma reported: Coefficients not estimable: PC6

## PC1 and PC2 explain almost all of the variance in the data (99.8%)
# create contrast matrix
contMatrix_LM22_pc = makeContrasts(WT-MUT,
                                levels=design_LM22_pc)
fit2_LM22_pc = contrasts.fit(fit_LM22_pc,contMatrix_LM22_pc)
fit2_LM22_pc = eBayes(fit2_LM22_pc)

DE_LM22_pc = topTable(fit2_LM22_pc, coef=1, num=Inf)
# get the significant DEs
sig_LM22_pc = DE_LM22_pc[DE_LM22_pc$P.Value<0.001&abs(DE_LM22_pc$logFC)>log2(1.5),]

intersect(row.names(sig_LM22_pc),row.names(sig_NA))
head(sig_LM22_pc)
head(sig_NA)

```


### GSE4230 Lim1 KO
In 2006, Chen et al. reported that mutants of the homeobox gene Lim1 show a nephron-deficient kidney phenotype.

Chen YT, Kobayashi A, Kwan KM, Johnson RL et al. Gene expression profiles in developing nephrons using Lim1 metanephric mesenchyme-specific conditional mutant mice. BMC Nephrol 2006 Feb 7;7:1. PMID: 16464245

Summary (by the authors): Analysis of embryonic day 14.5 and 18.5 kidneys from Lim1 conditional mutants. Lim1 is a homeobox gene that is essential for nephrogenesis. The mutant is ablated for Lim1 in the metanephric mesenchyme, resulting in nephron deficient kidneys. Results identify genes expressed in developing nephrons.

The dataset has two mutant and two wild-type kidney gene expression profiles at each stage.
GEO accession number: GSE4230
Strain: 129 / C57BL/6 / SJL/J mixed

I loaded the GEO data, obtained the gene names from the mouse4302.db annotation package, and aggregated the expression values by gene name (mean of the entries sharing a gene name).


```{r GSE4230 loading data, cache=TRUE,fig.align="center", fig.width=6,fig.height=6}
gse.es <- getGEO(filename="~/Documents/E14_MouseKidney/GSE4230/GSE4230_series_matrix.txt.gz", GSEMatrix = TRUE, getGPL = FALSE)
e <- exprs(gse.es)
show(pData(phenoData(gse.es))[,1:2])
## they do not have the phenodata

pData(phenoData(gse.es))$age <-c(rep("E14.5",4),rep("E18.5",4))
pData(phenoData(gse.es))$group <-c(rep("WT",2),rep("MUT",2),rep("WT",2),rep("MUT",2))

data <- data.frame(exprs(gse.es))
Annot <- data.frame(ACCNUM=sapply(contents(mouse4302ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(mouse4302SYMBOL), paste, collapse=", "), DESC=sapply(contents(mouse4302GENENAME), paste, collapse=", "))
all <- merge(Annot, data, by.x=0, by.y=0, all=T)
head(all)
SYMBOL<-all$SYMBOL
aggdata_all <-aggregate(all[,c(5:8)], by=list(SYMBOL),FUN=mean, na.rm=TRUE)  

length(intersect(all$SYMBOL,kidney.markers$gene))
```

I estimated cell subtype proportions using CIBERSORT, with 1000 permutations for significance testing.

```{r GSE4230 loading CIBERSORT result,cache=TRUE,fig.align="center", fig.width=6,fig.height=6}
Est_CellType <-read.csv(file="~/Documents/E14_MouseKidney/GSE4230/CIBERSORT.Output_Job6.csv",header = T)
row.names(Est_CellType) <-Est_CellType[,1]
Est_CellType<-Est_CellType[,-1]
group <-c(rep("WT",2),rep("MUT",2))

pdata <-cbind(Est_CellType[,1:17],group)
pdata$sample <- row.names(pdata)
dat.m<-melt(pdata[,c(1:17,19)],id.vars = 'sample')

a <-c(rep("pink",2),rep("lightblue",2))
p5 <- ggplot(dat.m, aes(sample, value, fill = variable)) +
  geom_bar(position = "fill", stat = "identity") +
labs(x="", y="Percentage")+
  theme(axis.text=element_text(angle = 45, size=6))+
  theme(axis.text.x=element_text(colour = a))+
ggtitle("GSE4230: WT-pink,MUT-lightblue, E14.5 Lim1 conditional mutant kidneys") +
  theme(plot.title = element_text(hjust = 0, size=6))+ theme(legend.position="none")
p5
```

Cell subtype proportion differences between WT and mutant. I tested the difference in each cell subtype proportion with a t-test (R's `t.test`, which defaults to Welch's unequal-variance test).

```{r GSE4230 cell subtype stats,cache=TRUE,fig.align="center", fig.width=6,fig.height=6}
classes1=as.factor(pdata$group)
## Note: only the first 10 cell-type columns are tested here
test<-pdata[,1:10]
stats_val <-matrix(ncol=4,nrow=10)
for(i in 1:10){
  stats_val[i,1] <-mean((as.double(test[which(classes1=="WT"),i])))
  stats_val[i,2] <-mean((as.double(test[which(classes1=="MUT"),i])))
  stats_val[i,3] <-mean((as.double(test[which(classes1=="WT"),i])))-mean((as.double(test[which(classes1=="MUT"),i])))
  stats_val[i,4] <-t.test(as.double(test[which(classes1=="WT"),i]),as.double(test[which(classes1=="MUT"),i]))$p.value
}
row.names(stats_val)<-colnames(pdata)[1:10]
colnames(stats_val)<-c("Mean_WT","Mean_MUT","Diff(mean, WT-MUT)","pvalue(t.test)")
stats_val
```

### Comparing cell subtype proportions between different studies


```{r comparison between WTs, cache=TRUE,fig.align="center", fig.width=6,fig.height=6}
load("~/Documents/E14_MouseKidney//kidney_markers.Robj")
load("~/Documents/E14_MouseKidney/kidney_mimic.Robj")

GSE45845cells <-read.csv(file="~/Documents/E14_MouseKidney/GSE45845/CIBERSORT.Output_Job1.csv",header = T)
row.names(GSE45845cells) <-GSE45845cells[,1]
GSE45845cells<-GSE45845cells[,-1]
GSE45845cells$group <-c("WT","MUT","WT","MUT","WT","MUT")

GSE4230cells <-read.csv(file="~/Documents/E14_MouseKidney/GSE4230/CIBERSORT.Output_Job6.csv",header = T)
row.names(GSE4230cells) <-GSE4230cells[,1]
GSE4230cells<-GSE4230cells[,-1]
GSE4230cells$group <-c(rep("WT",2),rep("MUT",2))

merged_cell <-rbind(GSE45845cells,GSE4230cells)

j<-hclust(dist(merged_cell[,1:17]))
col<-as.factor(merged_cell$group)
#ColorDendrogram(j,y=col,main="Estimated cell subtype proportions",branchlength=3,labes=row.names(merged_cell)[1:17])

colorCodes <- c("WT"="blue","MUT"="red")
hc.cols <- hclust(dist(merged_cell[,1:17]), "ward.D2")
dend <- as.dendrogram(hc.cols)
labels_colors(dend) <- colorCodes[merged_cell$group][order.dendrogram(dend)]
plot(dend,main="Estimated cell subtype proportions")
```