Last updated: 2018-08-03
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180731)
The command set.seed(20180731)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 2689835
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Ignored: inst/.DS_Store
Ignored: inst/doc/.DS_Store
Ignored: vignettes/.DS_Store
Untracked files:
Untracked: Clustering_analyses_figure4_islets_sv1_3.pdf
Untracked: docs/figure/hidden_heterogeneity_glioblastoma.Rmd/
Untracked: docs/figure/tSNE_post_IA-SVA_3celltypes.Rmd/
Untracked: output/CC_genes.long.txt
Untracked: output/CC_genes.short.txt
Untracked: output/Clustering_analyses_figure3_sv1.pdf
Untracked: output/Lawlor_Islets_3Cells_CellView_Seurat_FigS.pdf
Untracked: output/Lawlor_Islets_3Cells_IASVA_SV1SV3_rsqcutoff0.3_pheatmap_iasvaV0.95_Figure4_C.pdf
Untracked: output/Lawlor_Islets_3Cells_IASVA_SV4_rsqcutoff0.3_pheatmap_iasvaV0.95.pdf
Untracked: output/Lawlor_Islets_3Cells_IASVA_pairs4SVs_iasvaV0.95_black_FigS6.pdf
Untracked: output/Lawlor_Islets_3Cells_IASVA_pairs4SVs_iasvaV0.95_color_FigS6.pdf
Untracked: output/Lawlor_Islets_3Cells_SV1_SV3_Cell_Type_Genes_rsqcutoff0.3.txt
Untracked: output/Lawlor_Islets_3Cells_SV4_Genes_rsqcutoff0.3.txt
Untracked: output/Lawlor_Islets_3Cells_tSNE_IA-SVA_Fig4AB.pdf
Untracked: output/Patel_Glioblastoma_MGH30_CellCycle_Figure3ABCD.pdf
Untracked: output/Patel_Glioblastoma_MGH30_Cellcycle_SV1_Genes_rsqcutoff0.3.txt
Untracked: output/Patel_Glioblastoma_MGH30_Cellcycle_SV1_Genes_rsqcutoff0.4.txt
Untracked: output/Patel_Glioblastoma_MGH30_iasva_SV1_genes_rsqcutoff0.3_pheatmap_iasvaV0.95_Figure3F.pdf
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Here, we used single cell RNA sequencing (scRNA-Seq) data with strong confounding variables, which is also obtained from human pancreatic islet samples (Xin et. al., 2016). This dataset is included in an R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the ‘iasvaExamples’ package, follow the instructions provided in the GitHub page.
rm(list=ls())
library(sva)
Loading required package: mgcv
Loading required package: nlme
This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
Loading required package: genefilter
Loading required package: BiocParallel
library(Seurat)
Loading required package: ggplot2
Loading required package: cowplot
Attaching package: 'cowplot'
The following object is masked from 'package:ggplot2':
ggsave
Loading required package: Matrix
library(iasva)
library(iasvaExamples)
library(dbscan)
library(irlba)
library(Rtsne)
library(pheatmap)
library(corrplot)
corrplot 0.84 loaded
library(DescTools) #pcc i.e., Pearson's contingency coefficient
Attaching package: 'DescTools'
The following object is masked from 'package:genefilter':
AUC
library(RColorBrewer)
library(cluster)
library(SummarizedExperiment)
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:Matrix':
colMeans, colSums, rowMeans, rowSums, which
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind,
colMeans, colnames, colSums, dirname, do.call, duplicated,
eval, evalq, Filter, Find, get, grep, grepl, intersect,
is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which, which.max,
which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:Matrix':
expand
The following object is masked from 'package:base':
expand.grid
Loading required package: IRanges
Attaching package: 'IRanges'
The following object is masked from 'package:nlme':
collapse
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: 'matrixStats'
The following objects are masked from 'package:Biobase':
anyMissing, rowMedians
The following objects are masked from 'package:genefilter':
rowSds, rowVars
Attaching package: 'DelayedArray'
The following objects are masked from 'package:matrixStats':
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following objects are masked from 'package:base':
aperm, apply
color.vec <- brewer.pal(8, "Set1")
tol21rainbow= c("#771155", "#AA4488", "#CC99BB", "#114477", "#4477AA",
"#77AADD", "#117777", "#44AAAA", "#77CCCC", "#117744",
"#44AA77", "#88CCAA", "#777711", "#AAAA44", "#DDDD77",
"#774411", "#AA7744", "#DDAA77", "#771122", "#AA4455",
"#DD7788")
# Normalization.
normalize <- function(counts)
{
normfactor <- colSums(counts)
return(t(t(counts)/normfactor)*median(normfactor))
}
data("Xin_Islet_scRNAseq_Read_Counts")
data("Xin_Islet_scRNAseq_Annotations")
ls()
[1] "color.vec" "normalize"
[3] "tol21rainbow" "Xin_Islet_scRNAseq_Annotations"
[5] "Xin_Islet_scRNAseq_Read_Counts"
counts <- Xin_Islet_scRNAseq_Read_Counts
anns <- Xin_Islet_scRNAseq_Annotations
dim(anns)
[1] 1600 9
dim(counts)
[1] 26542 1600
Lib_Size <- colSums(counts)
plot(sort(Lib_Size))
hist(Lib_Size)
summary(Lib_Size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
199678 773969 1002044 1104395 1288619 7369192
The annotations describing the islet samples and experimental settings are stored in “anns” and the read counts information is stored in “counts”.
dim(counts)
[1] 26542 1600
dim(anns)
[1] 1600 9
anns <- droplevels(anns)
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.7040829
filter = apply(counts, 1, function(x) length(x[x>5])>=3)
counts = counts[filter,]
dim(counts)
[1] 19169 1600
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.5913386
summary(anns)
X Age Cell_Type Condition
SRR3541303: 1 Min. :23.00 alpha:946 non-diabetic:651
SRR3541304: 1 1st Qu.:37.00 beta :503 T2D :949
SRR3541305: 1 Median :42.00 delta: 58
SRR3541306: 1 Mean :43.89 PP : 93
SRR3541307: 1 3rd Qu.:55.00
SRR3541308: 1 Max. :68.00
(Other) :1594
Donor_ID Ethnicity Gender Num_Expressed_Genes T2D 4 :306 AA:369 female:658 Min. :1745
T2D 6 :224 AI: 45 male :942 1st Qu.:4412
T2D 5 :176 C :593 Median :5158
Non T2D 12:144 H :593 Mean :5185
T2D 1 : 91 3rd Qu.:5902
T2D 2 : 83 Max. :9731
(Other) :576
Mitochondrial_Fraction Min. : 0.2812
1st Qu.: 4.1449
Median : 6.3422
Mean : 7.4164
3rd Qu.:10.1241
Max. :22.8479
Patient_ID <- anns$Donor_ID
Gender <- anns$Gender
Age <- anns$Age
Cell_Type <- anns$Cell_Type
Phenotype <- anns$Condition
Ethnicity <- anns$Ethnicity
Mito_Frac <- anns$Mitochondrial_Fraction
table(Cell_Type, Patient_ID)
Patient_ID
Cell_Type Non T2D 1 Non T2D 10 Non T2D 11 Non T2D 12 Non T2D 2 Non T2D 3 alpha 56 20 49 92 4 29 beta 13 26 12 50 13 10 delta 5 2 1 2 2 3 PP 3 1 1 0 5 2 Patient_ID Cell_Type Non T2D 4 Non T2D 5 Non T2D 6 Non T2D 7 Non T2D 8 Non T2D 9 alpha 15 26 38 8 14 26 beta 11 24 7 17 17 7 delta 0 5 4 0 3 1 PP 5 2 1 4 11 4 Patient_ID Cell_Type T2D 1 T2D 2 T2D 3 T2D 4 T2D 5 T2D 6 alpha 39 50 33 152 161 134 beta 12 18 29 144 12 81 delta 4 5 4 8 1 8 PP 36 10 3 2 2 1
ContCoef(table(Cell_Type, Patient_ID))
[1] 0.4832611
table(Cell_Type, Gender)
Gender
Cell_Type female male alpha 358 588 beta 244 259 delta 24 34 PP 32 61
ContCoef(table(Cell_Type, Gender))
[1] 0.1033314
table(Cell_Type, Age)
Age
Cell_Type 23 24 27 29 31 32 37 41 42 43 51 55 56 57 60 68 alpha 85 26 26 8 49 4 50 152 161 20 134 33 107 39 14 38 beta 23 7 24 17 12 13 18 144 12 26 81 29 61 12 17 7 delta 8 1 5 0 1 2 5 8 1 2 8 4 2 4 3 4 PP 5 4 2 4 1 5 10 2 2 1 1 3 5 36 11 1
ContCoef(table(Cell_Type, Age))
[1] 0.4782088
table(Cell_Type, Phenotype)
Phenotype
Cell_Type non-diabetic T2D alpha 377 569 beta 207 296 delta 28 30 PP 39 54
ContCoef(table(Cell_Type, Phenotype))
[1] 0.03317408
table(Cell_Type, Ethnicity)
Ethnicity
Cell_Type AA AI C H alpha 213 14 384 335 beta 99 17 150 237 delta 16 3 22 17 PP 41 11 37 4
ContCoef(table(Cell_Type, Ethnicity))
[1] 0.2517959
ContCoef(table(Patient_ID, Gender))
[1] 0.7071068
ContCoef(table(Patient_ID, Age))
[1] 0.9682458
ContCoef(table(Patient_ID, Phenotype))
[1] 0.7071068
ContCoef(table(Patient_ID, Ethnicity))
[1] 0.8660254
ContCoef(table(Gender, Age))
[1] 0.6803731
ContCoef(table(Gender, Phenotype))
[1] 0.1724809
ContCoef(table(Gender, Ethnicity))
[1] 0.4526994
ContCoef(table(Age, Phenotype))
[1] 0.7071068
ContCoef(table(Age, Ethnicity))
[1] 0.8569589
ContCoef(table(Phenotype, Ethnicity))
[1] 0.4785346
raw.counts <- counts
summary(colSums(counts))
Min. 1st Qu. Median Mean 3rd Qu. Max. 199628 773880 1001875 1104286 1288589 7369004
counts <- normalize(counts)
summary(colSums(counts))
Min. 1st Qu. Median Mean 3rd Qu. Max. 1001875 1001875 1001875 1001875 1001875 1001875
Note that the orignial cell assignments are highly correlated with known factors.
It is well known that the number of detected genes in each cell explains a very large portion of variability in scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the number of detected genes (e.g., r > 0.9). Here, we calculate the number of detected genes for islet cells, which will be used as an known factor in the IA-SVA analyses.
Num_Detected_Genes <- colSums(counts>0)
summary(Num_Detected_Genes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3261 6890 7832 7834 8766 12067
barplot(Num_Detected_Genes, xlab="Cell", las=2, ylab = "Number of detected genes")
lcounts <- log(counts + 1)
# PCA
pca.res = irlba(lcounts - rowMeans(lcounts), 5)$v
cor(Num_Detected_Genes, pca.res[,1])
[1] 0.8906152
dim(lcounts)
[1] 19169 1600
For comparison purposes,we applied tSNE on read counts of all genes. We used the Rtsne R package with default settings. Genes are color coded wrt their expression of marker genes.
set.seed(34544532)
tsne.res.all <- Rtsne(t(lcounts), dims = 2)
plot(tsne.res.all$Y, main="tSNE", xlab="tSNE Dim1",
ylab="tSNE Dim2",pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
par(mfrow=c(2,2))
plot(tsne.res.all$Y, main="Gender", xlab="Dimension 1",
ylab="Dimension 2", pch=21, col=color.vec[Gender],
bg=color.vec[Gender], oma=c(4,4,6,12))
legend("topleft", levels(Gender), border="white",
fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Patient ID", xlab="Dimension 1",
ylab="Dimension 2", pch=21, col=tol21rainbow[Patient_ID],
bg=tol21rainbow[Patient_ID], oma=c(4,4,6,12))
legend("topleft", levels(Patient_ID), border="white",
fill=tol21rainbow, bty="n", cex=0.5)
plot(tsne.res.all$Y, main="Ethnicity", xlab="Dimension 1",
ylab="Dimension 2", pch=21, col=color.vec[Ethnicity],
bg=color.vec[Ethnicity], oma=c(4,4,6,12))
legend("topleft", levels(Ethnicity), border="white",
fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Phenotype", xlab="Dimension 1",
ylab="Dimension 2", pch=21, col=color.vec[Phenotype],
bg=color.vec[Phenotype], oma=c(4,4,6,12))
legend("topleft", levels(Phenotype), border="white", fill=color.vec, bty="n")
par(mfrow=c(1,1))
Known factors deteriorate the performance of t-SNE.
# specify gene number to select for
gene_num <- 1000
# calcuclate dispersion
row.var <- apply(lcounts,1,sd)**2
row.mean <- apply(lcounts,1,mean)
dispersion <- row.var/row.mean
# generate sequence of bins
bins <- seq(from = min(row.mean), to = max(row.mean), length.out = 20)
# sort mean expression data into the bins
bin.exp <- row.mean
# sort the values
bin.sort <- sort(bin.exp, decreasing = FALSE)
# vector of bin assignment
cuts <- cut(x = bin.exp, breaks = bins, labels = FALSE)
# find which are NA and change to zero
na.ids <- which(is.na(cuts) == TRUE)
cuts[na.ids] <- 0
# create an empty vector for overdispersion
overdispersion <- NULL
# for each gene and bin index, calculate median, mad, and then normalized dispersion
# first loop through length of bins found
for (k in 1:length(names(table(cuts)))) {
# find index of bins
bin.id <- which(cuts == names(table(cuts))[k])
# median of all genes in the bin
median.bin <- median(dispersion[bin.id], na.rm = TRUE)
# calculate mad (median absolute deviation)
mad.bin <- mad(dispersion[bin.id])
# calculate norm dispersion for each gene
for (m in 1:length(bin.id)) {
norm.dispersion <- abs(dispersion[bin.id[m]] - median.bin)/mad.bin
overdispersion <- c(overdispersion, norm.dispersion)
}
}
# remove nans
overdis.na <- which(is.na(overdispersion) == TRUE)
if (length(overdis.na) > 0) {
overdis.filt <- overdispersion[-overdis.na]
} else {
overdis.filt <- overdispersion
}
# plot mean expression vs overdisperssion
ids <- which(names(overdis.filt) %in% names(row.mean))
plot(row.mean[ids], overdis.filt)
# Do t-sne using top over-dispersed genes (apply mean expression filter too)
rank.ov <- order(overdis.filt, decreasing = TRUE)
ov.genes <- names(overdis.filt[rank.ov[1:gene_num]])
log.sel <- lcounts[ov.genes,]
all1 <- t(log.sel)
# Remove groups that are all zeros
df <- all1[,apply(all1, 2, var, na.rm=TRUE) != 0]
set.seed(45344)
rtsne_out <- Rtsne(as.matrix(df), dims = 3)
# Set rownames of matrix to tsne matrix
rownames(rtsne_out$Y) <- rownames(df)
plot(rtsne_out$Y[,c(1,2)], main="CellView", xlab="Dimension 1",
ylab="Dimension 2",pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
## Run Seurat to cluster islet cells
set.seed(12344)
seurat.obj <- CreateSeuratObject(raw.data=counts,
min.cells=3, min.genes=200, project="Seurat_Comp")
names(Patient_ID) <- rownames(seurat.obj@meta.data)
seurat.obj <- AddMetaData(object = seurat.obj,
metadata = Patient_ID, col.name = "patient.id")
# Normalizing the data
seurat.obj <- NormalizeData(object = seurat.obj, normalization.method = "LogNormalize",
scale.factor = 10000)
# Detection of variable genes across the single cells
seurat.obj <- FindVariableGenes(object = seurat.obj, mean.function = ExpMean, dispersion.function = LogVMR,
x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
length(x = seurat.obj@var.genes) #6639
[1] 6639
# Scaling the data and removing unwanted sources of variation
seurat.obj <- ScaleData(object = seurat.obj, vars.to.regress = c("nGene", "patient.id"))
Regressing out: nGene, patient.id
Time Elapsed: 31.3347609043121 secs
Scaling data matrix
# Perform linear dimensional reduction
seurat.obj <- RunPCA(object = seurat.obj, pc.genes = seurat.obj@var.genes, do.print = TRUE, pcs.print = 1:5,
genes.print = 5)
[1] "PC1"
[1] "TM4SF4" "GC" "GLS" "FAP" "PAPPA2"
[1] ""
[1] "HADH" "TIMP2" "SYNE2" "ADCYAP1" "PCSK1"
[1] ""
[1] ""
[1] "PC2"
[1] "ADCYAP1" "HADH" "ERO1LB" "PFKFB2" "NPTX2"
[1] ""
[1] "IFITM3" "SPARC" "LGALS1" "COL1A2" "COL1A1"
[1] ""
[1] ""
[1] "PC3"
[1] "COL1A2" "COL3A1" "BGN" "SPARC" "COL15A1"
[1] ""
[1] "CFTR" "MMP7" "LCN2" "KRT19" "PDZK1IP1"
[1] ""
[1] ""
[1] "PC4"
[1] "SLITRK6" "CALCRL" "THSD7A" "CALB1" "ETV1"
[1] ""
[1] "PFKFB2" "ANXA2" "ROBO2" "RIN2" "ENO1"
[1] ""
[1] ""
[1] "PC5"
[1] "ETV1" "CALB1" "SLITRK6" "SPOCK1" "THSD7A"
[1] ""
[1] "PODXL" "CD93" "FLT1" "PLVAP" "ESAM"
[1] ""
[1] ""
# Determine statistically significant principal components
if(FALSE){
seurat.obj <- JackStraw(object = seurat.obj, num.replicate = 100, do.print = FALSE)
JackStrawPlot(object = seurat.obj, PCs = 1:20)
PCElbowPlot(object = seurat.obj)
# Cluster the cells
seurat.obj <- FindClusters(object = seurat.obj, reduction.type = "pca", dims.use = 1:10,
resolution = 0.6, print.output = 0, save.SNN = TRUE)
PrintFindClustersParams(object = seurat.obj)
}
set.seed(12344454)
seurat.obj <- RunTSNE(object = seurat.obj, dims.use = 1:10, do.fast = TRUE)
TSNEPlot(object = seurat.obj)
# with true cell types
plot(seurat.obj@dr$tsne@cell.embeddings[,c(1,2)],
main="Spectral tSNE (Seurat)", xlab="Dimension 1",
ylab="Dimension 2",pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
## Run PCA to cluster islet cells. Here, we applied PCA on read counts of all genes. Genes are color coded with their expression of marker genes.
pairs(pca.res[,1:4], main="PCA", pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
Here, for comparison purposes we conduct SVA on our example data while adjusting for Patient_ID and Geo_Lib_Size and obtained 4 hidden factors.
set.seed(34544455)
mod1 <- model.matrix(~Patient_ID+Num_Detected_Genes)
mod0 <- cbind(mod1[,1])
sva.res = svaseq(counts,mod1,mod0, n.sv=4)$sv
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
pairs(sva.res[,1:4], main="SVA", pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
Here, we run IA-SVA using Patient_ID and Geo_Lib_Size as known factors and identify 4 hidden factors.
set.seed(345466666)
mod <- model.matrix(~Patient_ID+Num_Detected_Genes)
summ_exp <- SummarizedExperiment(assays = counts)
iasva.res<- iasva(summ_exp, mod[,-1],verbose=FALSE, permute=FALSE, num.sv=4)
IA-SVA running...
SV 1 Detected!
SV 2 Detected!
SV 3 Detected!
SV 4 Detected!
# of significant surrogate variables: 4
iasva.sv <- iasva.res$sv
## no color
pairs(iasva.sv[,1:4], pch=21, col="black", bg="black", cex=0.8)
## with color-coding
pairs(iasva.sv[,1:4], main="IA-SVA", pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
cor(iasva.sv)
SV1 SV2 SV3 SV4
SV1 1.00000000 0.03808617 0.06200799 0.2507764
SV2 0.03808617 1.00000000 -0.15724552 -0.1830763
SV3 0.06200799 -0.15724552 1.00000000 0.2474547
SV4 0.25077642 -0.18307630 0.24745468 1.0000000
corrplot(cor(iasva.sv))
Here, try different R2 cutoffs
no.clusters <- 4
C.scores <- matrix(0,0,0)
Number.of.genes <- matrix(0,0,0)
for (i in seq(0.1,0.9,0.05)){
marker.counts <- find_markers(summ_exp, as.matrix(iasva.sv[,2]), rsq.cutoff = i)
no.genes <- dim(marker.counts)[1]
if(no.genes == 0){
break
}
else{
my.dist <- dist(t(log(marker.counts+1)))
my.clustering <- hclust(my.dist, method = "ward.D2")
my.silhoutte <-silhouette(cutree(my.clustering,no.clusters),my.dist)
C1 <- mean(my.silhoutte[my.silhoutte[,1]==1,3])
C2 <- mean(my.silhoutte[my.silhoutte[,1]==2,3])
average.C <- (C1+C2)/2
C.scores <- c(C.scores, average.C)
Number.of.genes <- c(Number.of.genes,no.genes)
}
}
# of markers (): 2204
total # of unique markers: 2204
# of markers (): 1326
total # of unique markers: 1326
# of markers (): 839
total # of unique markers: 839
# of markers (): 528
total # of unique markers: 528
# of markers (): 329
total # of unique markers: 329
# of markers (): 190
total # of unique markers: 190
# of markers (): 111
total # of unique markers: 111
# of markers (): 57
total # of unique markers: 57
# of markers (): 30
total # of unique markers: 30
# of markers (): 15
total # of unique markers: 15
# of markers (): 2
total # of unique markers: 2
# of markers (): 0
total # of unique markers: 0
output.matrix <- rbind(C.scores, Number.of.genes)
pdf("output/Clustering_analyses_figure4_Xin.pdf")
par(mar=c(5,5,5,5))
end.point <- (length(C.scores)-1)*0.05+0.1
plot(Number.of.genes, xlab = "R^2", ylab = "Number genes selected", xaxt="n", main = "Number of selected genes vs. Cluster quality", pch = 18, col ="blue",type="b", lty=2, cex=2)
Axis(1,at=seq(1,length(Number.of.genes)), side = 1, labels=seq(0.1,end.point,0.05),las = 2)
par(new=T)
plot(C.scores, xlab='', ylab='', axes=F, pch = 18 , col ="red",type="b", lty=2, cex=2)
Axis(side=4)
mtext(side = 4, line = 2, 'Average Silhoutte Score', col = "red")
par(new=F)
dev.off()
quartz_off_screen
2
# try different R2 thresholds
pdf("output/Clustering_analyses_figure4_Xin_sv1.pdf")
r2.results <- study_R2(summ_exp, iasva.sv,selected.svs=1, no.clusters=2)
# of markers (): 369
total # of unique markers: 369
# of markers (): 198
total # of unique markers: 198
# of markers (): 123
total # of unique markers: 123
# of markers (): 87
total # of unique markers: 87
# of markers (): 57
total # of unique markers: 57
# of markers (): 44
total # of unique markers: 44
# of markers (): 37
total # of unique markers: 37
# of markers (): 26
total # of unique markers: 26
# of markers (): 20
total # of unique markers: 20
# of markers (): 14
total # of unique markers: 14
# of markers (): 12
total # of unique markers: 12
# of markers (): 8
total # of unique markers: 8
# of markers (): 7
total # of unique markers: 7
# of markers (): 5
total # of unique markers: 5
# of markers (): 4
total # of unique markers: 4
# of markers (): 1
total # of unique markers: 1
# of markers (): 0
total # of unique markers: 0
dev.off()
quartz_off_screen
2
marker.counts <- find_markers(summ_exp, as.matrix(iasva.sv[,c(1)]), rsq.cutoff = 0.3)
# of markers (): 57
total # of unique markers: 57
nrow(marker.counts)
[1] 57
marker.counts.long <- find_markers(summ_exp, as.matrix(iasva.sv[,c(1)]), rsq.cutoff = 0.2)
# of markers (): 123
total # of unique markers: 123
nrow(marker.counts.long)
[1] 123
anno.col <- data.frame(Cell_Type=Cell_Type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
Cell_Type
SRR3541303 beta
SRR3541304 beta
SRR3541305 beta
SRR3541306 beta
SRR3541307 beta
SRR3541308 beta
cell.type.col <- color.vec[1:4]
names(cell.type.col) <- c("alpha","beta","delta","PP")
anno.colors <- list(Cell_Type=cell.type.col)
pheatmap(log(marker.counts+1), show_colnames =FALSE, show_rownames = TRUE,
clustering_method = "ward.D2",cutree_cols = 4,
annotation_col = anno.col, annotation_colors=anno.colors)
marker.counts.SV3 <- find_markers(summ_exp, as.matrix(iasva.sv[,c(3)]), rsq.cutoff = 0.3)
# of markers (): 54
total # of unique markers: 54
nrow(marker.counts.SV3)
[1] 54
marker.counts.SV3.long <- find_markers(summ_exp, as.matrix(iasva.sv[,c(3)]), rsq.cutoff = 0.2)
# of markers (): 138
total # of unique markers: 138
nrow(marker.counts.SV3.long)
[1] 138
anno.col <- data.frame(Cell_Type=Cell_Type, SV3=iasva.sv[,3])
rownames(anno.col) <- colnames(marker.counts.SV3)
head(anno.col)
Cell_Type SV3
SRR3541303 beta 0.0070956379
SRR3541304 beta -0.0006460813
SRR3541305 beta -0.0116911796
SRR3541306 beta -0.0044453589
SRR3541307 beta 0.0008762206
SRR3541308 beta -0.0080419417
cell.type.col <- color.vec[1:4]
names(cell.type.col) <- c("alpha","beta","delta","PP")
anno.colors <- list(Cell_Type=cell.type.col)
pheatmap(log(marker.counts.SV3+1), show_colnames =FALSE, show_rownames = TRUE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col, annotation_colors=anno.colors)
set.seed(75458456)
tsne.res.iasva <- Rtsne(unique(t(log(marker.counts+1))), dims = 2)
plot(tsne.res.iasva$Y[,1:2], main="IA-SVA", xlab="tSNE Dim1",
ylab="tSNE Dim2", pch=21, col=color.vec[Cell_Type],
bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
tSNE performed on marker genes selected via IA-SVA performs better than original tSNE analyses using all genes. This example again reiterates the importance of gene selection using IA-SVA for effective clustering of single-cell datasets.
pdf(file="output/Xin_Islets_All_demensionality_reduction_Figure4DEFG.pdf", width=8, height=9)
par(mfrow=c(2,2))
plot(tsne.res.all$Y, main="tSNE", xlab="Dimension 1", ylab="Dimension 2",pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type])
legend("topleft", levels(Cell_Type), border="white", fill=color.vec, bty="n")
plot(tsne.res.iasva$Y, main="IA-SVA + tSNE", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type])
plot(rtsne_out$Y[,c(1,2)], main="CellView", xlab="Dimension 1", ylab="Dimension 2",pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type])
plot(seurat.obj@dr$tsne@cell.embeddings[,c(1,2)], main="Spectral tSNE (Seurat)", xlab="Dim1", ylab="Dim2",pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type])
dev.off()
quartz_off_screen
2
pdf(file="output/Xin_Islets_AllCells_tSNEByKnownFactors_FigureS9.pdf", width=10, height=11)
par(mfrow=c(2,2))
plot(tsne.res.all$Y, main="Gender", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=color.vec[Gender], bg=color.vec[Gender], oma=c(4,4,6,12))
legend("topleft", levels(Gender), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Patient ID", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=tol21rainbow[Patient_ID], bg=tol21rainbow[Patient_ID], oma=c(4,4,6,12))
legend("topleft", levels(Patient_ID), border="white", fill=tol21rainbow, bty="n", cex=0.5)
plot(tsne.res.all$Y, main="Ethnicity", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=color.vec[Ethnicity], bg=color.vec[Ethnicity], oma=c(4,4,6,12))
legend("topleft", levels(Ethnicity), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Phenotype", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=color.vec[Phenotype], bg=color.vec[Phenotype], oma=c(4,4,6,12))
legend("topleft", levels(Phenotype), border="white", fill=color.vec, bty="n")
par(mfrow=c(1,1))
dev.off()
quartz_off_screen
2
pdf(file="output/Xin_Islets_AllCells_IASVA_nocolor.pdf", width=4, height=4)
pairs(iasva.sv[,1:4], pch=21, col="black", bg="black", cex=0.3)
dev.off()
quartz_off_screen
2
pdf(file="output/Xin_Islets_AllCells_IASVA.pdf", width=4, height=4)
pairs(iasva.sv[,1:4], pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.3)
dev.off()
quartz_off_screen
2
pdf(file="output/Xin_Islets_AllCells_PCA.pdf", width=4, height=4)
pairs(pca.res[,1:4], pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.3)#
dev.off()
quartz_off_screen
2
colnames(sva.res) <- paste0("SV",1:4)
pdf(file="output/Xin_Islets_AllCells_USVA.pdf", width=4, height=4)
pairs(sva.res[,1:4], pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.3) #4,4,6,12
dev.off()
quartz_off_screen
2
anno.col <- data.frame(Cell_Type=Cell_Type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
Cell_Type
SRR3541303 beta
SRR3541304 beta
SRR3541305 beta
SRR3541306 beta
SRR3541307 beta
SRR3541308 beta
cell.type.col <- color.vec[1:4]
names(cell.type.col) <- c("alpha","beta","delta","PP")
anno.colors <- list(Cell_Type=cell.type.col)
pheatmap(log(marker.counts+1), show_colnames =FALSE, show_rownames = TRUE,
clustering_method = "ward.D2",cutree_cols = 4,annotation_col = anno.col, annotation_colors=anno.colors,
filename="output/FigureS11_Xin_Islets_AllCells_IASVA_Markers_pheatmap.pdf", width=10, height=10)
write.table(as.data.frame(rownames(marker.counts)), file="output/Xin_Islets_AllCells_SV1_Genes_rsqcutoff0.3.txt", quote=F,
row.names=F, col.names=F, sep=" ")
write.table(as.data.frame(rownames(marker.counts.long)), file="output/Xin_Islets_AllCells_SV1_Genes_rsqcutoff0.2.txt", quote=F,
row.names=F, col.names=F, sep=" ")
write.table(as.data.frame(rownames(marker.counts.SV3)), file="output/Xin_Islets_AllCells_SV3_Genes_rsqcutoff0.3.txt", quote=F,
row.names=F, col.names=F, sep=" ")
write.table(as.data.frame(rownames(marker.counts.SV3.long)), file="output/Xin_Islets_AllCells_SV3_Genes_rsqcutoff0.2.txt", quote=F,
row.names=F, col.names=F, sep=" ")
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] SummarizedExperiment_1.10.1 DelayedArray_0.6.1
[3] matrixStats_0.53.1 Biobase_2.40.0
[5] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
[7] IRanges_2.14.10 S4Vectors_0.18.3
[9] BiocGenerics_0.26.0 cluster_2.0.7-1
[11] RColorBrewer_1.1-2 DescTools_0.99.24
[13] corrplot_0.84 pheatmap_1.0.10
[15] Rtsne_0.13 irlba_2.3.2
[17] dbscan_1.1-2 iasvaExamples_1.0.0
[19] iasva_0.99.3 Seurat_2.3.2
[21] Matrix_1.2-14 cowplot_0.9.2
[23] ggplot2_2.2.1.9000 sva_3.28.0
[25] BiocParallel_1.14.2 genefilter_1.62.0
[27] mgcv_1.8-23 nlme_3.1-137
loaded via a namespace (and not attached):
[1] reticulate_1.8 R.utils_2.6.0 tidyselect_0.2.4
[4] RSQLite_2.1.1 AnnotationDbi_1.42.1 htmlwidgets_1.2
[7] grid_3.5.0 trimcluster_0.1-2 ranger_0.10.1
[10] munsell_0.4.3 codetools_0.2-15 ica_1.0-2
[13] withr_2.1.2 colorspace_1.3-2 knitr_1.20
[16] rstudioapi_0.7 geometry_0.3-6 ROCR_1.0-7
[19] robustbase_0.93-0 dtw_1.20-1 dimRed_0.1.0
[22] labeling_0.3 lars_1.2 git2r_0.21.0
[25] GenomeInfoDbData_1.1.0 mnormt_1.5-5 bit64_0.9-7
[28] rprojroot_1.3-2 ipred_0.9-6 diptest_0.75-7
[31] R6_2.2.2 VGAM_1.0-5 hdf5r_1.0.0
[34] flexmix_2.3-14 DRR_0.0.3 bitops_1.0-6
[37] assertthat_0.2.0 SDMTools_1.1-221 scales_0.5.0
[40] nnet_7.3-12 gtable_0.2.0 ddalpha_1.3.3
[43] workflowr_1.0.1 timeDate_3043.102 rlang_0.2.1
[46] CVST_0.2-2 scatterplot3d_0.3-41 RcppRoll_0.3.0
[49] splines_3.5.0 lazyeval_0.2.1 ModelMetrics_1.1.0
[52] acepack_1.4.1 broom_0.4.4 checkmate_1.8.5
[55] yaml_2.1.19 reshape2_1.4.3 abind_1.4-5
[58] backports_1.1.2 Hmisc_4.1-1 caret_6.0-80
[61] tools_3.5.0 lava_1.6.1 psych_1.8.4
[64] gplots_3.0.1 proxy_0.4-22 ggridges_0.5.0
[67] Rcpp_0.12.17 plyr_1.8.4 base64enc_0.1-3
[70] zlibbioc_1.26.0 purrr_0.2.5 RCurl_1.95-4.10
[73] rpart_4.1-13 pbapply_1.3-4 zoo_1.8-2
[76] sfsmisc_1.1-2 magrittr_1.5 data.table_1.11.4
[79] manipulate_1.0.1 lmtest_0.9-36 RANN_2.5.1
[82] mvtnorm_1.0-8 whisker_0.3-2 fitdistrplus_1.0-9
[85] evaluate_0.10.1 xtable_1.8-2 XML_3.98-1.11
[88] mclust_5.4 gridExtra_2.3 compiler_3.5.0
[91] tibble_1.4.2 KernSmooth_2.23-15 R.oo_1.22.0
[94] htmltools_0.3.6 segmented_0.5-3.0 Formula_1.2-3
[97] snow_0.4-2 tidyr_0.8.1 expm_0.999-2
[100] tclust_1.4-1 lubridate_1.7.4 DBI_1.0.0
[103] diffusionMap_1.1-0 magic_1.5-8 MASS_7.3-50
[106] fpc_2.1-11 boot_1.3-20 R.methodsS3_1.7.1
[109] gdata_2.18.0 metap_0.9 bindr_0.1.1
[112] gower_0.1.2 igraph_1.2.1 pkgconfig_2.0.1
[115] foreign_0.8-70 recipes_0.1.3 foreach_1.4.4
[118] annotate_1.58.0 XVector_0.20.0 prodlim_2018.04.18
[121] stringr_1.3.1 digest_0.6.15 pls_2.6-0
[124] tsne_0.1-3 rmarkdown_1.9 htmlTable_1.12
[127] kernlab_0.9-26 gtools_3.5.0 modeltools_0.2-21
[130] jsonlite_1.5 bindrcpp_0.2.2 limma_3.36.2
[133] pillar_1.2.3 lattice_0.20-35 DEoptimR_1.0-8
[136] survival_2.42-3 glue_1.2.0 FNN_1.1
[139] png_0.1-7 prabclus_2.2-6 iterators_1.0.9
[142] bit_1.1-14 class_7.3-14 stringi_1.2.2
[145] mixtools_1.1.0 blob_1.1.1 doSNOW_1.0.16
[148] latticeExtra_0.6-28 caTools_1.17.1 memoise_1.1.0
[151] dplyr_0.7.5 ape_5.1
This reproducible R Markdown analysis was created with workflowr 1.0.1