Last updated: 2018-08-03

workflowr checks: (Click a bullet for more information)
Expand here to see past versions:


To make simulation studies more objective, we used a simulation design suggested by the author of svaseq (http://jtleek.com/svaseq/simulateData.html). We slightly modified the original design to simulate realistic single cell RNA sequencing (scRNA-seq) data sets (read counts with a high dropout rate) and multiple correlated hidden factors. We simulated baseline scRNA-seq data using the Polyester R package (https://bioconductor.org/packages/release/bioc/html/polyester.html) by using the zero-inflated negative binomial distribution parameters estimated from a real-world scRNA-seq data obtained from human pancreatic islet samples (Lawlor et. al., 2016). The islet scRNA-seq dataset is included in a R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the ‘iasvaExamples’ package, follow the instruction provided in the GitHub page.

Load packages

rm(list=ls())
library(iasva)
library(iasvaExamples)
library(polyester)
library(sva)
library(corrplot)
library(DescTools) #pcc i.e., Pearson's contingency coefficient
library(RColorBrewer)
library(SummarizedExperiment)
color.vec <- brewer.pal(8, "Set1")

Load the islet single cell RNA-Seq data (n=638 cells, and 26K genes)

data("Lawlor_Islet_scRNAseq_Read_Counts")
data("Lawlor_Islet_scRNAseq_Annotations")
ls()
[1] "color.vec"                         "Lawlor_Islet_scRNAseq_Annotations"
[3] "Lawlor_Islet_scRNAseq_Read_Counts"
counts <- Lawlor_Islet_scRNAseq_Read_Counts
anns <- Lawlor_Islet_scRNAseq_Annotations
dim(anns)
[1] 638  26
dim(counts)
[1] 26542   638
summary(anns)
     run             cell.type             COL1A1          INS       
 Length:638         Length:638         Min.   :1.00   Min.   :1.000  
 Class :character   Class :character   1st Qu.:1.00   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :1.00   Median :1.000  
                                       Mean   :1.03   Mean   :1.414  
                                       3rd Qu.:1.00   3rd Qu.:2.000  
                                       Max.   :2.00   Max.   :2.000  
                                                                     
     PRSS1            SST             GCG            KRT19      
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :1.000   Median :1.000   Median :1.000   Median :1.000  
 Mean   :1.038   Mean   :1.039   Mean   :1.375   Mean   :1.044  
 3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
                                                                
      PPY          num.genes             Cell_ID        UNOS_ID   
 Min.   :1.000   Min.   :3529   10th_C1_S59  :  1   ACCG268 :136  
 1st Qu.:1.000   1st Qu.:5270   10th_C10_S104:  1   ACJV399 :108  
 Median :1.000   Median :6009   10th_C11_S96 :  1   ACEL337 :103  
 Mean   :1.028   Mean   :5981   10th_C13_S61 :  1   ACIW009 : 93  
 3rd Qu.:1.000   3rd Qu.:6676   10th_C14_S53 :  1   ACCR015A: 57  
 Max.   :2.000   Max.   :8451   10th_C16_S105:  1   ACIB065 : 57  
                                (Other)      :632   (Other) : 84  
      Age        Biomaterial_Provider    Gender              Phenotype  
 Min.   :22.00   IIDP      : 45       Female:303   Non-Diabetic   :380  
 1st Qu.:29.00   Prodo Labs:593       Male  :335   Type 2 Diabetic:258  
 Median :42.00                                                          
 Mean   :39.63                                                          
 3rd Qu.:53.00                                                          
 Max.   :56.00                                                          
                                                                        
               Race          BMI          Cell_Type     Patient_ID 
 African American:175   Min.   :22.00   INS    :264   P1     :136  
 Hispanic        :165   1st Qu.:26.60   GCG    :239   P8     :108  
 White           :298   Median :32.95   KRT19  : 28   P3     :103  
                        Mean   :32.85   SST    : 25   P7     : 93  
                        3rd Qu.:35.80   PRSS1  : 24   P5     : 57  
                        Max.   :55.00   none   : 21   P6     : 57  
                                        (Other): 37   (Other): 84  
 Sequencing_Run Batch       Coverage       Percent_Aligned 
 12t    : 57    B1:193   Min.   :1206135   Min.   :0.8416  
 4th    : 57    B2:148   1st Qu.:2431604   1st Qu.:0.8769  
 9th    : 57    B3:297   Median :3042800   Median :0.8898  
 10t    : 56             Mean   :3160508   Mean   :0.8933  
 7th    : 55             3rd Qu.:3871697   3rd Qu.:0.9067  
 3rd    : 53             Max.   :5931638   Max.   :0.9604  
 (Other):303                                               
 Mitochondrial_Fraction Num_Expressed_Genes
 Min.   :0.003873       Min.   :3529       
 1st Qu.:0.050238       1st Qu.:5270       
 Median :0.091907       Median :6009       
 Mean   :0.108467       Mean   :5981       
 3rd Qu.:0.142791       3rd Qu.:6676       
 Max.   :0.722345       Max.   :8451       
                                           
ContCoef(table(anns$Gender, anns$Cell_Type))
[1] 0.225969
ContCoef(table(anns$Phenotype, anns$Cell_Type))
[1] 0.1145096
ContCoef(table(anns$Race, anns$Cell_Type))
[1] 0.3084146
ContCoef(table(anns$Patient_ID, anns$Cell_Type))
[1] 0.5232058
ContCoef(table(anns$Batch, anns$Cell_Type))
[1] 0.3295619

Look at mean variance relationship

plot(rowMeans(log(counts+1)),rowVars(log(counts+1)),pch=19,col=color.vec[2])

Estimate zero inflated negative binomial parameters from the islet data

## Estimate the zero inflated negative binomial parameters
#set.seed(12345)
set.seed(2018)
params = get_params(counts)

Simulate a scRNA-seq data set affected by a known factor and three hidden factors

sample.size <- 50
num.genes <- 10000
prop.kfactor.genes <- 0.1    #known factor
prop.hfactor1.genes <- 0.3   #hidden factor1
prop.hfactor2.genes <- 0.2   #hidden factor2
prop.hfactor3.genes <- 0.1   #hidden factor3

num.kfactor.genes <- num.genes*prop.kfactor.genes
num.hfactor1.genes <- num.genes*prop.hfactor1.genes
num.hfactor2.genes <- num.genes*prop.hfactor2.genes
num.hfactor3.genes <- num.genes*prop.hfactor3.genes

factor.prop <- 0.5
flip.prob <- 0.8 # high corrleation between factors 
#flip.prob <- 0.5 # row correlation between factors

# make first 10% of genes affected by the known factor.
kfactor = c(rep(-1,each=sample.size*factor.prop),rep(1,each=sample.size-(sample.size*factor.prop)))

coinflip = rbinom(sample.size,size=1,prob=flip.prob)
hfactor1 = kfactor*coinflip + -kfactor*(1-coinflip)
coinflip = rbinom(sample.size,size=1,prob=flip.prob)
hfactor2 = kfactor*coinflip + -kfactor*(1-coinflip)
coinflip = rbinom(sample.size,size=1,prob=flip.prob)
hfactor3 = kfactor*coinflip + -kfactor*(1-coinflip)

corrplot.mixed(cor(cbind(kfactor,hfactor1,hfactor2,hfactor3)))

hfactor.mat <- cbind(hfactor1,hfactor2,hfactor3)
kfcoeffs = c(rnorm(num.kfactor.genes),rep(0,num.genes-num.kfactor.genes))
nullindex= (num.kfactor.genes+1):num.genes

hfcoeffs1 <- rep(0,num.genes)
hfcoeffs2 <- rep(0,num.genes)
hfcoeffs3 <- rep(0,num.genes)

## genes affected by each hidden factor is randomly selected.
hfcoeffs1[sample(1:num.genes, num.hfactor1.genes)] <- rnorm(num.hfactor1.genes, sd=1)
hfcoeffs2[sample(1:num.genes, num.hfactor2.genes)] <- rnorm(num.hfactor2.genes, sd=1)
hfcoeffs3[sample(1:num.genes, num.hfactor3.genes)] <- rnorm(num.hfactor3.genes, sd=1)

coeffs = cbind(hfcoeffs1, hfcoeffs2, hfcoeffs3, kfcoeffs)
controls = ((hfcoeffs1!=0)|(hfcoeffs2!=0)|(hfcoeffs3!=0))&(kfcoeffs==0)

par(mfrow=c(5,1))
plot(kfcoeffs)
plot(hfcoeffs1)
plot(hfcoeffs2)
plot(hfcoeffs3)
plot(controls)

par(mfrow=c(1,1))

mod = model.matrix(~-1 + hfactor1 + hfactor2 + hfactor3 + kfactor)

dat0 = create_read_numbers(params$mu,params$fit,
                                     params$p0,beta=coeffs,mod=mod)

sum(dat0==0)/length(dat0)
[1] 0.681446
filter = apply(dat0, 1, function(x) length(x[x>5])>=3)
dat0 = dat0[filter,]
sum(dat0==0)/length(dat0)
[1] 0.52831
controls <- controls[filter]

dim(dat0)
[1] 6580   50
dim(mod)
[1] 50  4

Estimate hidden factors using USVA, SSVA and IA-SVA and compare them

## set null and alternative models
mod1 = model.matrix(~kfactor)
mod0 = cbind(mod1[,1])

## Estimate hidden factors with unsuprvised SVA
usva.res = svaseq(dat0,mod1,mod0)$sv
Number of significant surrogate variables is:  3 
Iteration (out of 5 ):1  2  3  4  5  
cor(hfactor1, usva.res[,1])
[1] -0.9291762
cor(hfactor2, usva.res[,2])
[1] 0.6278626
cor(hfactor3, usva.res[,3])
[1] -0.5382027
## Estimate hidden factors with supervised SVA
ssva.res = svaseq(dat0,mod1,mod0,controls=controls)$sv
sva warning: controls provided so supervised sva is being performed.
Number of significant surrogate variables is:  3 
cor(hfactor1, ssva.res[,1])
[1] -0.9270694
cor(hfactor2, ssva.res[,2])
[1] -0.6297534
cor(hfactor3, ssva.res[,3])
[1] -0.5628941
## Estimate hidden factors with IA-SVA
summ_exp <- SummarizedExperiment(assays = dat0)
iasva.res <- iasva(summ_exp, as.matrix(mod1[,-1]), num.p = 20)$sv # the default number of permutations in SVA = 20
IA-SVA running...

SV 1 Detected!

SV 2 Detected!

SV 3 Detected!

# of significant surrogate variables: 3
cor(hfactor1, iasva.res[,1])
[1] -0.9944336
cor(hfactor2, iasva.res[,2])
[1] 0.9901559
cor(hfactor3, iasva.res[,3])
[1] 0.9797604
hf1.mat <- cbind(hfactor1, usva.res[,1], ssva.res[,1], iasva.res[,1])
hf2.mat <- cbind(hfactor2, usva.res[,2], ssva.res[,2], iasva.res[,2])
hf3.mat <- cbind(hfactor3, usva.res[,3], ssva.res[,3], iasva.res[,3])
colnames(hf1.mat) <- c("HF1", "USVA", "SSVA", "IA-SVA")
colnames(hf2.mat) <- c("HF2", "USVA", "SSVA", "IA-SVA")
colnames(hf3.mat) <- c("HF3", "USVA", "SSVA", "IA-SVA")

par(mfrow=c(2,2))
corrplot.mixed(abs(cor(hf1.mat)), tl.pos="lt")
corrplot.mixed(abs(cor(hf2.mat)), tl.pos="lt")
corrplot.mixed(abs(cor(hf3.mat)), tl.pos="lt")

Session information

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.10.1 DelayedArray_0.6.1         
 [3] matrixStats_0.53.1          Biobase_2.40.0             
 [5] GenomicRanges_1.32.3        GenomeInfoDb_1.16.0        
 [7] IRanges_2.14.10             S4Vectors_0.18.3           
 [9] BiocGenerics_0.26.0         RColorBrewer_1.1-2         
[11] DescTools_0.99.24           corrplot_0.84              
[13] sva_3.28.0                  BiocParallel_1.14.2        
[15] genefilter_1.62.0           mgcv_1.8-23                
[17] nlme_3.1-137                polyester_1.16.0           
[19] iasvaExamples_1.0.0         iasva_0.99.3               

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17           mvtnorm_1.0-8          lattice_0.20-35       
 [4] Biostrings_2.48.0      rprojroot_1.3-2        digest_0.6.15         
 [7] backports_1.1.2        RSQLite_2.1.1          evaluate_0.10.1       
[10] zlibbioc_1.26.0        annotate_1.58.0        irlba_2.3.2           
[13] whisker_0.3-2          blob_1.1.1             R.utils_2.6.0         
[16] R.oo_1.22.0            Matrix_1.2-14          rmarkdown_1.9         
[19] splines_3.5.0          stringr_1.3.1          foreign_0.8-70        
[22] RCurl_1.95-4.10        bit_1.1-14             compiler_3.5.0        
[25] manipulate_1.0.1       htmltools_0.3.6        expm_0.999-2          
[28] GenomeInfoDbData_1.1.0 workflowr_1.0.1        XML_3.98-1.11         
[31] MASS_7.3-50            bitops_1.0-6           R.methodsS3_1.7.1     
[34] grid_3.5.0             xtable_1.8-2           DBI_1.0.0             
[37] git2r_0.21.0           magrittr_1.5           stringi_1.2.2         
[40] XVector_0.20.0         limma_3.36.2           boot_1.3-20           
[43] tools_3.5.0            bit64_0.9-7            logspline_2.1.11      
[46] survival_2.42-3        yaml_2.1.19            AnnotationDbi_1.42.1  
[49] cluster_2.0.7-1        memoise_1.1.0          knitr_1.20            

This reproducible R Markdown analysis was created with workflowr 1.0.1