5a). (b) AT2 cells and AM express SFTPC and MARCO, respectively. Therefore, as experiments that include biological replication become more common, statistical frameworks to account for multiple sources of biological variability will be critical, as recently described by Lähnemann et al. In stage ii, we assume that we have not measured cell-level covariates, so that variation in expression between cells of the same type occurs only through the dispersion parameter ij2. In practice, this assumption is unlikely to be satisfied, but if we make modest assumptions about the growth rates of the size factors and numbers of cells per subject, we can obtain a useful approximation. The negative binomial distribution has a convenient interpretation as a hierarchical model, which is particularly useful for sequencing studies. provides an argument for using mixed models over pseudobulk methods because pseudobulk methods discovered fewer differentially expressed genes. The expression level of gene i for group 1, i1, was matched to the pig data by setting ei1=jcKijc/i'jcKi'jc. The subject method has the strongest type I error rate control and highest PPVs, wilcox has the highest TPRs, and mixed has intermediate performance, with better TPRs than subject yet lower FPRs than wilcox (Supplementary Table S2). Our analysis of CF and non-CF pigs showed that the subject method better controlled the FPR of DS analysis when the expected rate of true positives is small; here, using the same animal model, we compare large and small airway ciliated cells, which are expected to differ substantially. (Zimmerman et al., 2021). In each panel, PR curves are plotted for each of seven DS analysis methods: subject (red), wilcox (blue), NB (green), MAST (purple), DESeq2 (orange), Monocle (gold) and mixed (brown).
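The hierarchical reading of the negative binomial mentioned above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the standard Poisson-gamma construction, in which a gamma-distributed rate with a given mean and dispersion feeds a Poisson count, so that the variance of the count is mean + dispersion × mean². The function names and the values 5.0 and 0.5 are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable for small rates only; exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, dispersion, rng):
    """Draw a count as Poisson(rate), where rate is gamma distributed."""
    # A gamma with shape 1/dispersion and scale mean*dispersion has
    # expectation `mean` and variance dispersion * mean**2.
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rate, rng)


def nb_variance(mean, dispersion):
    """Theoretical variance of the mixture: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(0)
draws = [sample_negative_binomial(5.0, 0.5, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
```

The sample variance should exceed the sample mean, which is the overdispersion that a plain Poisson model cannot capture.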
As scRNA-seq costs have decreased, collecting data from more than one biological replicate has become more feasible, but careful modeling of different layers of biological variation remains challenging for many users. We have found this particularly useful for small clusters that do not always separate using unbiased clustering, but which look tantalizingly distinct. As a gold standard, results from bulk RNA-seq of isolated AT2 cells and AM comparing IPF and healthy lungs (bulk). For clarity of exposition, we adopt and extend notations similar to Love et al. (2014). Further, the cell-level variance and subject-level variance parameters were matched to the pig data. When only 1% of genes were differentially expressed, the mixed method had a larger area under the curve than the other five methods. First, we present a statistical model linking differences in gene counts at the cellular level to four sources: (i) subject-specific factors (e.g. This suggests that methods that fail to account for between-subject differences in gene expression are more sensitive to biological variation between subjects, leading to more false discoveries. True positives were identified as those genes in the bulk RNA-seq analysis with FDR<0.05 and |log2(IPF/healthy)|>1. "poisson" : Likelihood ratio test assuming an . The cluster contains hundreds of computation nodes with varying numbers of processor cores and memory, but all jobs were submitted to the same job queue, ensuring that the relative computation times for these jobs were comparable. Next, I'm looking to visualize this as a volcano plot using the EnhancedVolcano package. All of the other methods compute P-values that are much smaller than those computed by the permutation tests. Next, we matched the empirical moments of the distributions of $E_{ijc}$ and $E_{ij}$ to the population moments.
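The gold-standard rule quoted above (FDR<0.05 and |log2(IPF/healthy)|>1 in the bulk analysis) can be turned into a small scoring routine. This is a minimal sketch under stated assumptions: the gene names, the bulk values and the set of called genes are made up for illustration, and the helper names are hypothetical rather than taken from any package.

```python
def gold_standard(bulk, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Return genes that the bulk reference marks as true positives.

    `bulk` maps gene -> (FDR, log2 fold change).
    """
    return {
        gene
        for gene, (fdr, lfc) in bulk.items()
        if fdr < fdr_cutoff and abs(lfc) > lfc_cutoff
    }


def confusion_rates(called, truth, all_genes):
    """Compute TPR, FPR and PPV of a set of called genes against the truth."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(all_genes - called - truth)
    nan = float("nan")
    tpr = tp / (tp + fn) if tp + fn else nan
    fpr = fp / (fp + tn) if fp + tn else nan
    ppv = tp / (tp + fp) if tp + fp else nan
    return tpr, fpr, ppv


# Hypothetical bulk results: gene -> (FDR, log2(IPF/healthy)).
bulk = {
    "G1": (0.001, 2.3),
    "G2": (0.20, 1.8),
    "G3": (0.01, -1.5),
    "G4": (0.03, 0.4),
    "G5": (0.60, -0.1),
}
truth = gold_standard(bulk)
called = {"G1", "G4"}  # genes a hypothetical single-cell method called
tpr, fpr, ppv = confusion_rates(called, truth, set(bulk))
```

In this toy example only G1 and G3 pass both cutoffs; G2 fails on FDR and G4 fails on effect size.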
See Supplementary Material for brief example code demonstrating the usage of aggregateBioVar. The data from pig airway epithelia underlying this article are available in GEO and can be accessed with GEO accession GSE150211. We are deprecating this functionality in favor of the patchwork system. Analysis of AT2 cells and AMs from healthy and IPF lungs. S14f), wilcox produces better ranked gene lists of known markers than both subject and mixed and again, the mixed method has the worst performance. One such subtype, defined by expression of CD66, was further processed by sorting basal cells according to detection of CD66 and profiling by bulk RNA-seq. It enables quick visual identification of genes with large fold changes that are also statistically significant. They also thank Paul A. Reyfman and Alexander V. Misharin for sharing bulk RNA-seq data used in this study. You can download this dataset from SeuratData. In addition to changes to FeaturePlot(), several other plotting functions have been updated and expanded with new features, taking over the role of now-deprecated functions. Define $K_{ijc}$ to be the count for gene $i$ in cell $c$ collected from subject $j$, and $s_{jc}$ a size factor related to the amount of information collected from cell $c$ in subject $j$ ($i=1,\ldots,G$; $c=1,\ldots,C_j$; $j=1,\ldots,n$). Further, if we assume that, for some constants $k_1$ and $k_2$, $C_j^{-1}\sum_c s_{jc} \to k_1$ and $C_j^{-1}\sum_c s_{jc}^2 \to k_2$ as $C_j \to \infty$, then the variance of $K_{ij}$ is ij+i+o1ij2. The scRNA-seq data for the analysis of human lung tissue were obtained from GEO accession GSE122960, and the bulk RNA-seq of purified AT2 and AM fractions were shared by the authors immediately upon request.
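The step from cell-level counts to subject-level counts can be sketched in a few lines. The example below assumes that the subject-level count for a gene is the sum of that gene's counts over all cells from the subject (the usual pseudobulk construction); this is an illustration only and not the aggregateBioVar implementation. The cell, subject and gene names are hypothetical.

```python
from collections import defaultdict


def aggregate_by_subject(cell_counts, cell_subject):
    """Sum gene counts over cells within each subject.

    `cell_counts` maps cell -> {gene: count}; `cell_subject` maps
    cell -> subject. Returns subject -> {gene: summed count}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell, counts in cell_counts.items():
        subject = cell_subject[cell]
        for gene, count in counts.items():
            totals[subject][gene] += count
    return {subject: dict(genes) for subject, genes in totals.items()}


# Hypothetical toy data: three cells from two subjects.
cell_counts = {
    "cell1": {"geneA": 3, "geneB": 0},
    "cell2": {"geneA": 1, "geneB": 2},
    "cell3": {"geneA": 5, "geneB": 4},
}
cell_subject = {"cell1": "pig1", "cell2": "pig1", "cell3": "pig2"}
pseudobulk = aggregate_by_subject(cell_counts, cell_subject)
```

After aggregation each subject contributes one count per gene, so a bulk RNA-seq style test can treat subjects, not cells, as the units of analysis.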
More conventional statistical techniques for hierarchical models, such as maximum likelihood or Bayesian maximum a posteriori estimation, could produce less noisy parameter estimates and hence lead to a more powerful DS test (Gelman and Hill, 2007). Specifically, the CDFs are in high agreement for the subject method in the range of P-values from 0 to 0.2, whereas the mixed method has a slight inflation of small P-values in the same range compared to the permutation test. The subject method had the shortest average computation times, typically <1 min. For the AT2 cells (Fig. On the other hand, subject had the smallest FPR (0.03) compared to wilcox and mixed (0.26 and 0.08, respectively) and had a higher PPV (0.38 compared to 0.10 and 0.23). However, in studies with biological replication, gene expression is influenced by both cell-specific and subject-specific effects. Here is the Volcano plot. I read before that we are not allowed to do the differential gene expression using the integrated data. We designed a simulation study to examine characteristics of using subjects or cells as units of analysis for DS testing under data simulated from the proposed model. First, the CF and non-CF labels were permuted between subjects. Andrew L Thurman, Jason A Ratcliff, Michael S Chimenti, Alejandro A Pezzulo, Differential gene expression analysis for multi-subject single-cell RNA-sequencing studies with aggregateBioVar, Bioinformatics, Volume 37, Issue 19, 1 October 2021, Pages 3243-3251, https://doi.org/10.1093/bioinformatics/btab337.
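Permuting the CF and non-CF labels between subjects can be illustrated with an exact permutation test on one gene. This is a simplified sketch, assuming a single subject-level expression value per subject and an absolute difference in group means as the test statistic; the study may well have used a different statistic, and all values below are hypothetical.

```python
from itertools import combinations


def permutation_p_value(values, labels, group="CF"):
    """Exact two-sided permutation p-value for a difference in group means.

    `values` maps subject -> expression summary; `labels` maps
    subject -> group label. Every relabeling that keeps the group sizes
    fixed is enumerated, so this is only practical for few subjects.
    """
    subjects = sorted(values)
    observed_group = {s for s in subjects if labels[s] == group}

    def abs_mean_diff(group1):
        in_group = [values[s] for s in subjects if s in group1]
        out_group = [values[s] for s in subjects if s not in group1]
        return abs(sum(in_group) / len(in_group) - sum(out_group) / len(out_group))

    observed = abs_mean_diff(observed_group)
    stats = [
        abs_mean_diff(set(relabeled))
        for relabeled in combinations(subjects, len(observed_group))
    ]
    # Small tolerance guards against floating-point ties.
    extreme = sum(1 for stat in stats if stat >= observed - 1e-12)
    return extreme / len(stats)


# Hypothetical subject-level values for one gene, three subjects per group.
values = {"s1": 10.0, "s2": 12.0, "s3": 11.0, "s4": 3.0, "s5": 4.0, "s6": 5.0}
labels = {"s1": "CF", "s2": "CF", "s3": "CF",
          "s4": "non-CF", "s5": "non-CF", "s6": "non-CF"}
p_value = permutation_p_value(values, labels)
```

With three subjects per group there are only 20 relabelings, so the smallest achievable two-sided p-value is 2/20 = 0.1. This shows why permutation tests at the subject level are conservative when few subjects are available.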
A more powerful statistical test that yields well-controlled FDR could be constructed by considering techniques that estimate all parameters of the hierarchical model. Generally, tests for marker detection, such as the wilcox method, are sufficient if type I error rate control is less of a concern than type II error rate; in circumstances where type I error rate control is most important, methods like subject and mixed can be used. Returns a volcano plot, as a ggplot object that can be modified or plotted, built from the output of the FindMarkers function from the Seurat package. If the ident.2 parameter is omitted or set to NULL, FindMarkers() will test for differentially expressed features between the group specified by ident.1 and all other cells. Single-cell RNA-sequencing (scRNA-seq) enables analysis of the effects of different conditions or perturbations on specific cell types or cellular states. I have seen tutorials on the web, but the data there is not processed the same way as I have been doing following the Satija lab method, and my files are not .csv but .tsv. In addition to returning a vector of cell names, CellSelector() can also take the selected cells and assign a new identity to them, returning a Seurat object with the identity classes already set. This is done by passing the Seurat object used to make the plot into CellSelector(), as well as an identity class. The expression parameter for the difference between groups 1 and 2, i2, was varied in order to evaluate the properties of DS analysis under a number of different scenarios. Figure 2 shows precision-recall (PR) curves averaged over 100 simulated datasets for each simulation setting and method. In bulk RNA-seq studies, gene counts are often assumed to follow a negative binomial distribution (Hardcastle and Kelly, 2010; Leng et al., 2013; Love et al., 2014; Robinson et al., 2010).
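The coordinates behind a volcano plot are simple to compute from a table of fold changes and adjusted p-values. The sketch below is plain Python, not the Seurat or EnhancedVolcano API: it assumes each gene comes with a log2 fold change and an adjusted p-value, and that p-values reported as exactly 0 are plotted at the largest finite −log10(p) plus 100. The cutoffs, gene names and values are hypothetical.

```python
import math


def volcano_points(results, p_cutoff=0.05, lfc_cutoff=1.0, bump=100.0):
    """Map gene -> (log2FC, -log10(p), is_significant).

    `results` maps gene -> (log2 fold change, adjusted p-value).
    A p-value of 0 has no finite -log10, so it is placed at the largest
    finite value plus `bump`.
    """
    finite = [-math.log10(p) for _, p in results.values() if p > 0]
    cap = (max(finite) if finite else 0.0) + bump
    points = {}
    for gene, (lfc, p) in results.items():
        y = -math.log10(p) if p > 0 else cap
        significant = p < p_cutoff and abs(lfc) > lfc_cutoff
        points[gene] = (lfc, y, significant)
    return points


# Hypothetical differential expression output.
results = {
    "geneA": (2.0, 0.0),    # p-value underflowed to 0
    "geneB": (-1.5, 1e-4),
    "geneC": (0.1, 0.5),
}
points = volcano_points(results)
```

The x and y values can then be passed to any plotting library, with the boolean flag controlling point colour.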
We will create a volcano plot colouring all significant genes. First, the adjusted P-values for each method are sorted from smallest to largest. A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). I understand a little bit more now. This study found that generally pseudobulk methods and mixed models had better statistical characteristics than marker detection methods, in terms of detecting differentially expressed genes with well-controlled false discovery rates (FDRs), and pseudobulk methods had fast computation times. Then, we consider the top g genes for each method, which are the g genes with the smallest adjusted P-values, and find what percentage of these top genes are known markers. To use, simply make a ggplot2-based scatter plot (such as DimPlot() or FeaturePlot()) and pass the resulting plot to HoverLocator(). Default is 0.25. 6e), subject and mixed have the same area under the ROC curve (0.82) while the wilcox method has slightly smaller area (0.78). Infinite values of -log(p), which arise from p-values of 0, are set to the highest finite -log(p) + 100. This research was supported in part through computational resources provided by The University of Iowa, Iowa City, Iowa. These methods provide interpretable results that generalize to a population of research subjects, account for important sources of biological and technical variability and provide adequate FDR control. Further, applying computational methods that account for all sources of variation will be necessary to gain better insights into biological systems, operating at the granular level of cells all the way up to the level of populations of subjects. Department of Internal Medicine, Roy J. and Lucille A.
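The ranking procedure described above (sort by adjusted P-value, take the top g genes, report the percentage that are known markers) can be sketched as follows. The gene names, P-values and marker set are hypothetical, and ties are broken by input order because Python's sort is stable.

```python
def percent_known_in_top(adjusted_p, known_markers, g):
    """Percentage of the g smallest-adjusted-P genes that are known markers.

    `adjusted_p` maps gene -> adjusted P-value.
    """
    ranked = sorted(adjusted_p, key=lambda gene: adjusted_p[gene])
    top = ranked[:g]
    if not top:
        raise ValueError("g must be at least 1 and adjusted_p must be non-empty")
    hits = sum(1 for gene in top if gene in known_markers)
    return 100.0 * hits / len(top)


# Hypothetical adjusted P-values from one method and a known-marker list.
adjusted_p = {"a": 0.001, "b": 0.02, "c": 0.03, "d": 0.4}
known_markers = {"a", "c"}
top2 = percent_known_in_top(adjusted_p, known_markers, 2)
top3 = percent_known_in_top(adjusted_p, known_markers, 3)
```

Repeating the calculation over a range of g values for each method gives curves that can be compared across methods.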
To measure heterogeneity in expression among different groups, we assume that mean expression for gene $i$ in subject $j$ is influenced by $R$ subject-specific covariates $x_{j1},\ldots,x_{jR}$.