A Brief Note on Single-Cell RNA-Seq Analysis
A brief note highlighting the workflow of single-cell RNA-seq data analysis.
1. Pre-Processing
Goal:
- To turn the scRNA-seq count matrices into several levels of pre-processed data (quality-controlled, normalized, corrected), each suited to different downstream analyses
1.1 Quality Control
Goal:
- To remove outlier barcode data produced by dying cells, broken cells, doublets, etc.
Covariates used for thresholding:
- The number of counts per barcode (count depth)
- The number of genes per barcode
- The fraction of counts from mitochondrial genes per barcode
Notes:
- Consider all covariates jointly
- Revisit QC if downstream analysis is not satisfactory
- Adjust QC thresholds for different samples if necessary
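As a minimal sketch of joint thresholding on the three covariates above, assuming a dense barcodes-by-genes count matrix: the function name `qc_mask`, the toy matrix, and the threshold values are all illustrative assumptions, not recommended defaults.

```python
import numpy as np

def qc_mask(counts, is_mito, min_counts, min_genes, max_mito_frac):
    """Return a boolean mask of barcodes that pass all three QC thresholds.

    counts  : (n_barcodes, n_genes) array of raw counts
    is_mito : (n_genes,) boolean array marking mitochondrial genes
    """
    counts = np.asarray(counts)
    depth = counts.sum(axis=1)              # count depth per barcode
    n_genes = (counts > 0).sum(axis=1)      # detected genes per barcode
    mito = counts[:, is_mito].sum(axis=1)   # mitochondrial counts per barcode
    # Avoid division by zero for empty barcodes (their fraction stays 0).
    mito_frac = np.divide(mito, depth, out=np.zeros(len(depth)), where=depth > 0)
    # All covariates are applied jointly: a barcode must pass every threshold.
    return (depth >= min_counts) & (n_genes >= min_genes) & (mito_frac <= max_mito_frac)

# Toy data: 4 barcodes x 4 genes, the last gene is mitochondrial (assumed).
counts = np.array([
    [10, 5, 0, 1],    # healthy-looking barcode
    [1, 0, 0, 0],     # very low count depth
    [2, 2, 2, 14],    # high mitochondrial fraction
    [0, 0, 0, 0],     # empty barcode
])
is_mito = np.array([False, False, False, True])

keep = qc_mask(counts, is_mito, min_counts=5, min_genes=2, max_mito_frac=0.5)
filtered = counts[keep]
```

In practice the thresholds would be chosen per sample after inspecting the covariate distributions, and revisited if downstream results look off.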
1.2 Normalization
Goal:
- To remove unwanted technical variation coming from count sampling
Notes:
- It’s more common to normalize the count data by cells than by genes
- Normalized data should be log(x+1)-transformed for downstream analysis that assumes data are normally distributed
- Quality-controlled or normalized data can also be used for statistical testing of gene expression
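A simple way to picture per-cell normalization is count-depth scaling followed by a log(x+1) transform. The sketch below assumes a dense barcodes-by-genes matrix; the function name `normalize_log1p` and the `target_sum` of 10,000 are illustrative choices, and other normalization methods exist.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)   # count depth per cell
    # Cells with zero depth are left as zeros instead of dividing by zero.
    scaled = np.divide(counts * target_sum, depth,
                       out=np.zeros_like(counts), where=depth > 0)
    return np.log1p(scaled)

# Two cells with the same composition but a 10x difference in depth.
raw = np.array([[1, 3],
                [10, 30]])
norm = normalize_log1p(raw)
```

After scaling, the two cells have identical profiles, which is the point: differences that come only from count sampling depth are removed.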
1.3 Data Correction
Goal:
- To further remove unwanted technical and biological variation
Sources of variation:
- Technical effects
  - Count depth
  - Batch
    - Between cell groups in an experiment (solution: batch correction)
    - Between experiments in a laboratory (solution: data integration)
    - Between datasets from different laboratories (solution: data integration)
  - Dropout (solution: imputation/denoising/expression recovery)
- Biological effects
  - Cell cycle
  - Mitochondrial gene expression
Notes:
- Corrected data can also be used for visual comparison of gene expression
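One simple form of data correction is to regress a known covariate (for example count depth or a cell-cycle score) out of each gene with ordinary least squares. The sketch below is only that: a linear-regression illustration on simulated data, with a hypothetical function name `regress_out`. It is not a batch-correction or data-integration method.

```python
import numpy as np

def regress_out(X, covariates):
    """Remove the linear effect of covariates from each gene (column) of X.

    X          : (n_cells, n_genes) normalized expression
    covariates : (n_cells,) or (n_cells, n_covariates) unwanted covariates
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(covariates, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    design = np.column_stack([np.ones(len(C)), C])     # intercept + covariates
    beta, *_ = np.linalg.lstsq(design, X, rcond=None)  # one fit per gene
    residuals = X - design @ beta
    return residuals + X.mean(axis=0)                  # keep each gene's mean

# Simulated data (assumption): genes 0 and 2 depend on a depth-like covariate.
rng = np.random.default_rng(0)
depth = rng.uniform(1, 5, size=50)
X = np.outer(depth, [2.0, 0.0, -1.0]) + rng.normal(size=(50, 3))
corrected = regress_out(X, depth)
```

After correction, no gene is linearly correlated with the covariate any more. Whether regressing out a biological effect is desirable depends on the question being asked.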
1.4 Dimensionality Reduction
Goal:
- To summarize and visualize the data in order to make downstream analysis computationally easier and more intuitive
Notes:
- Typically, 1,000 to 5,000 highly variable genes are first selected via feature selection
- The feature number is further reduced by dedicated dimensionality reduction algorithms
- Dimensionality reduction methods should be considered separately for summarization and visualization
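The two steps above (feature selection, then a dedicated reduction) can be sketched with plain NumPy. Ranking genes by raw variance is a deliberate simplification of highly-variable-gene selection, PCA via SVD stands in for the summarization step, and the names `select_hvg` and `pca` are illustrative.

```python
import numpy as np

def select_hvg(X, n_top=2000):
    """Return sorted indices of the n_top genes with the highest variance."""
    var = np.asarray(X, dtype=float).var(axis=0)
    n_top = min(n_top, var.size)
    return np.sort(np.argsort(var)[::-1][:n_top])

def pca(X, n_components=50):
    """Project cells onto the leading principal components using SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    return U[:, :k] * S[:k]                  # cell coordinates in PC space

# Toy data: gene 0 is constant, gene 2 varies the most.
X = np.column_stack([
    np.ones(6),
    [0, 1, 0, 1, 0, 1],
    [0, 10, 0, 10, 0, 10],
])
hvg = select_hvg(X, n_top=2)
embedding = pca(X[:, hvg], n_components=1)
```

The resulting embedding is meant for summarization (input to clustering or trajectory methods); a separate, typically non-linear, method would be used for 2D visualization.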
2. Downstream Analysis
Goal:
- To extract biological insights and describe the underlying biological system
2.1 Cluster Analysis
Goal:
- To explain the heterogeneity in the data based on a categorization of cells into groups
Methods:
- Clustering: to organize cells into clusters
- Compositional analysis: to analyze clustered data in terms of the proportions of cells that fall into each cell-identity cluster
- Cluster annotation: to find marker genes to characterize each cluster and annotate it with a meaningful biological label
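To make clustering and compositional analysis concrete, here is a small sketch using k-means on a reduced representation, followed by per-cluster cell proportions. k-means is chosen only because it fits in a few lines; it is not necessarily the method the references recommend, and `kmeans` and `cluster_proportions` are hypothetical helper names.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[new_labels == j].mean(axis=0) if np.any(new_labels == j) else centers[j]
            for j in range(k)
        ])
        if np.array_equal(new_labels, labels) and np.allclose(new_centers, centers):
            break                                  # converged
        labels, centers = new_labels, new_centers
    return labels, centers

def cluster_proportions(labels, k):
    """Fraction of cells assigned to each cluster."""
    return np.bincount(labels, minlength=k) / len(labels)

# Toy embedding: 3 cells near the origin, 5 cells near (10, 10).
cells = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10], [11, 11], [10, 10.5]])
labels, centers = kmeans(cells, k=2)
props = cluster_proportions(labels, k=2)
```

Comparing `props` between samples or conditions is the starting point for compositional analysis; annotation would then look for marker genes that distinguish each cluster.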
2.2 Trajectory Analysis
Goal:
- To regard the data as a snapshot of a dynamic process and investigate this underlying process
- By representing the clusters as nodes and the in-between trajectories as edges, one can represent both the static and dynamic nature of the data
Methods:
- Trajectory inference: to capture transitions between cell identities, branching differentiation processes, or gradual, unsynchronized changes in biological function
- Metastable state identification: to investigate cellular densities along a trajectory and find dense regions, which may represent metastable transcriptomic states
- Gene expression dynamics: to model the genes that vary smoothly across pseudotime in order to characterize the trajectory and identify the underlying biological process
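As a rough illustration of pseudotime and gene expression dynamics, the sketch below orders cells along the first principal component starting from a chosen root cell, then correlates each gene with that ordering. This is a crude linear stand-in that cannot capture branching; real trajectory inference methods are far more elaborate. The function names and toy data are assumptions.

```python
import numpy as np

def pc1_pseudotime(X, root):
    """Pseudotime in [0, 1] from the projection of cells onto PC1.

    root : index of a cell assumed to lie at the start of the process
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]
    if proj[root] > 0:                  # orient the axis so the root is early
        proj = -proj
    span = proj.max() - proj.min()
    return (proj - proj.min()) / span if span > 0 else np.zeros(len(proj))

def pseudotime_correlation(X, pseudotime):
    """Pearson correlation of each gene with pseudotime (0 for constant genes)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    tc = pseudotime - pseudotime.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (tc ** 2).sum())
    return np.divide(Xc.T @ tc, denom, out=np.zeros(X.shape[1]), where=denom > 0)

# Toy data: 5 cells sampled out of order along one linear process;
# genes 0 and 1 increase over time, gene 2 decreases.
true_time = np.array([3.0, 0.0, 4.0, 1.0, 2.0])
expr = np.outer(true_time, [1.0, 2.0, -1.0]) + np.array([0.0, 1.0, 5.0])
pt = pc1_pseudotime(expr, root=1)
trend = pseudotime_correlation(expr, pt)
```

Genes with a strong positive or negative `trend` are candidates for characterizing the trajectory; smooth regression models over pseudotime would be the more careful follow-up.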
2.3 Gene-Level Analysis
Goal:
- Rather than describing the cellular heterogeneity, gene-level analysis uses this heterogeneity as context in which gene expression is to be understood
Methods:
- Differential expression analysis: to investigate whether any genes are differentially expressed under different experimental conditions
- Gene-set (pathway) analysis: to facilitate the interpretation of long candidate gene lists by grouping the genes into sets based on shared characteristics and testing whether these characteristics are overrepresented in the list
- Gene regulatory network inference: to uncover the regulatory interactions among different genes and small molecules
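The over-representation idea behind gene-set analysis can be written as a one-sided hypergeometric test: given a universe of genes, a gene set, and a candidate list, how likely is an overlap at least as large as the one observed? The sketch below uses only the standard library; the function name and the toy numbers are illustrative.

```python
from math import comb

def overrepresentation_p(n_universe, n_in_set, n_candidates, n_overlap):
    """P(overlap >= n_overlap) when n_candidates genes are drawn at random
    from n_universe genes, of which n_in_set belong to the gene set."""
    total = comb(n_universe, n_candidates)
    upper = min(n_in_set, n_candidates)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_candidates - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy example: 10 genes in total, 4 in the pathway, 3 candidate genes,
# 2 of which fall in the pathway.
p_value = overrepresentation_p(n_universe=10, n_in_set=4, n_candidates=3, n_overlap=2)
p_trivial = overrepresentation_p(n_universe=10, n_in_set=4, n_candidates=3, n_overlap=0)
```

Here `p_value` works out to 40/120 = 1/3. When many gene sets are tested, the resulting p-values would need multiple-testing correction.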
References
- Current best practices in single-cell RNA-seq analysis: a tutorial. MD Luecken, FJ Theis. Molecular Systems Biology, 2019
- Computational Methods for Single-Cell Data Analysis. GC Yuan. Springer, 2019
- Orchestrating single-cell analysis with Bioconductor. RA Amezquita, et al. Nature Methods, 2019
- Single-cell RNA sequencing technologies and bioinformatics pipelines. B Hwang, JH Lee, D Bang. Experimental & Molecular Medicine, 2018