Bai_data: Bai et al. Data Sets.

Description

These data are derived from the work of Bai and coworkers(1), who treated mouse fibroblasts with a chemical cocktail to induce differentiation into hepatocyte-like cells. The authors performed single-cell RNA-seq on the treated cells and, on the basis of marker gene expression, identified numerous clusters including fibroblasts, hepatocyte-like cells (CiHep), keratinocyte-like cells (CiKrt) and several other fibroblast-derived cell types.

Recapitulating the computational approach of the original authors, the sparse count matrix for GEO sample GSM6435750 was downloaded. Using the Seurat package(2), the data were normalized, clusters were identified with Seurat's FindClusters() function, and the results were visualized using its RunPCA() and RunUMAP() functions. Cell types were assigned to clusters on the basis of marker genes and comparison with the clusters described in the original publication(1). Differential expression analysis was performed comparing gene expression in CiHep cells with Fibroblast 2 cells, as well as CiKrt cells with Fibroblast 2 cells. In this way, the Bai_CiHep_v_Fib2.de and Bai_CiKrt_v_Fib2.de data sets, respectively, were generated.

Query gene lists of differentially expressed genes from the CiHep vs. Fibroblast 2 and CiKrt vs. Fibroblast 2 comparisons were ordered by descending \(\pi\)-value(3). CERNO(4) was then performed with the tmod package's tmodCERNOtest() function, using the C3 (regulatory gene sets) subset of MSigDB v7.5.1(5) as the gene set collection. In this way, the Bai_CiHep_DN.cerno and Bai_CiKrt_DN.cerno data sets were generated, representing pathways downregulated in CiHep cells vs. Fibroblast 2 cells and in CiKrt cells vs. Fibroblast 2 cells, respectively.
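The ranking step can be sketched in base R as follows. This is a minimal illustration, not the code used to build the data sets: the table values are invented, the \(\pi\)-value is taken here as the absolute log2 fold change multiplied by -log10(p), and restricting the "DN" query list to genes with a negative fold change is an assumption.

```r
# Toy differential-expression table using the column names of
# Bai_CiHep_v_Fib2.de (all values are invented for illustration).
de <- data.frame(
  p_val      = c(1e-10, 1e-3, 0.5, 1e-6),
  avg_log2FC = c(-2.0, 1.5, -0.1, -0.5),
  row.names  = c("GeneA", "GeneB", "GeneC", "GeneD")
)

# pi-value: absolute log2 fold change times -log10(p).
# A p-value of exactly 0 would give Inf, so real data may need a floor.
de$pi_value <- abs(de$avg_log2FC) * -log10(de$p_val)

# Assumption: a "DN" query list keeps only genes with negative fold change.
dn <- de[de$avg_log2FC < 0, ]

# Order by descending pi-value; the row names form the ranked gene list.
ranked_genes <- rownames(dn[order(dn$pi_value, decreasing = TRUE), ])
print(ranked_genes)
```

A ranked character vector of this kind is the sort of input that tmodCERNOtest() takes as its gene list.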

A gene set collection (GSC) was derived from version 1.8.0 of the dorothea package(6). To generate this GSC, transcription factor-to-target associations with confidence ratings of A, B, or C and a mode of regulation of 1 (UP, positively regulated) or -1 (DN, negatively regulated) were extracted from the dorothea::dorothea_hs table. For each transcription factor, gene sets were generated consisting of its positively (UP) or negatively (DN) associated target genes. Using the Bai count matrix including putative CiHep and Fibroblast 2 cells as assigned above, GSEA was performed against this dorothea-derived GSC, generating the Bai_CiHep_dorothea_UP.Gsea and Bai_CiHep_dorothea_DN.Gsea data sets, which contain gene sets that were positively and negatively regulated, respectively, in response to chemical induction. The empty expression matrix (Bai_empty_expr_mat), which has gene symbols as row names but no columns of counts, was generated from the Bai count matrix.
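The construction of the dorothea-derived gene sets can be sketched as follows. The example uses a small invented table with the same column layout as dorothea::dorothea_hs (tf, confidence, target, mor) so that it runs without the package installed; the "<tf>_UP" and "<tf>_DN" set names are a hypothetical naming scheme, not necessarily the one used in the data sets.

```r
# Toy stand-in for dorothea::dorothea_hs (invented rows).
regulons <- data.frame(
  tf         = c("TF1", "TF1", "TF1", "TF2", "TF2", "TF2"),
  confidence = c("A",   "B",   "E",   "C",   "A",   "D"),
  target     = c("G1",  "G2",  "G3",  "G4",  "G5",  "G6"),
  mor        = c(1,     -1,    1,     1,     1,     -1),
  stringsAsFactors = FALSE
)

# Keep confidence levels A to C and associations with mor of 1 or -1.
keep <- regulons[regulons$confidence %in% c("A", "B", "C") &
                   regulons$mor %in% c(1, -1), ]

# Hypothetical naming: "<tf>_UP" for mor == 1, "<tf>_DN" for mor == -1.
set_name <- paste(keep$tf, ifelse(keep$mor == 1, "UP", "DN"), sep = "_")

# One character vector of target genes per gene set.
gene_sets <- split(keep$target, set_name)
print(gene_sets)
```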

The Bai_gsc.tmod object was generated by extracting gene sets from the MSigDB C3 subset with adj.P.Val <= 0.05 in the Bai_CiHep_DN.cerno and Bai_CiKrt_DN.cerno data sets, and gene sets from the dorothea-derived gene set collection with `FDR q-val` <= 0.05 in the Bai_CiHep_dorothea_UP.Gsea and Bai_CiHep_dorothea_DN.Gsea data sets.
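The selection of significant gene sets can be illustrated with the sketch below. The two small data frames are invented stand-ins for the CERNO and GSEA result tables; only the column names (ID, adj.P.Val, NAME, `FDR q-val`) and the 0.05 thresholds come from the description above.

```r
# Toy stand-in for a CERNO result table.
cerno_res <- data.frame(
  ID        = c("SET_A", "SET_B", "SET_C"),
  adj.P.Val = c(0.001, 0.2, 0.05),
  stringsAsFactors = FALSE
)

# Toy stand-in for a GSEA result table; check.names = FALSE keeps the
# column name "FDR q-val" with its space and hyphen.
gsea_res <- data.frame(
  NAME        = c("TF1_UP", "TF2_UP"),
  `FDR q-val` = c(0.03, 0.6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# IDs of gene sets passing each threshold.
sig_cerno <- cerno_res$ID[cerno_res$adj.P.Val <= 0.05]
sig_gsea  <- gsea_res$NAME[gsea_res[["FDR q-val"]] <= 0.05]

# Combined list of gene sets to retain in the collection.
selected <- union(sig_cerno, sig_gsea)
print(selected)
```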

Usage

Bai_CiHep_DN.cerno
Bai_CiKrt_DN.cerno
Bai_CiHep_dorothea_UP.Gsea
Bai_CiHep_dorothea_DN.Gsea
Bai_CiHep_v_Fib2.de
Bai_CiKrt_v_Fib2.de
Bai_empty_expr_mat
Bai_gsc.tmod

Format

Bai_CiHep_DN.cerno: An object of class data.frame with 2895 rows and 8 columns.

ID Gene set ID.
Title Human-interpretable gene set title.
cerno CERNO statistic used in calculating significance.
N1 The number of observable genes in the expression dataset that are also in a given gene set.
AUC AUC value for that gene set.
cES CERNO enrichment score.
P.Value P-value calculated from the cerno statistic using the Fisher method.
adj.P.Val FDR-adjusted P-value.

Bai_CiKrt_DN.cerno: An object of class data.frame with 2895 rows and 8 columns. Same data format as Bai_CiHep_DN.cerno, above.
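As an illustration of how the cerno, cES, and P.Value columns relate to one another, the sketch below computes them for a single invented gene set. It assumes the usual CERNO formulation, in which the statistic is -2 times the sum of log(rank / N) over the genes of the set and is compared with a chi-squared distribution with 2 * N1 degrees of freedom, and it takes cES as cerno / (2 * N1) following the tmod documentation; the ranks and list length are made up.

```r
# Toy example: a gene set whose 3 genes sit at ranks 1, 2 and 3
# of a ranked list of 100 genes (numbers invented for illustration).
N     <- 100
ranks <- c(1, 2, 3)
N1    <- length(ranks)

# CERNO statistic (Fisher's method applied to scaled ranks).
cerno <- -2 * sum(log(ranks / N))

# Upper-tail chi-squared probability with 2 * N1 degrees of freedom.
P.Value <- pchisq(cerno, df = 2 * N1, lower.tail = FALSE)

# Enrichment score: the statistic divided by its degrees of freedom.
cES <- cerno / (2 * N1)

print(c(cerno = cerno, cES = cES, P.Value = P.Value))
```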

Bai_CiHep_dorothea_UP.Gsea: An object of class data.frame with 84 rows and 12 columns.

NAME Gene set name/ID.
GS<br> follow link to MSigDB Same as NAME.
GS DETAILS (Column used only in the HTML version of the data set.)
SIZE The number of genes in the gene set observed in the input data set.
ES GSEA enrichment score.
NES GSEA normalized enrichment score.
NOM p-val GSEA nominal p-value calculated via a permutation test.
FDR q-val GSEA FDR-adjusted q-value.
FWER p-val GSEA FWER-adjusted p-value.
RANK AT MAX Position in the ranked list where maximum enrichment score is found.
LEADING EDGE Three statistics defining "leading edge" subset.
"" Blank column.

Bai_CiHep_dorothea_DN.Gsea: An object of class data.frame with 180 rows and 12 columns. Same data format as Bai_CiHep_dorothea_UP.Gsea.

Bai_CiHep_v_Fib2.de: An object of class data.frame with 8769 rows and 5 columns.

(row names) Gene name.
p_val The calculated p-value for differential expression in Seurat.
avg_log2FC Average log2 fold change of gene expression between the two groups; positive values indicate higher expression in the first group.
pct.1 The percentage of cells where the feature is detected in the first group.
pct.2 The percentage of cells where the feature is detected in the second group.
p_val_adj Adjusted p-value.

Bai_CiKrt_v_Fib2.de: An object of class data.frame with 8166 rows and 5 columns. Same data format as Bai_CiHep_v_Fib2.de.

Bai_empty_expr_mat: An object of class matrix (inherits from array) with 20035 rows and 0 columns.

(row names) The gene names in the expression data set.
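A matrix of this shape, with row names but zero columns, can be obtained from any count matrix by selecting no columns. The sketch below uses an invented 3-gene matrix; the gene and cell names are hypothetical.

```r
# Toy count matrix: 3 genes (rows) by 2 cells (columns).
genes  <- c("GeneA", "GeneB", "GeneC")
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(genes, c("cell1", "cell2")))

# Selecting zero columns with drop = FALSE keeps the matrix class
# and the row names while discarding all counts.
empty_expr_mat <- counts[, 0, drop = FALSE]

print(dim(empty_expr_mat))
print(rownames(empty_expr_mat))
```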

Bai_gsc.tmod: A tmod class object containing 94 gene sets.

Licensing

Gene sets included herein derived from the C3 subset of GSEA are protected by copyright (c) 2004-2023 Broad Institute, Inc., Massachusetts Institute of Technology, and Regents of the University of California, and are included here under the terms and conditions of the Creative Commons Attribution 4.0 International License.
Gene sets included herein derived from the dorothea package are included herein under the terms and conditions of the GPL 3+ license.

Author

Jonathan M. Urbach

References

1. Bai Y, Yang Z, Xu X, Ding W, Qi J, Liu F, et al. Direct chemical induction of hepatocyte‐like cells with capacity for liver repopulation. Hepatology. 2023;77: 1550–1565. doi:10.1002/hep.32686
2. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36: 411–420. doi:10.1038/nbt.4096
3. Xiao Y, Hsiao T-H, Suresh U, Chen H-IH, Wu X, Wolf SE, et al. A novel significance score for gene selection and ranking. Bioinformatics. 2014;30: 801–807. doi:10.1093/bioinformatics/btr671
4. Zyla J, Marczyk M, Domaszewska T, Kaufmann SHE, Polanska J, Weiner J. Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms. Bioinformatics. 2019;35: 5146–5154. doi:10.1093/bioinformatics/btz447
5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102: 15545–15550. doi:10.1073/pnas.0506580102
6. Garcia-Alonso L, Holland CH, Ibrahim MM, Turei D, Saez-Rodriguez J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res. 2019;29: 1363–1375. doi:10.1101/gr.240663.118