Learn R Programming

ICBioMark

Welcome to ICBioMark, based on the paper “Data-driven design of targeted gene panels for estimating immunotherapy biomarkers”, Bradley and Cannings (see our preprint).

Installation

You can install the release version of ICBioMark from CRAN with:

install.packages("ICBioMark")

Alternatively, you can install the development version of this package from this github repository (using the devtools package) with:

devtools::install_github("cobrbra/ICBioMark")

Examples

Upon installation we can load the package.

library(ICBioMark)

To demonstrate the typical workflow for using ICBioMark, we play around with a small and friendly example dataset. This is pre-loaded with the package, but just comes from the data simulation function generate_maf_data(), so you can play around with datasets of different sizes and shapes.

Input Data

Our example dataset, called example_maf_data, is a list with two elements: maf and gene_lengths. These are the two pieces of data that you’ll always need to use this package, and they look as follows:

  • maf is a data frame in the MAF (mutation annotated format) style. For a set of sequenced tumour/normal pairs, this means a table with a row for every mutation identified, with columns corresponding to properties such as the sample ID for the tumour of origin, the gene, chromosome and nucelotide location of the mutation, and the type of mutation observed. In the real world, MAF datasets often have lots of extra information beyond this, but in our small example we’ve just included sample, gene and mutation type (it’s all we’ll need!). The top five rows look like this:
 # example_maf_data <- generate_maf_data()

 kable(head(example_maf_data$maf, 5), row.names = FALSE)
Tumor_Sample_BarcodeHugo_SymbolVariant_Classification
SAMPLE_96GENE_14Missense_Mutation
SAMPLE_73GENE_14Frame_Shift_Ins
SAMPLE_55GENE_4Missense_Mutation
SAMPLE_96GENE_3Missense_Mutation
SAMPLE_38GENE_7Missense_Mutation
  • gene_lengths, another data frame, this time containing the names of genes that you’ll want to include in your modelling and their length. Gene length is a complex and subtle thing to define - we advise using coding length as defined in the Ensembl database. For this example, however, gene lengths are again randomly chosen:
  kable(head(example_maf_data$gene_lengths, 5), row.names = FALSE)
Hugo_Symbolmax_cds
GENE_1961
GENE_21009
GENE_31011
GENE_4976
GENE_51016

These are the only two bits of data required to use ICBioMark. Your gene lengths data can contain values for far more genes than are observed in your dataset, and it’s not a huge problem if a couple of genes in a Whole-Exome Sequencing (WES) experiment are missing gene length information, but lots of missing values will cause issues with your model accuracy. Later versions of this package should be able to address missing gene length data.

Train/Val/Test and Matrix Construction

The MAF format is widely used and standardised, but not especially helpful for our purposes. The ideal format for our data is a matrix, where every row corresponds to a sample, every column corresponds to a gene/mutation type combination, and each entry corresponds to how many mutations of that sample, gene and type were sequenced. At the same time as this, we’d like to separate our training data from separately reserved validation and test data. We do this using the function get_mutation_tables().

Before we do this, however, we need to talk about mutation types. Our procedure models different mutation types separately, so in theory one could have separate parameters for each mutation type (e.g. ‘Missense_Mutation’ or ‘Frame_Shift_Ins’). However, doing so will vastly increase the computational complexity of fitting a generative model. It is also not particularly informative to fit parameters to extremely scarce mutation types. We therefore group mutation types together (and can filter some out if we don’t want to include them in our modelling). This will happen behind-the-scenes, but is worth knowing about to understand the outputs generated. Mutations types are grouped and filtered by the function get_mutation_dictionary(). In general we recommend separately modelling indel mutations (so that we can predict TIB later), synonymous mutations (as these don’t count towards TMB or TIB), and lumping together all other nonsynonymous mutation types. The function get_mutation_dictionary() produces a list of mutation types, with labels for their groupings. For example:

  kable(get_mutation_dictionary(), col.names = "Label")
Label
Missense_MutationNS
Nonsense_MutationNS
Splice_SiteNS
Translation_Start_SiteNS
Nonstop_MutationNS
In_Frame_InsNS
In_Frame_DelNS
Frame_Shift_DelI
Frame_Shift_InsI
SilentS
Splice_RegionS
3’FlankS
5’FlankS
IntronS
RNAS
3’UTRS
5’UTRS

We’ve given each mutation type one of three labels: “NS”, “S” and “I”. We could have excluded synonymous mutation types by using get_mutation_dictionary(include_synonymous = FALSE).

Now we can produce our training, validation and test sets (again for this example workflow these are pre-loaded). The object produced has three elements: ‘train’, ‘val’ and ‘test’. Each of these contains a sparse mutation matrix (‘matrix’) and other information describing the contents of the matrix (‘sample_list’, ‘gene_list’, ‘mut_types_list’ and ‘col_names’). We can see that the list of mutation types contains the three labels we specified above

  #example_tables <- get_mutation_tables(example_maf_data$maf, 
  #                   sample_list = paste0("SAMPLE_", 1:100))

  print(example_tables$train$mut_types_list)
#> [1] "NS" "I"  "S"

and that the columns of each matrix correspond to each combination of mutation type and gene:

  print(head(example_tables$train$col_names, 10))
#>  [1] "GENE_1_NS" "GENE_1_I"  "GENE_1_S"  "GENE_2_NS" "GENE_2_I"  "GENE_2_S" 
#>  [7] "GENE_3_NS" "GENE_3_I"  "GENE_3_S"  "GENE_4_NS"

Fitting the Generative Model

There are relatively few decisions left to be made at this point: all we need to do to fit a generative model is to provide gene lengths data and training data to the function fit_gen_model(). We can visualise output of our model with vis_model_fit().

  # example_gen_model <- fit_gen_model(example_maf_data$gene_lengths, 
  #                       table = example_tables$train)

  print(vis_model_fit(example_gen_model))

Since this is a small example, we don’t get a particularly strong signal, but we do see an optimum level of penalisation. (NB: the function vis_model_fit() essentially performs the same task as glmnet::plot.cv.glmnet(). In later versions the glmnet version will hopefully be directly applicable and vis_model_fit() will be redundant).

Fitting the Predictive Model

We now construct a first-fit predictive model. The parameter lambda controls the sparsity of each iteration, so it may take some experimentation to get a good range of panel lengths.

  example_first_pred_tmb <- pred_first_fit(example_gen_model, 
    lambda = exp(seq(-9, -14, length.out = 100)), 
    training_matrix = example_tables$train$matrix, 
    gene_lengths = example_maf_data$gene_lengths)

From this we can construct a range of refit estimators:

  example_refit_range <- pred_refit_range(pred_first = example_first_pred_tmb, 
    gene_lengths = example_maf_data$gene_lengths)

Making Predictions and Analysing Performance

With a predictive model fitted, we can use the function get_predictions() along with a new (validation or test) dataset to produce predictions on that dataset. We then provide several functions including get_stats() to analyse the output compared to true values.

  example_refit_range %>% 
     get_predictions(new_data = example_tables$val) %>% 
     get_stats(biomarker_values = example_tmb_tables$val, model = "T", threshold = 10) %>% 
     ggplot(aes(x = panel_length, y = stat)) + geom_line() + facet_wrap(~metric) + theme_minimal() + labs(x = "Panel Length", y = "Predictive Performance")

Getting Help

Please do feel free to flag issues and requests on this repository. You can also email me.

Copy Link

Version

Install

install.packages('ICBioMark')

Monthly Downloads

232

Version

0.1.4

License

MIT + file LICENSE

Maintainer

Jacob R. Bradley

Last Published

November 15th, 2021

Functions in ICBioMark (0.1.4)

get_auprc

AUPRC Metrics for Predictions
fit_gen_model_uninteract

Fit Generative Model Without Gene/Variant Type-Specific Interactions
fit_gen_model_unisamp

Fit Generative Model Without Sample-Specific Effects
example_tib_tables

Tumour Indel Burden of Example Train, Validation and Test Data.
generate_maf_data

Generate mutation data.
get_r_squared

R Squared Metrics for Predictions
get_K

Construct Bias Penalisation
get_biomarker_tables

Get True Biomarker Values on Training, Validation and Test Sets
get_predictions

Produce Predictions on an Unseen Dataset
get_gen_estimates

Investigate Generative Model Comparisons
pred_refit_panel

Refitted Predictive Model for a Given Panel
get_biomarker_from_maf

Produce a Table of Biomarker Values from a MAF
example_tmb_tables

Tumour Mutation Burden of Example Train, Validation and Test Data.
get_mutation_dictionary

Group and Filter Mutation Types
get_stats

Metrics for Predictive Performance
nsclc_maf

Non-Small Cell Lung Cancer MAF Data
nsclc_survival

Non-Small Cell Lung Cancer Survival and Clinical Data
fit_gen_model

Fit Generative Model
pred_intervals

Produce Error Bounds for Predictions
get_mutation_tables

Produce Training, Validation and Test Matrices
pred_first_fit

First-Fit Predicitve Model with Group Lasso
get_panels_from_fit

Extract Panel Details from Group Lasso Fit
get_p

Construct Optimisation Parameters.
pred_refit_range

Get Refitted Predictive Models for a First-Fit Range of Panels
vis_model_fit

Visualise Generative Model Fit
get_table_from_maf

Produce a Mutation Matrix from a MAF
example_maf_data

Simulated MAF Data
ICBioMark

ICBioMark: A package for cost-effective design of gene panels to predict exome-wide biomarkers.
example_tables

Mutation Matrices from Simulated Data
example_gen_model

Generative Model from Simulated Data
example_refit_panel

Refitted Predictive Model Fitted on Example Data
example_predictions

Example Predictions
ensembl_gene_lengths

Gene Lengths from the Ensembl Database
example_first_pred_tmb

First-Fit Predictive Model Fitting on Example Data
example_refit_range

Refitted Predictive Models Fitted on Example Data