This function parses BLASTP result tables to extract structured genome, contig, ORF, and gene information from the query and subject identifiers. It is designed for downstream analyses requiring explicit separation of genome, contig, and ORF identifiers from concatenated BLAST headers.
orf_extract(bin_genes = blastp_df)
The original data frame with six additional columns:
genome: Genome identifier extracted from qaccver.
contig: Contig identifier extracted from qaccver.
orf: Full ORF identifier extracted from qaccver.
genome_contig: Concatenated genome and contig IDs (genome---contig).
gene: Gene symbol extracted from saccver.
orf_position: Numeric ORF position extracted from the ORF identifier.
bin_genes: A data frame containing BLASTP results with at least two standard columns: qaccver and saccver.
The qaccver column must contain both the genome name and the predicted contig name, joined by the separator "---".
For example, for the qaccver "p__Myxococcota--c__Kuafubacteria--o__Kuafubacteriales--f__Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_150",
the genome name is "p__Myxococcota--c__Kuafubacteria--o__Kuafubacteriales--f__Kuafubacteriaceae--GCA_016703535.1",
the contig name is "JADJBV010000001.1", the ORF name is "JADJBV010000001.1_150", and the orf_position is 150.
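The splitting rule described above can be sketched in base R. This is a minimal illustration, not the package's actual implementation: parse_qaccver is a hypothetical helper name, and the sketch assumes that each qaccver contains exactly one "---" separator and that the ORF identifier ends in "_<number>".

```r
# Hypothetical helper (not part of the package): splits qaccver values into
# the pieces described in the text. Assumes exactly one "---" per value.
parse_qaccver <- function(qaccver) {
  # Split at the literal "---" separator: genome on the left, ORF on the right
  parts <- strsplit(qaccver, "---", fixed = TRUE)
  genome <- vapply(parts, `[`, character(1), 1)
  orf <- vapply(parts, `[`, character(1), 2)

  # The contig is the ORF identifier without its trailing "_<number>"
  contig <- sub("_[0-9]+$", "", orf)

  # The ORF position is the number after the last underscore
  orf_position <- as.numeric(sub("^.*_", "", orf))

  data.frame(
    genome = genome,
    contig = contig,
    orf = orf,
    genome_contig = paste(genome, contig, sep = "---"),
    orf_position = orf_position,
    stringsAsFactors = FALSE
  )
}

parse_qaccver("p__Myxococcota--c__Kuafubacteria--o__Kuafubacteriales--f__Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_150")
```

Matching "---" as a fixed string rather than a regular expression keeps the "--" separators inside the taxonomy prefix intact.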
The saccver column must contain the gene name, optionally followed by additional gene information, joined by the separator "_".
For example, for the saccver "bchC_Methyloversatilis_sp_RAC08_BSY238_2447_METR",
the gene name is "bchC" and the gene information is "Methyloversatilis_sp_RAC08_BSY238_2447_METR", which helps identify the source of the gene.
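The saccver convention can be sketched the same way in base R. Again, this is only an illustration under stated assumptions: parse_saccver is a hypothetical helper, the gene name is assumed to be everything before the first "_", and the gene_info column is shown only for clarity (it is not one of the six columns listed for orf_extract).

```r
# Hypothetical helper (not part of the package): separates the gene name
# from the optional gene information in saccver values.
parse_saccver <- function(saccver) {
  # Gene name: everything before the first underscore
  gene <- sub("_.*$", "", saccver)

  # Gene information: everything after the first underscore,
  # or NA when the value has no underscore at all
  gene_info <- ifelse(
    grepl("_", saccver, fixed = TRUE),
    sub("^[^_]*_", "", saccver),
    NA_character_
  )

  data.frame(gene = gene, gene_info = gene_info, stringsAsFactors = FALSE)
}

parse_saccver("bchC_Methyloversatilis_sp_RAC08_BSY238_2447_METR")
```

Note that a gene name which itself contains "_" would be truncated under this assumption.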