gc_cal: Identify and Extract Gene Clusters from Scaled BLAST Data

Description

This function screens contigs for regions that contain a pre-defined set of “reference” genes (e.g., photosynthetic genes, viral genes) arranged in a contiguous block. Contigs are first coarsely filtered by the minimum number of reference genes they carry, then finely scanned for clusters that satisfy user-defined density and contiguity criteria. Each detected cluster is returned with a unique gene_cluster identifier.

Usage

gc_cal(
  Data = bin_genes,
  in_gene_list = photosynthesis_gene_list,
  AllGeneNum = 30,
  MinConSeq = 15
)

Value

A data frame with the same columns as Data, filtered to contain only the rows that belong to valid clusters. An extra column, gene_cluster (format: genome_contig---N, where N is the running cluster number on that contig), uniquely labels every cluster.
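
Because each label joins the contig identifier and a cluster number with "---", the two parts can be recovered with base R string functions. The following is a minimal sketch; the label values are made up for illustration and do not come from real output.

```r
# Hypothetical gene_cluster labels in the documented genome_contig---N format.
labels <- c("genomeA_contig1---1", "genomeA_contig1---2", "genomeB_contig7---1")

# Strip the trailing "---N" to recover the contig identifier.
contig <- sub("---[0-9]+$", "", labels)

# Keep only what follows the last "---" to recover the cluster number.
cluster_index <- as.integer(sub("^.*---", "", labels))

# Number of clusters detected per contig.
clusters_per_contig <- table(contig)
```

Anchoring the pattern at the end of the string keeps the split correct even if a contig identifier itself happens to contain "---".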

Arguments

Data: A data frame produced by orf_extract (i.e., a scaled BLAST table). Must include the columns genome_contig, gene, and orf_position.
in_gene_list: A character vector of “reference” gene symbols (e.g., photosynthesis_gene_list) that are expected to appear in the target cluster(s).
AllGeneNum: Integer. Maximum total ORF count (annotated plus hypothetical) that the algorithm is allowed to span when defining a cluster (default: 30).
MinConSeq: Integer. Minimum number of reference genes that must be present and consecutive within the candidate cluster (default: 15). Must satisfy 1 <= MinConSeq <= AllGeneNum.
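
As an illustration of the expected inputs, the sketch below builds a toy table with the three required columns and checks the argument constraints before any call. The contig names, gene symbols, and the assumption that orf_position is an integer rank along the contig are all hypothetical; real tables come from orf_extract and may carry additional columns.

```r
# Hypothetical toy input with the three required columns.
bin_genes <- data.frame(
  genome_contig = rep(c("genomeA_contig1", "genomeB_contig7"), times = c(6, 4)),
  gene = c("pufL", "pufM", "hypothetical", "bchC", "bchX", "bchY",
           "hypothetical", "pufL", "hypothetical", "hypothetical"),
  orf_position = c(1:6, 1:4),  # assumed: ordinal ORF position on the contig
  stringsAsFactors = FALSE
)

# Illustrative reference gene list (not the packaged photosynthesis_gene_list).
photosynthesis_gene_list <- c("pufL", "pufM", "bchC", "bchX", "bchY")

# Check that the required columns are present.
required_cols <- c("genome_contig", "gene", "orf_position")
stopifnot(all(required_cols %in% names(bin_genes)))

# Check the documented constraint 1 <= MinConSeq <= AllGeneNum.
AllGeneNum <- 6
MinConSeq <- 3
stopifnot(MinConSeq >= 1, MinConSeq <= AllGeneNum)

# Count reference genes per contig to see which contigs could qualify.
is_ref <- bin_genes$gene %in% photosynthesis_gene_list
ref_per_contig <- table(bin_genes$genome_contig[is_ref])
```

With these toy values, genomeA_contig1 carries five reference genes and genomeB_contig7 carries one, so only the former could meet a MinConSeq of 3.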

Details

Coarse filter: Contigs carrying fewer than MinConSeq reference genes in total are discarded, since no window on such a contig can meet the fine-scan threshold.
Fine scan: For each remaining contig, the algorithm slides a window that can encompass up to AllGeneNum consecutive ORFs and retains windows that contain at least MinConSeq reference genes in uninterrupted order.
Cluster labelling: Each valid cluster receives a unique ID (genome_contig---1, genome_contig---2, …).
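
The three steps can be sketched in base R as follows. This is not the source of gc_cal: the function name gc_cal_sketch is hypothetical, and the sketch makes two simplifying assumptions. First, a window qualifies when it simply contains at least MinConSeq reference genes (the real function may enforce a stricter "uninterrupted order" test). Second, overlapping or adjacent qualifying windows are merged into one cluster.

```r
# Simplified sketch of the coarse filter, fine scan, and labelling steps.
gc_cal_sketch <- function(Data, in_gene_list, AllGeneNum = 30, MinConSeq = 15) {
  stopifnot(MinConSeq >= 1, MinConSeq <= AllGeneNum)
  pieces <- list()
  for (contig in unique(Data$genome_contig)) {
    d <- Data[Data$genome_contig == contig, , drop = FALSE]
    d <- d[order(d$orf_position), , drop = FALSE]
    is_ref <- d$gene %in% in_gene_list

    # Coarse filter: skip contigs with too few reference genes overall.
    if (sum(is_ref) < MinConSeq) next

    # Fine scan: slide a window of up to AllGeneNum consecutive ORFs and
    # mark every ORF covered by a window holding >= MinConSeq reference genes.
    n <- nrow(d)
    width <- min(AllGeneNum, n)
    keep <- rep(FALSE, n)
    for (start in seq_len(n - width + 1)) {
      idx <- start:(start + width - 1)
      if (sum(is_ref[idx]) >= MinConSeq) keep[idx] <- TRUE
    }
    if (!any(keep)) next

    # Cluster labelling: each unbroken run of kept ORFs becomes one cluster.
    runs <- rle(keep)
    run_id <- rep(seq_along(runs$values), runs$lengths)
    cluster_no <- match(run_id[keep], unique(run_id[keep]))
    d <- d[keep, , drop = FALSE]
    d$gene_cluster <- paste0(contig, "---", cluster_no)
    pieces[[length(pieces) + 1]] <- d
  }
  if (length(pieces) == 0) {
    out <- Data[0, , drop = FALSE]
    out$gene_cluster <- character(0)
    return(out)
  }
  do.call(rbind, pieces)
}

# Toy data (hypothetical): contigA has four reference genes, contigB has one.
toy <- data.frame(
  genome_contig = rep(c("contigA", "contigB"), times = c(8, 3)),
  gene = c("pufL", "pufM", "hyp", "bchC", "bchX", "hyp", "hyp", "hyp",
           "hyp", "pufL", "hyp"),
  orf_position = c(1:8, 1:3),
  stringsAsFactors = FALSE
)
ref_genes <- c("pufL", "pufM", "bchC", "bchX")

res <- gc_cal_sketch(toy, ref_genes, AllGeneNum = 5, MinConSeq = 4)
```

In this toy run, contigB is dropped by the coarse filter, and only the window covering ORFs 1 to 5 of contigA holds four reference genes, so res contains those five rows labelled contigA---1.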