This function screens contigs for regions in which a
pre-defined set of “reference” genes (e.g., photosynthetic genes, viral genes)
is arranged in a contiguous block. Contigs are
first coarsely filtered by the minimum number of reference genes
they carry, then finely scanned for clusters that satisfy user-
defined density and contiguity criteria. Each detected cluster
is returned with a unique gene_cluster identifier.
gc_cal(
Data = bin_genes,
in_gene_list = photosynthesis_gene_list,
AllGeneNum = 30,
MinConSeq = 15
)

A data frame identical in structure to Data but filtered to
contain only those rows that belong to valid clusters. An extra
column gene_cluster (format: genome_contig---N) is added
to uniquely label every cluster.
Data: A data frame produced by orf_extract (i.e., a scaled
BLAST table). Must include the columns genome_contig,
gene, and orf_position.
in_gene_list: A character vector of “reference” gene symbols (e.g.,
photosynthesis_gene_list) that are expected
to appear in the target cluster(s).
AllGeneNum: Integer. Maximum total ORF count (annotated plus
hypothetical) that the algorithm is allowed to span when
defining a cluster (default: 30).
MinConSeq: Integer. Minimum number of reference genes that must be
present and consecutive within the candidate cluster
(default: 15). Must satisfy 1 <= MinConSeq <= AllGeneNum.
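The argument requirements above can be checked before calling the function. The following is a minimal sketch, not part of the package: check_gc_cal_input is a hypothetical helper, the toy rows are made up for illustration, and orf_position is assumed here to be an integer index of the ORF along its contig.

```r
# Toy input with the three required columns (values invented for illustration).
bin_genes <- data.frame(
  genome_contig = c(rep("binA_contig1", 6), rep("binA_contig2", 3)),
  gene          = c("psbA", "psbD", "hypothetical", "psaA", "psaB", "petE",
                    "psbA", "hypothetical", "hypothetical"),
  orf_position  = c(1:6, 1:3),  # assumption: integer ORF index per contig
  stringsAsFactors = FALSE
)
photosynthesis_gene_list <- c("psbA", "psbD", "psaA", "psaB", "petE")

# Hypothetical helper: verifies the documented preconditions only.
check_gc_cal_input <- function(Data, in_gene_list,
                               AllGeneNum = 30, MinConSeq = 15) {
  required <- c("genome_contig", "gene", "orf_position")
  missing_cols <- setdiff(required, names(Data))
  if (length(missing_cols) > 0) {
    stop("Data is missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!is.character(in_gene_list)) {
    stop("in_gene_list must be a character vector")
  }
  if (MinConSeq < 1 || MinConSeq > AllGeneNum) {
    stop("MinConSeq must satisfy 1 <= MinConSeq <= AllGeneNum")
  }
  invisible(TRUE)
}

# Passes silently: all columns present and 1 <= 3 <= 6.
check_gc_cal_input(bin_genes, photosynthesis_gene_list,
                   AllGeneNum = 6, MinConSeq = 3)
```

A call with MinConSeq larger than AllGeneNum stops with an error instead of producing an empty or misleading result.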
Coarse filter: Contigs with fewer than MinConSeq reference
genes are discarded.
Fine scan: For each remaining contig, the algorithm slides a
window that can encompass up to AllGeneNum consecutive ORFs
and retains windows that contain at least MinConSeq reference
genes in uninterrupted order.
Cluster labelling: Each valid cluster receives a unique ID
(genome_contig---1, genome_contig---2, …).
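The three steps above can be sketched roughly as follows. This is an illustrative approximation, not the actual gc_cal source: sketch_gc_cal is a hypothetical name, and it adopts one reading of the criteria, namely that a window qualifies when it spans at most AllGeneNum ORFs (ordered by orf_position) and holds at least MinConSeq reference genes. The real function may apply a stricter test of contiguity.

```r
# Hypothetical sketch of the coarse filter, fine scan, and labelling steps.
sketch_gc_cal <- function(Data, in_gene_list,
                          AllGeneNum = 30, MinConSeq = 15) {
  stopifnot(MinConSeq >= 1, MinConSeq <= AllGeneNum)
  out <- list()
  for (contig in unique(Data$genome_contig)) {
    d <- Data[Data$genome_contig == contig, , drop = FALSE]
    d <- d[order(d$orf_position), , drop = FALSE]
    is_ref <- d$gene %in% in_gene_list

    # Coarse filter: skip contigs with too few reference genes overall.
    if (sum(is_ref) < MinConSeq) next

    n <- nrow(d)
    i <- 1  # window start
    k <- 0  # cluster counter for this contig
    while (i <= n) {
      j <- min(i + AllGeneNum - 1, n)       # window end (at most AllGeneNum ORFs)
      hits <- which(is_ref[i:j]) + i - 1    # row indices of reference genes
      if (length(hits) >= MinConSeq) {
        # Trim the window to its first and last reference gene, then label it.
        k <- k + 1
        rows <- d[min(hits):max(hits), , drop = FALSE]
        rows$gene_cluster <- paste0(contig, "---", k)
        out[[length(out) + 1]] <- rows
        i <- max(hits) + 1                  # continue after this cluster
      } else {
        i <- i + 1
      }
    }
  }
  if (length(out) == 0) {
    # No cluster found: return an empty frame with the extra column.
    empty <- Data[0, , drop = FALSE]
    empty$gene_cluster <- character(0)
    return(empty)
  }
  do.call(rbind, out)
}

# Toy data (invented): contig c1 has three reference genes within five ORFs,
# contig c2 has only one and is dropped by the coarse filter.
toy <- data.frame(
  genome_contig = c(rep("c1", 8), rep("c2", 4)),
  gene = c("hyp", "psbA", "psbD", "hyp", "psaA", "hyp", "hyp", "hyp",
           "psbA", "hyp", "hyp", "hyp"),
  orf_position = c(1:8, 1:4),
  stringsAsFactors = FALSE
)
res <- sketch_gc_cal(toy, c("psbA", "psbD", "psaA"),
                     AllGeneNum = 5, MinConSeq = 3)
```

With this toy input, res holds the four ORFs at positions 2 to 5 of c1, all labelled c1---1. The sketch reports non-overlapping clusters by restarting the scan after each accepted window, which is a design choice of the sketch rather than documented behaviour.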