This dataset contains protein sequences and their corresponding secondary structures, including beta-sheets (E), helices (H), and coils (_).
Usage
data(protein)
Format
A data frame with 2 columns holding protein sequences and their secondary structures.
V1
Amino acid sequence (using 3-letter codes).
V2
Secondary structure of the protein (E for beta-sheet, H for helix, _ for coil).
Details
The dataset is intended for predicting protein secondary structure from amino acid sequence. The first few numbers in each sequence are neural-network parameters and should be ignored. The '<' symbol serves as a spacer between proteins, marking the beginning and end of each sequence.
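Examples
The spacer convention described above can be sketched as follows. This is a minimal illustration, not the package's own code. It assumes one residue per row, assumes that spacer rows carry '<' in V1, and assumes that any leading numeric parameters have already been dropped. The data frame `toy` and its values are invented for illustration and are not taken from the real dataset.

```r
# Hypothetical stand-in for the real data; all values are made up.
# Assumption: one residue per row, with '<' in V1 marking a spacer row.
toy <- data.frame(
  V1 = c("<", "Gly", "Ala", "Val", "<", "Leu", "Ser", "<"),
  V2 = c(NA,  "_",   "H",   "H",   NA,  "E",   "_",   NA),
  stringsAsFactors = FALSE
)

# Rows whose first column is the '<' spacer mark protein boundaries.
is_spacer <- toy$V1 == "<"

# Number each protein by counting the spacers seen so far,
# then keep only the residue rows.
protein_id <- cumsum(is_spacer)[!is_spacer]
residues <- toy[!is_spacer, ]

# One data frame per protein.
proteins <- split(residues, protein_id)

# Count residues in each secondary-structure class (E, H, _).
structure_counts <- table(factor(residues$V2, levels = c("E", "H", "_")))
print(structure_counts)
```

If the real data frame follows the same layout, replacing `toy` with the object loaded by data(protein) should give one list element per protein. The row layout is an assumption and is worth checking with head(protein) first.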