protflow.utils package
Submodules
protflow.utils.biopython_tools module
This module provides a collection of utilities for working with BioPython, specifically designed to facilitate the analysis and manipulation of protein structures and sequences. The functionalities included in this module allow users to load, save, and superimpose protein structures from PDB files, as well as extract and analyze sequences from these structures.
Overview
The module encompasses a range of tools to handle various tasks related to protein structural data. Users can load structures from PDB files, save structures back to PDB files, and perform structural superimpositions based on specific atoms or motifs. Additionally, it provides methods to extract sequences from protein structures, renumber residues, and add chains to structures. For sequence analysis, it includes functionalities to load sequences from FASTA files and compute various protein properties using the Bio.SeqUtils.ProtParam class.
Examples
Here are some examples of how to use the functions provided in this module:
Loading a structure from a PDB file:
    from protflow.utils.biopython_tools import load_structure_from_pdbfile
    structure = load_structure_from_pdbfile("example.pdb")
Saving a structure to a PDB file:
    from protflow.utils.biopython_tools import save_structure_to_pdbfile
    save_structure_to_pdbfile(structure, "output.pdb")
Superimposing one structure onto another:
    from protflow.utils.biopython_tools import superimpose
    superimposed_structure = superimpose(mobile_structure, target_structure)
Extracting sequence from a protein structure:
    from protflow.utils.biopython_tools import get_sequence_from_pose
    sequence = get_sequence_from_pose(structure)
Loading a sequence from a FASTA file:
    from protflow.utils.biopython_tools import load_sequence_from_fasta
    sequence_record = load_sequence_from_fasta("example.fasta")
Calculating protein parameters from a sequence:
    from protflow.utils.biopython_tools import determine_protparams
    parameters = determine_protparams(sequence_record.seq)
These examples illustrate the primary capabilities of the module, showcasing how it can be utilized to streamline the process of working with protein structures and sequences in BioPython.
- protflow.utils.biopython_tools.add_chain(target, reference, copy_chain, translate_x=None, overwrite=False)[source]
Add a chain from a reference structure to a target structure, optionally translating it and handling ID conflicts.
- Parameters:
target (Structure) – The structure to which the chain will be added.
reference (Structure) – The structure from which the chain will be copied.
copy_chain (str) – The chain ID in the reference structure to be copied.
translate_x (float, optional) – The distance by which to translate the new chain along the x-axis (default is None).
overwrite (bool, optional) – Whether to overwrite the chain in the target structure if a chain with the same ID already exists. If False and a conflict occurs, a new unique chain ID will be generated (default is False).
- Returns:
The updated target structure with the added chain.
- Return type:
Structure
- protflow.utils.biopython_tools.biopython_load_protein(protein_path, model_id=None, handle='structure', file_type=None)[source]
Loads a protein into a BioPython Structure or Model object, regardless of whether the file is in .pdb or .cif format. The file_type parameter explicitly selects which loader is used and can be one of {'cif', 'pdb', None}.
- protflow.utils.biopython_tools.determine_protparams(seq, pH=7)[source]
Calculate protein features based on a sequence.
This function calculates various protein properties from an input sequence using BioPython’s Bio.SeqUtils.ProtParam class. The results are returned in a pandas DataFrame.
Parameters:
- seq : Union[str, Bio.SeqRecord.SeqRecord, Bio.Seq.Seq]
The input sequence for which the protein properties will be calculated. The input can be a string, SeqRecord, or Seq object.
- pH : float, optional
The pH value at which to calculate the protein’s charge. Defaults to 7.
Returns:
- pandas.DataFrame
A DataFrame containing the calculated protein properties, including molecular weight, aromaticity, GRAVY, isoelectric point, molar extinction coefficients, flexibility, secondary structure fraction, and charge at the specified pH.
Example:
Calculate protein properties for a given sequence:
    from protflow.utils.biopython_tools import determine_protparams
    from Bio.Seq import Seq

    # Define a protein sequence
    sequence = Seq("MSTHRRRPQEAAGRVNRLPGTPLARAKYFYPKPGERKVEQTPWFAWDVTAGNEYEDTIEFRLEAEGKVGEVVEREDPDNGRGNFARFSLGLYGSKTQYRLPFTVEEVFHDLESVTQKDGFWNCTAFRTVQRLPRTRVAAELNPRAKAAASAVFTFQSQDVDAVANAVEACFAGFYEVVGVFVSNAVDGSVAGAQNFSQFCVGFRGGPRMLRQNRAPATFASAGNHPAKVLAACGLRYAA")

    # Calculate properties
    properties_df = determine_protparams(sequence)

    # Print properties
    print(properties_df)
Notes:
The function supports input sequences in various formats, including strings, SeqRecord, and Seq objects.
The calculated properties include:
- Molecular weight
- Aromaticity
- GRAVY (Grand Average of Hydropathy)
- Isoelectric point
- Molar extinction coefficient (reduced and oxidized cysteines)
- Flexibility
- Secondary structure fraction (helix, turn, sheet)
- Charge at the specified pH
The function raises a TypeError if the input sequence is not in a recognized format.
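To make two of the listed properties concrete, the following is a minimal pure-Python sketch of GRAVY and aromaticity. The helper names `gravy` and `aromaticity` are hypothetical and are not part of protflow; the module itself delegates these calculations to Bio.SeqUtils.ProtParam, so this only illustrates the underlying definitions (Kyte-Doolittle hydropathy averaged over the sequence, and the fraction of Phe, Trp and Tyr residues).

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aromaticity(seq: str) -> float:
    """Fraction of residues that are Phe (F), Trp (W) or Tyr (Y)."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


# Hypothetical toy sequences, chosen only for illustration.
print(gravy("AILV"))
print(aromaticity("FWYA"))  # 0.75
```

Both helpers assume a non-empty sequence made up of the 20 standard one-letter codes; any other character raises a KeyError in `gravy`.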
- protflow.utils.biopython_tools.get_atoms(structure, atoms, chains=None, include_het_atoms=False)[source]
Extract specific atoms from one or more chains in a Biopython Structure.
Parameters:
- structure (Bio.PDB.Structure.Structure):
The input structure that will be searched.
- atoms (list[str]):
Atom names to extract (e.g. ["N", "CA", "C", "O"]). If an empty list or None is supplied, all atoms in each selected residue are returned.
- chains (list[str], optional):
Chain identifiers to restrict the search (e.g. ["A", "B"]). When None (default), every chain in structure is processed.
- include_het_atoms (bool, optional):
If True, hetero-atoms and ligands are included in addition to standard amino-acid residues. Defaults to False (only ATOM records where residue.id[0] == " ").
Returns:
- list[Bio.PDB.Atom.Atom]:
A list of Atom objects matching the query, in the order they appear in the structure.
- protflow.utils.biopython_tools.get_atoms_of_motif(pose, motif, atoms=None, excluded_atoms=None, exclude_hydrogens=True, include_het_atoms=False)[source]
Select atoms from a structure based on a provided motif.
This function extracts atoms from a structure based on a specified motif, which is defined by a list of residues. The user can specify which atoms to include or exclude, and whether to exclude hydrogen atoms.
Parameters:
- pose : Bio.PDB.Structure
The BioPython Structure object from which atoms are to be extracted.
- motif : ResidueSelection
A selection of residues defining the motif from which atoms will be extracted.
- atoms : list of str, optional
A list of atom names to extract from the residues. If not provided, all atoms in the residues are extracted.
- excluded_atoms : list of str, optional
A list of atom names to exclude from the extraction. Defaults to ["H", "NE1", "OXT"].
- exclude_hydrogens : bool, optional
If True, hydrogen atoms are excluded from the extraction. Defaults to True.
Returns:
- list
A list of Bio.PDB.Atom objects corresponding to the specified motif and atom selection.
Example:
Extract CA atoms from a specified motif:
    from protflow.utils.biopython_tools import get_atoms_of_motif, load_structure_from_pdbfile

    # Load structure
    structure = load_structure_from_pdbfile("example.pdb")

    # Define motif (example)
    motif = [("A", 10), ("A", 11), ("A", 12)]

    # Get CA atoms from the motif
    atoms = get_atoms_of_motif(structure, motif, atoms=["CA"])
Notes:
The function allows for flexibility in defining the selection of atoms based on the motif.
If exclude_hydrogens is True, hydrogen atoms will not be included in the output list.
- protflow.utils.biopython_tools.get_next_chain_id(existing_ids)[source]
Generate the next available chain ID for a protein structure.
Chain IDs are generated as single letters from ‘A’ to ‘Z’ and, if all single-letter chain IDs are occupied, as double-letter combinations from ‘AA’ to ‘ZZ’.
- Parameters:
existing_ids (iterable) – A collection of existing chain IDs to avoid.
- Returns:
The next available chain ID that is not in existing_ids.
- Return type:
str
- Raises:
ValueError – If no available chain IDs are found.
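The ID scheme described above (single letters first, then two-letter combinations) can be sketched in a few lines. This is an illustrative reimplementation under the stated rules, not the library's source; the name `next_chain_id` is hypothetical.

```python
from itertools import product
from string import ascii_uppercase


def next_chain_id(existing_ids):
    """Return the first free ID from 'A'..'Z', then 'AA'..'ZZ'."""
    taken = set(existing_ids)

    # Single-letter IDs are tried first.
    for letter in ascii_uppercase:
        if letter not in taken:
            return letter

    # Fall back to two-letter IDs in the order AA, AB, ..., ZZ.
    for first, second in product(ascii_uppercase, repeat=2):
        candidate = first + second
        if candidate not in taken:
            return candidate

    raise ValueError("No available chain IDs.")


print(next_chain_id(["A", "B"]))  # C
```

Converting `existing_ids` to a set up front keeps each membership check constant-time, which matters little for 702 candidates but keeps the intent clear.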
- protflow.utils.biopython_tools.get_sequence_from_pose(pose, chain_sep=':', with_chains=False, sort_residues=True, custom_one_letter=None)[source]
Extracts the sequence of peptides from a protein structure.
Parameters:
- pose (Bio.PDB.Structure.Structure): A BioPython Protein Data Bank (PDB) structure object containing the protein’s atomic coordinates.
- chain_sep (str, optional): Separator used to join the sequences of individual peptides. Default is ":".
- with_chains (bool, optional): Returns a dictionary with chain ids as keys and sequences as values instead of a single str. Default is False.
- sort_residues (bool, optional): Sort residues on each chain according to residue number, not occurrence in input .pdb file. Default is True.
- custom_one_letter (dict, optional): Assign a custom one-letter code to a non-standard residue 3-letter code in the form {"B": "BAA"}. Default is None.
Returns:
- str: The concatenated sequence of peptides, separated by the specified separator. If with_chains is True, a dictionary with chain ids as keys and sequences as values is returned instead.
Description: This function takes a BioPython PDB structure object 'pose' and extracts the sequences of individual peptides within the structure using the PPBuilder from BioPython. It then joins these sequences into a single string, using 'chain_sep' as a separator. The resulting string represents the concatenated sequence of peptides in the protein structure.
Example:
    >>> structure = Bio.PDB.PDBParser().get_structure("example", "example.pdb")
    >>> sequence = get_sequence_from_pose(structure, "-")
    >>> print(sequence)
    'MSTHRRRPQEAAGRVNRLPGTPLARAKYFYPKPGERKVEQTPWFAWDVTAGNEYEDTIEFRLEAEGKVGEVVEREDPDNGRGNFARFSLGLYGSKTQYRLPFTVEEVFHDLESVTQKDGFWNCTAFRTVQRLPRTRVAAELNPRAKAAASAVFTFQSQDVDAVANAVEACFAGFYEVVGVFVSNAVDGSVAGAQNFSQFCVGFRGGPRMLRQNRAPATFASAGNHPAKVLAACGLRYAA…
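How the parameters interact is easier to see on plain data. The sketch below works on a dictionary of chains instead of a BioPython structure and does not use PPBuilder, so it is only an approximation of the behaviour described above. The function name `sequence_from_chains`, the shortened lookup table, and the choice of "X" for unknown residues are all assumptions made for illustration.

```python
# Shortened three-letter to one-letter table; a real table covers all residues.
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "TRP": "W"}


def sequence_from_chains(chains, chain_sep=":", with_chains=False,
                         sort_residues=True, custom_one_letter=None):
    """chains maps chain_id -> list of (residue_number, three_letter_name)."""
    lookup = dict(THREE_TO_ONE)
    # custom_one_letter has the form {"B": "BAA"}: one-letter -> three-letter.
    for one_letter, three_letter in (custom_one_letter or {}).items():
        lookup[three_letter.upper()] = one_letter

    sequences = {}
    for chain_id, residues in chains.items():
        # Sort by residue number, or keep the order found in the input.
        ordered = sorted(residues) if sort_residues else residues
        sequences[chain_id] = "".join(
            lookup.get(name.upper(), "X") for _, name in ordered
        )

    return sequences if with_chains else chain_sep.join(sequences.values())


chains = {"A": [(2, "GLY"), (1, "ALA")], "B": [(1, "SER"), (2, "BAA")]}
print(sequence_from_chains(chains))                                  # AG:SX
print(sequence_from_chains(chains, custom_one_letter={"B": "BAA"}))  # AG:SB
```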
- protflow.utils.biopython_tools.load_sequence_from_fasta(fasta, return_multiple_entries=True)[source]
Load a sequence from a FASTA file.
This function imports a FASTA file and returns a sequence record or a record iterator depending on the number of entries and the specified options.
Parameters:
- fasta : str
Path to the FASTA file to be imported.
- return_multiple_entries : bool, optional
If True, returns a record iterator for multiple entries. If False, returns a single record even if the file contains multiple entries. Defaults to True.
Returns:
- Bio.SeqRecord.SeqRecord or iterator
A single SeqRecord object if the file contains one entry or return_multiple_entries is False. Otherwise, a record iterator for multiple entries.
Example:
Load a sequence from a single-entry FASTA file:
    from protflow.utils.biopython_tools import load_sequence_from_fasta

    # Load sequence from FASTA file
    sequence_record = load_sequence_from_fasta("example.fasta")
Load sequences from a multi-entry FASTA file:
    from protflow.utils.biopython_tools import load_sequence_from_fasta

    # Load sequences from multi-entry FASTA file
    sequence_iterator = load_sequence_from_fasta("multi_example.fasta")
Notes:
The function utilizes Bio.SeqIO.parse to read the FASTA file and determine the number of entries.
If return_multiple_entries is set to True and the file contains multiple entries, an iterator is returned to handle the sequences.
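The return rule described above (a single record for one entry or when return_multiple_entries is False, otherwise an iterator) can be sketched without BioPython. The parser below is a deliberately small stand-in for Bio.SeqIO.parse and yields plain (header, sequence) tuples instead of SeqRecord objects; `parse_fasta` and `load_records` are hypothetical names.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_records(handle, return_multiple_entries=True):
    """Return one record, or an iterator when several entries are wanted."""
    records = list(parse_fasta(handle))
    if len(records) == 1 or not return_multiple_entries:
        return records[0]
    return iter(records)


fasta_text = ">seq1\nMKT\nLLA\n>seq2\nGGS\n"
print(load_records(io.StringIO(fasta_text), return_multiple_entries=False))
# ('seq1', 'MKTLLA')
```

The sketch raises an IndexError for an empty file; how the library handles that case is not stated in the documentation above.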
- protflow.utils.biopython_tools.load_structure_from_pdbfile(path_to_pdb, all_models=False, model=0, quiet=True, handle=None)[source]
Load a structure from a PDB file using BioPython’s PDBParser.
This function parses a PDB file and returns a structure object. It allows the option to load all models from the PDB file or a specific model.
Parameters:
- path_to_pdb (str):
Path to the PDB file to be parsed.
- all_models (bool, optional):
If True, all models from the PDB file are returned. If False, only the specified model is returned. Defaults to False.
- model (int, optional):
The index of the model to return. Only used if all_models is False. Defaults to 0 (first model).
- quiet (bool, optional):
If True, suppresses output from the PDBParser. Defaults to True.
- handle (str, optional):
String handle that is passed to the PDBParser’s get_structure() method and sets the id of the structure.
Returns:
- Bio.PDB.Structure:
The parsed structure object from the PDB file. If all_models is True, returns a Structure containing all models. Otherwise, returns a single Model object at the specified index.
Raises:
- FileNotFoundError:
If the specified PDB file does not exist.
- ValueError:
If the specified model index is out of range for the PDB file.
Example:
To load the first model from a PDB file:
    >>> structure = load_structure_from_pdbfile("example.pdb")
To load all models from a PDB file:
    >>> all_structures = load_structure_from_pdbfile("example.pdb", all_models=True)
- protflow.utils.biopython_tools.one_to_three_AA_code(seq, custom_map=None, undef_code='X')[source]
Converts a sequence in 1-letter code to 3-letter code.
This function converts an input sequence in 1-letter code to 3-letter code using BioPython’s Bio.SeqUtils functions. The results are returned in a string.
Parameters:
- seq : Union[str, Bio.SeqRecord.SeqRecord, Bio.Seq.Seq]
The input sequence in 1-letter code. The input can be a string, SeqRecord, or Seq object.
- custom_map : dict, optional
Use a custom 3-letter code for a given 1-letter code (e.g. for noncanonical residues).
- undef_code : str, optional
Replace all unknown 1-letter codes (e.g. from ligands or noncanonical residues) with this string.
Returns:
- str
A string of all residues in the sequence in three-letter code.
Example:
Convert 1-letter code to 3-letter code:
    from protflow.utils.biopython_tools import one_to_three_AA_code

    # Define a protein sequence
    sequence = "HAW"

    # Convert to 3-letter code
    threeletter_seq = one_to_three_AA_code(sequence)

    # Print the converted sequence
    print(threeletter_seq)
Notes:
The function supports input sequences in various formats, including strings, SeqRecord, and Seq objects.
- protflow.utils.biopython_tools.remove_non_residue_residues(model, remove_hydrogens=False)[source]
Removes non-residue residues from a BioPython Model object, keeping only standard amino acids and modified amino acids.
- Parameters:
model (Model)
remove_hydrogens (bool)
- Return type:
Model
- protflow.utils.biopython_tools.renumber_pdb_by_residue_mapping(pose_path, residue_mapping, out_pdb_path=None, keep_chain='', overwrite=False)[source]
Renumber the residues of a BioPython structure based on a residue mapping.
This function renumbers the residues in a BioPython structure according to a specified mapping. The mapping defines the old and new residue identifiers. The user can choose to keep a specific chain unchanged.
Parameters:
- pose : Bio.PDB.Structure
The BioPython Structure object whose residues will be renumbered.
- residue_mapping : dict
A dictionary mapping old residue identifiers to new identifiers. Format: {(old_chain, old_id): (new_chain, new_id), …}.
- keep_chain : str, optional
The identifier of a chain to keep unchanged. Defaults to an empty string.
Returns:
- Bio.PDB.Structure
The renumbered structure.
Example:
Renumber residues in a structure based on a mapping:
    from protflow.utils.biopython_tools import renumber_pose_by_residue_mapping, load_structure_from_pdbfile

    # Load structure
    structure = load_structure_from_pdbfile("example.pdb")

    # Define residue mapping (example)
    residue_mapping = {("A", 10): ("A", 20), ("A", 11): ("A", 21)}

    # Renumber residues in the structure
    renumbered_structure = renumber_pose_by_residue_mapping(structure, residue_mapping)
Notes:
The function creates a deep copy of the input structure and applies the residue renumbering to the copy.
The keep_chain parameter allows for retaining the original numbering of a specified chain.
- protflow.utils.biopython_tools.renumber_pose_by_residue_mapping(pose, residue_mapping, keep_chain='')[source]
Renumber the residues of a BioPython structure based on a residue mapping.
This function renumbers the residues in a BioPython structure according to a specified mapping. The mapping defines the old and new residue identifiers. The user can choose to keep a specific chain unchanged.
Parameters:
- pose : Bio.PDB.Structure
The BioPython Structure object whose residues will be renumbered.
- residue_mapping : dict
A dictionary mapping old residue identifiers to new identifiers. Format: {(old_chain, old_id): (new_chain, new_id), …}.
- keep_chain : str, optional
The identifier of a chain to keep unchanged. Defaults to an empty string.
Returns:
- Bio.PDB.Structure
The renumbered structure.
Example:
Renumber residues in a structure based on a mapping:
    from protflow.utils.biopython_tools import renumber_pose_by_residue_mapping, load_structure_from_pdbfile

    # Load structure
    structure = load_structure_from_pdbfile("example.pdb")

    # Define residue mapping (example)
    residue_mapping = {("A", 10): ("A", 20), ("A", 11): ("A", 21)}

    # Renumber residues in the structure
    renumbered_structure = renumber_pose_by_residue_mapping(structure, residue_mapping)
Notes:
The function creates a deep copy of the input structure and applies the residue renumbering to the copy.
The keep_chain parameter allows for retaining the original numbering of a specified chain.
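The mapping format is easiest to follow on plain tuples. The sketch below applies a {(old_chain, old_id): (new_chain, new_id)} mapping to a list of (chain, residue number) pairs and returns a new list, mirroring the copy-then-renumber behaviour noted above. It is not the library's implementation: the name `renumber` is hypothetical, and leaving unmapped residues untouched is an assumption.

```python
def renumber(residues, residue_mapping, keep_chain=""):
    """Return a renumbered copy of a list of (chain_id, residue_number)."""
    renumbered = []
    for chain_id, number in residues:
        if chain_id == keep_chain:
            # Residues on the protected chain keep their original identifiers.
            renumbered.append((chain_id, number))
        else:
            # Assumption: residues missing from the mapping stay as they are.
            new_id = residue_mapping.get((chain_id, number), (chain_id, number))
            renumbered.append(new_id)
    return renumbered


residues = [("A", 10), ("A", 11), ("B", 1)]
mapping = {("A", 10): ("A", 20), ("A", 11): ("A", 21), ("B", 1): ("C", 5)}

print(renumber(residues, mapping))
# [('A', 20), ('A', 21), ('C', 5)]
print(renumber(residues, mapping, keep_chain="B"))
# [('A', 20), ('A', 21), ('B', 1)]
```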
- protflow.utils.biopython_tools.save_structure_to_pdbfile(pose, save_path, multimodel=False)[source]
Save a BioPython structure object to a PDB file.
This function takes a BioPython Structure object and writes it to a specified file in PDB format. It is useful for saving modified structures or for converting structures into PDB files for further analysis or visualization.
Parameters:
- pose : Bio.PDB.Structure
The BioPython Structure object to be saved.
- save_path : str
The file path where the PDB file will be written. The file will be created if it does not exist, or overwritten if it does.
- multimodel : bool
If the structure to be saved is a multimodel PDB file, write all models. Only works if the input is a Structure object, not a Model.
Returns:
None
Raises:
- IOError
If the file cannot be written to the specified path.
Example:
Save a BioPython structure to a PDB file:
    from protflow.utils.biopython_tools import save_structure_to_pdbfile
    from Bio.PDB import PDBParser

    # Load a structure using BioPython's PDBParser
    parser = PDBParser()
    structure = parser.get_structure("example", "example.pdb")

    # Save the structure to a new PDB file
    save_structure_to_pdbfile(structure, "output.pdb")
- protflow.utils.biopython_tools.sort_residues_on_chain(pose)[source]
Sorts all residues on each chain according to residue number.
- Parameters:
pose (Model | Structure)
- protflow.utils.biopython_tools.split_complex(path, work_dir, ligand_name)[source]
Split a structure file into a ligand SDF and a protein PDB.
Handles both PDB and CIF inputs. Only the residue matching ligand_name is written to SDF; all ATOM residues are written to PDB.
- protflow.utils.biopython_tools.superimpose(mobile, target, mobile_atoms=None, target_atoms=None)[source]
Superimpose a mobile structure onto a target structure based on specified atoms.
This function performs structural superimposition of a mobile protein structure onto a target protein structure. The superimposition can be based on specified lists of atoms. If no specific atoms are provided, the superimposition is based on the backbone atoms (N, CA, O).
Parameters:
- mobile : Bio.PDB.Structure
The BioPython Structure object representing the mobile structure to be superimposed.
- target : Bio.PDB.Structure
The BioPython Structure object representing the target structure.
- mobile_atoms : list, optional
A list of atoms from the mobile structure to be used for superimposition. If not provided, defaults to the backbone atoms.
- target_atoms : list, optional
A list of atoms from the target structure to be used for superimposition. If not provided, defaults to the backbone atoms.
Returns:
- Bio.PDB.Structure
The mobile structure after superimposition onto the target structure.
Example:
Superimpose a mobile structure onto a target structure based on backbone atoms:
    from protflow.utils.biopython_tools import superimpose, load_structure_from_pdbfile

    # Load structures
    mobile_structure = load_structure_from_pdbfile("mobile.pdb")
    target_structure = load_structure_from_pdbfile("target.pdb")

    # Superimpose mobile structure onto target structure
    superimposed_structure = superimpose(mobile_structure, target_structure)
Notes:
If no specific atoms are provided, the function defaults to using the backbone atoms (N, CA, O) for superimposition.
The superimposed structure is modified in place and returned.
- protflow.utils.biopython_tools.superimpose_on_motif(mobile, target, mobile_atoms=None, target_atoms=None, atom_list=None)[source]
Superimpose a mobile structure onto a target structure based on specified motifs or atom lists.
This function performs structural superimposition of a mobile protein structure onto a target protein structure. The superimposition can be based on specified motifs or lists of atoms. If no specific atoms are provided, the superimposition is based on the alpha carbon (CA) atoms.
Parameters:
- mobile : Bio.PDB.Structure
The BioPython Structure object representing the mobile structure to be superimposed.
- target : Bio.PDB.Structure
The BioPython Structure object representing the target structure.
- mobile_atoms : ResidueSelection, optional
A selection of residues from the mobile structure to be used for superimposition. If not provided, defaults to the backbone atoms.
- target_atoms : ResidueSelection, optional
A selection of residues from the target structure to be used for superimposition. If not provided, defaults to the backbone atoms.
- atom_list : list of str, optional
A list of atom names to use for the superimposition. If not provided, defaults to ["N", "CA", "O"].
Returns:
- Bio.PDB.Structure
The mobile structure after superimposition onto the target structure.
Example:
Superimpose a mobile structure onto a target structure based on CA atoms:
    from protflow.utils.biopython_tools import superimpose_on_motif, load_structure_from_pdbfile

    # Load structures
    mobile_structure = load_structure_from_pdbfile("mobile.pdb")
    target_structure = load_structure_from_pdbfile("target.pdb")

    # Superimpose mobile structure onto target structure
    superimposed_structure = superimpose_on_motif(mobile_structure, target_structure)
Notes:
If no specific atoms or motifs are provided, the function defaults to using the backbone atoms (N, CA, O) for superimposition.
The superimposed structure is modified in place and returned.
- Parameters:
mobile (Structure)
target (Structure)
mobile_atoms (ResidueSelection)
target_atoms (ResidueSelection)
- Return type:
Structure
- protflow.utils.biopython_tools.three_to_one_AA_code(seq, custom_map=None, undef_code='X')[source]
Converts a sequence in 3-letter code to 1-letter code.
This function converts an input sequence in 3-letter code to 1-letter code using BioPython’s Bio.SeqUtils functions. The results are returned in a string.
Parameters:
- seq : Union[str, Bio.SeqRecord.SeqRecord, Bio.Seq.Seq]
The input sequence in 3-letter code. The input can be a string, SeqRecord, or Seq object.
- custom_map : dict, optional
Use a custom 1-letter code for a given 3-letter code (e.g. for noncanonical residues).
- undef_code : str, optional
Replace all unknown 3-letter codes (e.g. from ligands or noncanonical residues) with this string.
Returns:
- str
A string of all residues in the sequence in one-letter code.
Example:
Convert 3-letter code to 1-letter code:
    from protflow.utils.biopython_tools import three_to_one_AA_code

    # Define a protein sequence
    sequence = "HisAlaTrp"

    # Convert to 1-letter code
    oneletter_seq = three_to_one_AA_code(sequence)

    # Print the converted sequence
    print(oneletter_seq)
Notes:
The function supports input sequences in various formats, including strings, SeqRecord, and Seq objects.
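As a rough illustration of how custom_map and undef_code could interact, the sketch below splits a concatenated 3-letter string into chunks of three and looks each one up. It is a pure-Python stand-in, not the library code (which relies on Bio.SeqUtils); the name `three_to_one` and the case-insensitive lookup are assumptions.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(seq, custom_map=None, undef_code="X"):
    """Convert a concatenated 3-letter sequence such as 'HisAlaTrp'."""
    table = dict(THREE_TO_ONE)
    # Custom entries (3-letter -> 1-letter) override or extend the defaults.
    for three_letter, one_letter in (custom_map or {}).items():
        table[three_letter.upper()] = one_letter

    text = str(seq)
    codes = [text[i:i + 3].upper() for i in range(0, len(text), 3)]
    # Anything not found in the table is replaced by undef_code.
    return "".join(table.get(code, undef_code) for code in codes)


print(three_to_one("HisAlaTrp"))                           # HAW
print(three_to_one("HisMseTrp"))                           # HXW
print(three_to_one("HisMseTrp", custom_map={"MSE": "M"}))  # HMW
```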
- protflow.utils.biopython_tools.translate_entity(entity, vector)[source]
Translates all atom coordinates in the given entity by the specified vector.
Parameters:
- entity : Bio.PDB.Entity.Entity
The entity (Structure, Model, Chain, or Residue) whose coordinates will be translated.
- vector : array-like of shape (3,)
The translation vector (dx, dy, dz).
Returns:
- None
The entity is modified in place.
- Parameters:
entity (Bio.PDB.Entity.Entity)
vector (array)
- Return type:
None
protflow.utils.metrics module
This module provides a comprehensive suite of tools for calculating various metrics related to proteins and their sequences, specifically designed to facilitate detailed analysis and comparisons. The functionalities in this module allow users to determine mutations between protein sequences, calculate structural metrics such as the radius of gyration, and compute sequence identities, among other tasks.
Overview: The module encompasses a range of utilities aimed at analyzing protein structures and sequences. Users can compare protein sequences to identify mutations, calculate the radius of gyration from PDB files, and assess sequence identity both pairwise and across multiple sequences. Additionally, it provides methods for evaluating protein entropy, ligand interactions, and structural consistency metrics such as self-consistency TM-score and Baker in silico success scores.
Key functionalities include:
- Mutation Analysis: Tools to list and count mutations between wild-type and variant sequences, and to identify mutation indices.
- Structural Metrics: Calculation of the radius of gyration for protein structures and evaluation of ligand clashes and contacts.
- Sequence Analysis: Computation of sequence identity between two sequences and across a list of sequences.
- Entropy Calculation: Calculation of entropy based on a given probability distribution.
- Self-Consistency and Success Scores: Methods to compute self-consistency TM-scores and Baker in silico success scores within dataframes.
- Ligand Interaction Analysis: Evaluation of ligand clashes and contacts within a protein structure.
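The mutation analysis listed above can be illustrated with a short position-by-position comparison. This is a minimal sketch assuming equal-length, pre-aligned sequences and 1-based positions; `list_mutations` and `mutation_indices` are hypothetical names, not the module's functions.

```python
def list_mutations(wild_type, variant):
    """Return mutations in <wt residue><position><variant residue> notation."""
    if len(wild_type) != len(variant):
        raise ValueError("Sequences must have the same length.")
    return [
        f"{wt}{position}{var}"
        for position, (wt, var) in enumerate(zip(wild_type, variant), start=1)
        if wt != var
    ]


def mutation_indices(wild_type, variant):
    """Return the 1-based positions at which the two sequences differ."""
    if len(wild_type) != len(variant):
        raise ValueError("Sequences must have the same length.")
    return [
        position
        for position, (wt, var) in enumerate(zip(wild_type, variant), start=1)
        if wt != var
    ]


mutations = list_mutations("ACDEFG", "ACDQFG")
print(len(mutations), mutations)                  # 1 ['E4Q']
print(mutation_indices("ACGTAGCT", "ACCTAGCT"))   # [3]
```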
Examples: Here are some examples of how to use the functions provided in this module:
- Counting mutations between two protein sequences:
    from protflow.utils.metrics import count_mutations
    mutation_count, mutations = count_mutations("ACDEFG", "ACDQFG")
    print(mutation_count, mutations)
    # Output: 1, ['E4Q']
- Calculating radius of gyration from a PDB file:
    from protflow.utils.metrics import calc_rog_of_pdb
    rog = calc_rog_of_pdb("example.pdb")
    print(rog)
- Finding mutation indices between two sequences:
    from protflow.utils.metrics import get_mutation_indeces
    indices = get_mutation_indeces("ACGTAGCT", "ACCTAGCT")
    print(indices)
    # Output: [3]
- Calculating sequence identity between two sequences:
    from protflow.utils.metrics import calc_sequence_identity
    identity = calc_sequence_identity("ACDEFG", "ACDQFG")
    print(identity)
    # Output: 0.8333333333333334
- Computing all-against-all sequence identity for a list of sequences:
    from protflow.utils.metrics import all_against_all_sequence_identity
    identities = all_against_all_sequence_identity(["ACDEFG", "ACDFGG", "ACDEFG"])
    print(identities)
    # Output: [0.8333333333333334, 0.8333333333333334, 1.0]
- Calculating entropy from a probability distribution:
    import numpy as np
    from protflow.utils.metrics import entropy
    prob_dist = np.array([0.1, 0.2, 0.7])
    ent = entropy(prob_dist)
    print(ent)
    # Output: 1.1567796494470395
- Calculating self-consistency TM-score in a dataframe:
    import pandas as pd
    from protflow.utils.metrics import calc_sc_tm
    df = pd.DataFrame({"ref_col": ["A", "B"], "tm_col": [0.9, 0.85]})
    updated_df = calc_sc_tm(df, "sc_tm_score", "ref_col", "tm_col")
    print(updated_df)
These examples illustrate the primary capabilities of the module, showcasing how it can be utilized to streamline the process of analyzing protein structures and sequences.
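The value shown in the entropy example is consistent with Shannon entropy computed with a base-2 logarithm. The sketch below demonstrates that calculation with the standard library only; `shannon_entropy` is a hypothetical name, and whether the module's entropy function uses exactly this formula is an inference from the documented output, not something the source states.

```python
from math import log2


def shannon_entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-zero p."""
    # Zero probabilities contribute nothing and are skipped to avoid log2(0).
    return -sum(p * log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))       # 1.0
print(shannon_entropy([0.1, 0.2, 0.7]))  # approximately 1.1568
```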
- protflow.utils.metrics.all_against_all_sequence_identity(input_seqs)[source]
Calculate the maximum sequence identity for all sequences against each other.
This function takes a list of protein sequences and computes the maximum sequence identity for each sequence against all others in the list.
- Parameters:
input_seqs (list) – A list of protein sequences to compare.
- Returns:
A list of maximum sequence identities for each sequence against all others.
- Return type:
list
Example
    >>> from protflow.utils.metrics import all_against_all_sequence_identity
    >>> identities = all_against_all_sequence_identity(["ACDEFG", "ACDFGG", "ACDEFG"])
    >>> print(identities)
    # Output: [0.8333333333333334, 0.8333333333333334, 1.0]
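The idea of taking, for each sequence, the highest identity to any other sequence can be sketched as follows. The sketch assumes equal-length sequences and defines identity as the fraction of matching positions; the library's own identity definition is not given in this documentation and may differ, so results need not agree with the example above. Both function names are hypothetical.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have the same length.")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def max_identity_to_others(seqs):
    """For each sequence, the highest identity to any other sequence."""
    return [
        max(
            pairwise_identity(seq, other)
            for j, other in enumerate(seqs)
            if j != i  # a sequence is never compared with itself
        )
        for i, seq in enumerate(seqs)
    ]


print(max_identity_to_others(["AAAA", "AAAC", "CCCC"]))
# [0.75, 0.75, 0.25]
```

The comparison is quadratic in the number of sequences, which is acceptable for small design sets but worth keeping in mind for large ones.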
- protflow.utils.metrics.calc_interchain_contacts(pose, chains, contact_bounds=(4, 8), atoms=None)[source]
Calculates contacts between chains in pose
- protflow.utils.metrics.calc_interchain_contacts_pdb(pdb_path, chains, contact_bounds=(4, 8), atoms=None)[source]
Calculates interchain contacts in pose for .pdb file
- protflow.utils.metrics.calc_ligand_clashes(pose, ligand_chain, dist=3, atoms=None, exclude_ligand_hydrogens=False)[source]
Calculate ligand clashes for a PDB file given a ligand chain.
This method calculates the number of clashes between a specified ligand chain and the rest of the structure in a PDB file or a Bio.PDB Structure object. A clash is defined as any pair of atoms (one from the ligand, one from the rest of the structure) that are within a specified distance of each other.
- Parameters:
pose (str | Bio.PDB.Structure.Structure) – The pose representing the structure, which can be a path to a PDB file (str) or a Bio.PDB Structure object.
ligand_chain (str) – The chain identifier for the ligand within the structure.
dist (float, optional) – The distance threshold for defining a clash. Default is 3.0.
atoms (list[str], optional) – A list of atom names to consider for clash calculations. If None, all atoms are considered. If specified, only these atoms will be included in the clash calculation.
exclude_ligand_hydrogens (bool)
- Returns:
The number of clashes found between the ligand and the rest of the structure.
- Return type:
float
Examples
Here is an example of how to use the calc_ligand_clashes method:
    from Bio.PDB import PDBParser
    from protflow.utils.metrics import calc_ligand_clashes

    # Load structure from a PDB file
    parser = PDBParser()
    structure = parser.get_structure("example", "example.pdb")

    # Calculate clashes
    clashes = calc_ligand_clashes(structure, ligand_chain="A", dist=3.0, atoms=["N", "CA", "C"])
    # clashes will be a float representing the number of clashes
- Further Details:
Clash Calculation: The method calculates the Euclidean distance between all specified atoms of the ligand chain and the rest of the structure. A clash is counted if the distance is less than the specified threshold.
Usage: This function is useful for evaluating potential steric clashes in molecular docking studies or for validating the positioning of ligands in structural models.
This method is designed to facilitate the detection of steric clashes between ligands and the surrounding structure, providing a quantitative measure of potential conflicts.
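The clash definition above, a ligand atom and a non-ligand atom closer than a fixed threshold, reduces to counting atom pairs. The sketch below works on bare coordinate tuples rather than Bio.PDB objects; `count_clashes` and the toy coordinates are hypothetical and only illustrate the counting rule.

```python
from math import dist


def count_clashes(ligand_coords, protein_coords, threshold=3.0):
    """Count ligand/protein atom pairs closer than `threshold`."""
    return sum(
        1
        for ligand_xyz in ligand_coords
        for protein_xyz in protein_coords
        if dist(ligand_xyz, protein_xyz) < threshold
    )


ligand = [(0.0, 0.0, 0.0)]
protein = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 2.9, 0.0)]
print(count_clashes(ligand, protein))  # 2
```

For large structures a neighbour search (for example a k-d tree) avoids the all-pairs loop, at the cost of an extra dependency.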
- protflow.utils.metrics.calc_ligand_clashes_vdw(pose, ligand_chain, factor=1, atoms=None, exclude_ligand_elements=None)[source]
Calculate ligand clashes for a PDB file given a ligand chain.
This method calculates the number of clashes between a specified ligand chain and the rest of the structure in a PDB file or a Bio.PDB Structure object. A clash is defined as any pair of atoms (one from the ligand, one from the rest of the structure) that are within the sum of their Van der Waals radii multiplied by a factor.
- Parameters:
pose (str | Bio.PDB.Structure.Structure) – The pose representing the structure, which can be a path to a PDB file (str) or a Bio.PDB Structure object.
ligand_chain (str) – The chain identifier for the ligand within the structure.
factor (float, optional) – The multiplier applied to the summed Van der Waals radii to obtain the clash threshold. Lower values result in less stringent clash detection. Default is 1.0.
atoms (list[str], optional) – A list of atom names to consider for clash calculations. If None, all atoms are considered. If specified, only these atoms will be included in the clash calculation.
exclude_ligand_elements (list[str], optional) – A list of elements that should not be considered during clash detection (e.g. ['H']). Default is None.
- Returns:
The number of clashes found between the ligand and the rest of the structure.
- Return type:
Examples
Here is an example of how to use the calc_ligand_clashes_vdw method:
from Bio.PDB import PDBParser
from protflow.utils.metrics import calc_ligand_clashes_vdw
# Load structure from a PDB file
parser = PDBParser()
structure = parser.get_structure("example", "example.pdb")
# Calculate clashes
clashes = calc_ligand_clashes_vdw(structure, ligand_chain="A", factor=0.8, atoms=["N", "CA", "C"], exclude_ligand_elements=["H"])
# clashes will be a float representing the number of clashes
- Further Details:
Clash Calculation: The method calculates the Euclidean distance between all specified atoms of the ligand chain and the rest of the structure. A clash is detected if the distance is less than the sum of their Van der Waals radii multiplied by a set factor.
Usage: This function is useful for evaluating potential steric clashes in molecular docking studies or for validating the positioning of ligands in structural models.
This method is designed to facilitate the detection of steric clashes between ligands and the surrounding structure, providing a quantitative measure of potential conflicts.
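The Van der Waals criterion can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper count_vdw_clashes and the (element, coordinates) tuples are hypothetical, and the radii are Bondi values for a few common elements, which may differ from the table the library uses.

```python
import math

# Bondi Van der Waals radii in angstrom; the library's own radii table is an assumption
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def count_vdw_clashes(ligand_atoms, other_atoms, factor=1.0):
    """Count pairs closer than factor * (r_i + r_j); atoms are (element, (x, y, z))."""
    clashes = 0
    for lig_elem, lig_xyz in ligand_atoms:
        for oth_elem, oth_xyz in other_atoms:
            threshold = factor * (VDW_RADII[lig_elem] + VDW_RADII[oth_elem])
            if math.dist(lig_xyz, oth_xyz) < threshold:
                clashes += 1
    return clashes

# Toy atoms (made up): one ligand carbon, two pocket atoms
ligand = [("C", (0.0, 0.0, 0.0))]
pocket = [("O", (3.0, 0.0, 0.0)), ("N", (5.0, 0.0, 0.0))]
strict = count_vdw_clashes(ligand, pocket, factor=1.0)   # C-O threshold 3.22 > 3.0
lenient = count_vdw_clashes(ligand, pocket, factor=0.8)  # C-O threshold 2.576 < 3.0
print(strict, lenient)
```

Lowering the factor shrinks every pairwise threshold, so the same geometry yields fewer clashes, which matches the note on the factor parameter.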
- protflow.utils.metrics.calc_ligand_contacts(pose, ligand_chain, min_dist=3, max_dist=5, atoms=None, excluded_elements=None)[source]
Calculate contacts of a ligand within a structure.
This method calculates the number of contacts between a specified ligand chain and the rest of the structure within a specified distance range. Contacts are defined as any pair of atoms (one from the ligand, one from the rest of the structure) where the distance falls between the minimum and maximum specified distances.
- Parameters:
pose (str | Bio.PDB.Structure.Structure) – The pose representing the structure, which can be a path to a PDB file (str) or a Bio.PDB Structure object.
ligand_chain (str) – The chain identifier for the ligand within the structure.
min_dist (float, optional) – The minimum distance threshold for defining a contact. Default is 3.0.
max_dist (float, optional) – The maximum distance threshold for defining a contact. Default is 5.0.
atoms (list[str], optional) – A list of atom names to consider for contact calculations. If None, all atoms are considered. If specified, only these atoms will be included in the contact calculation.
excluded_elements (list[str], optional) – A list of element symbols to exclude from the contact calculations. Default is ["H"].
- Returns:
The number of contacts normalized by the number of ligand atoms.
- Return type:
Examples
Here is an example of how to use the calc_ligand_contacts method:
from Bio.PDB import PDBParser
from protflow.utils.metrics import calc_ligand_contacts
# Load structure from a PDB file
parser = PDBParser()
structure = parser.get_structure("example", "example.pdb")
# Calculate contacts
contacts = calc_ligand_contacts(structure, ligand_chain="A", min_dist=3.0, max_dist=5.0, atoms=["N", "CA", "C"], excluded_elements=["H", "O"])
# contacts will be a float representing the number of contacts normalized by the number of ligand atoms
- Further Details:
Contact Calculation: The method calculates the Euclidean distance between all specified atoms of the ligand chain and the rest of the structure. A contact is counted if the distance is within the specified range (min_dist to max_dist).
Usage: This function is useful for evaluating potential interactions between ligands and the surrounding structure, particularly in drug design and molecular docking studies.
This method is designed to facilitate the detection of relevant contacts between ligands and the surrounding structure, providing a quantitative measure of potential interactions.
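A minimal sketch of the contact count and its normalization, using plain coordinate tuples. The helper name ligand_contacts is hypothetical, and treating both distance bounds as inclusive is an assumption, since the documentation does not say how boundary values are handled.

```python
import math

def ligand_contacts(ligand_coords, other_coords, min_dist=3.0, max_dist=5.0):
    """Contacts within [min_dist, max_dist], divided by the number of ligand atoms."""
    contacts = 0
    for lig in ligand_coords:
        for other in other_coords:
            # Inclusive bounds are an assumption of this sketch
            if min_dist <= math.dist(lig, other) <= max_dist:
                contacts += 1
    return contacts / len(ligand_coords)

# Toy coordinates in angstrom (made up for illustration)
ligand = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
protein = [(4.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.0, 9.0)]
contacts = ligand_contacts(ligand, protein)
print(contacts)  # 1.0: two contacts shared between two ligand atoms
```

Normalizing by the ligand atom count makes the score comparable between ligands of different sizes.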
- protflow.utils.metrics.calc_rog(pose, min_dist=0, chain=None)[source]
Calculate the radius of gyration of a protein’s alpha carbons.
This function computes the radius of gyration for the alpha carbon atoms (Cα) in a given protein structure.
- Parameters:
- Returns:
The calculated radius of gyration of the protein.
- Return type:
- Raises:
ValueError – If the pose parameter is not of type Bio.PDB.Structure.Structure.
Example
>>> from metrics import calc_rog
>>> from Bio.PDB import PDBParser
>>> parser = PDBParser()
>>> structure = parser.get_structure("example", "example.pdb")
>>> rog = calc_rog(structure)
>>> print(rog)
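For orientation, the unweighted radius of gyration of a set of points is the root-mean-square distance of the points from their centroid. The sketch below applies that definition to plain coordinate tuples; the helper radius_of_gyration is hypothetical, and the effect of the min_dist and chain arguments of calc_rog is not modeled because it is not described here.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (unweighted)."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    mean_sq = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)

# Four made-up C-alpha positions, each 1 angstrom from the origin
ca_coords = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
rog = radius_of_gyration(ca_coords)
print(rog)
```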
- protflow.utils.metrics.calc_rog_of_pdb(pdb_path, min_dist=0, chain=None)[source]
Calculate the radius of gyration of a protein from a PDB file.
This function loads a protein structure from a PDB file and computes the radius of gyration for the alpha carbon atoms (Cα).
- Parameters:
- Returns:
The calculated radius of gyration of the protein.
- Return type:
Example
>>> from metrics import calc_rog_of_pdb
>>> rog = calc_rog_of_pdb("example.pdb")
>>> print(rog)
- protflow.utils.metrics.calc_sc_tm(input_df, name, ref_col, tm_col)[source]
Calculate self-consistency TM-score in a dataframe.
This function computes the self-consistency TM-score for protein structures and integrates the results into the input dataframe.
- Parameters:
input_df (pd.DataFrame) – A dataframe containing protein structure data.
name (str) – The name of the new column that should hold the self-consistency TM-score.
ref_col (str) – The column in input_df pointing to the reference description or location.
tm_col (str) – The column in input_df pointing to the TM-scores from TMAlign runner.
- Returns:
The input dataframe with the integrated self-consistency TM-score column.
- Return type:
pd.DataFrame
- Raises:
KeyError – If the name column already exists in input_df or if tm_col does not exist in input_df.
ValueError – If ref_col does not point to a description or location column in input_df.
Example
>>> import pandas as pd
>>> from metrics import calc_sc_tm
>>> df = pd.DataFrame({"ref_col": ["A", "B"], "tm_col": [0.9, 0.85]})
>>> updated_df = calc_sc_tm(df, "sc_tm_score", "ref_col", "tm_col")
>>> print(updated_df)
- protflow.utils.metrics.calc_sequence_identity(seq1, seq2)[source]
Calculate sequence identity between two protein sequences.
This function computes the sequence identity by comparing two protein sequences of the same length and determining the proportion of matching amino acids.
- Parameters:
- Returns:
The sequence identity as a fraction of matching amino acids.
- Return type:
- Raises:
ValueError – If the input sequences are not of the same length.
Example
>>> from metrics import calc_sequence_identity
>>> identity = calc_sequence_identity("ACDEFG", "ACDQFG")
>>> print(identity)
0.8333333333333334
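The identity computation described above amounts to counting matching positions and dividing by the sequence length. A minimal sketch, with the hypothetical helper name sequence_identity and no claim to match the library's implementation:

```python
def sequence_identity(seq1, seq2):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length.")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)

identity = sequence_identity("ACDEFG", "ACDQFG")
print(identity)  # 5 of 6 positions match
```

Because no alignment is performed, an insertion or deletion shifts every downstream position; sequences of different lengths are therefore rejected rather than compared.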
- protflow.utils.metrics.count_mutations(wt, variant)[source]
Compares two protein sequences and counts the number of mutations, returning both the count and a detailed list of mutations.
Each mutation is represented in the format: ‘[original amino acid][position][mutated amino acid]’.
Parameters: wt (str): The first protein sequence (e.g., wild type). variant (str): The second protein sequence (e.g., variant).
Returns: tuple[int, list[str]]: A tuple where the first element is an integer representing the number of mutations, and the second element is a list of strings detailing each mutation.
Raises: ValueError: If the input sequences are not of the same length.
Example:
>>> count_mutations("ACDEFG", "ACDQFG")
(1, ['E4Q'])
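The mutation format described above (original residue, 1-based position, mutated residue) can be produced with a single pass over both sequences. The sketch below uses the hypothetical name list_mutations and is an illustration of the format, not the library's code.

```python
def list_mutations(wt, variant):
    """Return (count, labels) for positions where two equal-length sequences differ."""
    if len(wt) != len(variant):
        raise ValueError("Sequences must have the same length.")
    mutations = [
        f"{a}{pos}{b}"  # e.g. 'E4Q': E at position 4 replaced by Q
        for pos, (a, b) in enumerate(zip(wt, variant), start=1)
        if a != b
    ]
    return len(mutations), mutations

result = list_mutations("ACDEFG", "ACDQFG")
print(result)  # (1, ['E4Q'])
```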
- protflow.utils.metrics.entropy(prob_distribution, axis=-1)[source]
Compute element-wise Shannon entropy H(p) = –∑ p·log₂ p along the given axis, safely ignoring any p == 0 terms.
- Parameters:
prob_distribution (np.array) – An array representing the probability distribution.
axis (int) – The axis along which the entropy is computed. Default is -1.
- Returns:
The calculated entropies of the probability distribution.
- Return type:
np.ndarray
Example
>>> import numpy as np
>>> from metrics import entropy
>>> prob_dist = np.array([0.1, 0.2, 0.7])
>>> ent = entropy(prob_dist)
>>> print(ent)
1.1567796494470395
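The formula H(p) = –∑ p·log₂ p, with zero-probability terms skipped, can be checked by hand with a pure-Python sketch. The helper shannon_entropy is hypothetical and handles a single one-dimensional distribution only; the axis handling of the NumPy-based function is not modeled.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of one probability distribution, skipping p == 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely outcomes
h_certain = shannon_entropy([1.0, 0.0])                # a certain outcome carries no entropy
print(h_uniform, h_certain)
```

Skipping p == 0 avoids evaluating log₂ 0 and matches the convention that 0·log 0 is taken as 0.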
- protflow.utils.metrics.get_mutation_indeces(wt, variant)[source]
Find the indices of mutations between two sequences. Can be protein, or nucleic acid sequences.
Parameters: - wt (str): The wild-type sequence. - variant (str): The variant sequence.
Returns: - list[int]: A list of indices where mutations occur (1-based index).
Raises: - ValueError: If the lengths of ‘wt’ and ‘variant’ sequences are not the same.
Description: This function takes two sequences, ‘wt’ (wild-type) and ‘variant’, and returns a list of indices where mutations occur (i.e., where the two sequences differ). The indices are 1-based. If the lengths of ‘wt’ and ‘variant’ are not the same, a ValueError is raised.
Example:
>>> wt_sequence = "ACGTAGCT"
>>> variant_sequence = "ACCTAGCT"
>>> mutations = get_mutation_indeces(wt_sequence, variant_sequence)
>>> print(mutations)
[3]
protflow.utils.plotting module
Plotting Module
This module provides functionality for creating various standardized plots, primarily focusing on violin plots. It offers tools to generate, customize, and save plots in a structured and automated manner, making it ideal for data visualization within scientific and analytical workflows.
Detailed Description
The PlottingTrajectory class encapsulates the functionality necessary to generate violin plots. It manages the configuration of plot parameters, handles the addition of data, and executes the plotting processes. The class includes methods for customizing the appearance of plots, ensuring the resulting visualizations are both informative and aesthetically pleasing. Additionally, standalone functions for creating scatter plots, sequence logos, and other specific plot types are provided.
The module is designed to streamline the creation and customization of plots, supporting automatic setup of plot parameters, execution of plotting commands, and saving of output files in various formats. This facilitates subsequent data analysis and presentation steps.
Usage
To use this module, create an instance of the PlottingTrajectory class and invoke its methods with appropriate parameters. The module handles the configuration, execution, and result collection processes. Detailed control over the plotting process is provided through various parameters, allowing for customized visualizations tailored to specific research needs.
Examples
Here is an example of how to initialize and use the PlottingTrajectory class:
from plotting import PlottingTrajectory
# Initialize the PlottingTrajectory class
plotter = PlottingTrajectory(y_label="Value", location="output/violin_plot.png")
# Add data to the plot
plotter.add([1, 2, 3, 4, 5], label="Sample 1")
plotter.add([2, 3, 4, 5, 6], label="Sample 2")
# Generate and save the violin plot
plotter.violin_plot()
Standalone function usage:
from plotting import scatterplot
# Create a scatter plot from a DataFrame
scatterplot(dataframe=df, x_column='x_data', y_column='y_data', out_path='output/scatter_plot.png')
Further Details
Edge Cases: The module handles various edge cases, such as empty data lists and invalid plot parameters. It ensures robust error handling and logging for easier debugging and verification of the plotting process.
Customizability: Users can customize plots through multiple parameters, including colormap selection, axis labels, and plot dimensions.
Integration: The module seamlessly integrates with other components of data analysis workflows, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to create detailed and customizable plots for their data analysis and presentation needs. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
Notes
This module is designed to work independently or in tandem with other components of data analysis packages, particularly those related to data visualization and presentation.
Authors
Markus Braun, Adrian Tripp
Version
1.0.0
- class protflow.utils.plotting.PlottingTrajectory(y_label, location, title='Refinement Trajectory', dims=None)[source]
Bases:
object
- protflow.utils.plotting.check_for_col_in_df(col, df)[source]
Checks whether col is a column of df and, if it is not, reports similarly named columns.
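One way to suggest similar column names is fuzzy string matching from the standard library. The sketch below is an assumption about how such a check could work, not the module's implementation: the helper check_column is hypothetical, it takes a plain list of column names instead of a DataFrame, and difflib is a choice made for this example.

```python
import difflib

def check_column(col, columns):
    """Raise KeyError with close matches if `col` is not among `columns`."""
    if col in columns:
        return
    similar = difflib.get_close_matches(col, columns)
    raise KeyError(f"Column {col!r} not found. Similar columns: {similar}")

columns = ["plddt_mean", "rmsd", "tm_score"]  # made-up column names
try:
    check_column("pldt_mean", columns)  # misspelled on purpose
    message = ""
except KeyError as err:
    message = str(err)
print(message)
```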
- protflow.utils.plotting.import_fasta(fasta)[source]
Import Sequences from a Fasta File
This function imports sequences from a fasta file and returns them as a dictionary. Each key-value pair in the dictionary represents a sequence identifier and its corresponding sequence.
- Parameters:
fasta (str) – The file path of the fasta file to be imported.
Detailed Description
The import_fasta function reads a fasta file and parses the sequences into a dictionary format. This function is useful for loading sequences into a program for further analysis or manipulation.
The function:
- Opens the specified fasta file for reading.
- Parses the file content, extracting sequence identifiers and sequences.
- Returns a dictionary where keys are sequence identifiers and values are sequences.
Examples
Here is an example of how to use the import_fasta function:
from plotting import import_fasta
# Define the input path
fasta = 'input/sequences.fasta'
# Import the sequences from the fasta file
seq_dict = import_fasta(fasta=fasta)
# Print the imported sequences
print(seq_dict)
Further Details
Edge Cases: The function handles fasta files with varying sequence lengths and multiple sequences, ensuring all sequences are parsed correctly.
Customizability: Users can specify any valid file path for the input fasta file.
Integration: The function can be used as part of larger workflows involving sequence analysis and data import.
This function is intended for researchers and developers who need to import sequences from a fasta file for analysis or manipulation.
- Parameters:
fasta (str)
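The parsing steps listed above can be sketched with the standard library alone. The helper parse_fasta is hypothetical and reads from any iterable of lines, so the example runs without a file on disk; using the full header line (without ">") as the key is an assumption, since the documentation does not say how identifiers are derived.

```python
import io

def parse_fasta(handle):
    """Parse FASTA-formatted lines into {identifier: sequence}."""
    sequences = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:]  # full header as key (assumption)
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line  # join wrapped sequence lines
    return sequences

fasta_text = ">seq1\nMKT\nAYI\n>seq2\nGSHM\n"
seq_dict = parse_fasta(io.StringIO(fasta_text))
print(seq_dict)  # {'seq1': 'MKTAYI', 'seq2': 'GSHM'}
```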
- protflow.utils.plotting.parse_cols_for_plotting(plot_arg, subst=None)[source]
Parse Columns for Plotting
This function processes the input argument to determine which columns should be used for plotting. It supports different input types and returns a list of column names.
- Parameters:
plot_arg (str) – The argument indicating which columns to parse. It can be a string, list of strings, or a boolean.
subst (str, optional) – A substitute string to use if plot_arg is a boolean set to True. Defaults to None.
Detailed Description
The parse_cols_for_plotting function is designed to handle various input formats for specifying columns to be used in plotting functions. It ensures that the returned value is always a list of strings, which can then be used to access the appropriate columns in a DataFrame.
The function:
- Converts a single string argument into a list containing that string.
- Validates and returns a list of strings if the input is already a list.
- Substitutes the provided string if plot_arg is set to True.
- Raises a TypeError if the input argument type is unsupported.
Examples
Here is an example of how to use the parse_cols_for_plotting function:
from plotting import parse_cols_for_plotting
# Define input arguments
plot_arg_str = "column1"
plot_arg_list = ["column1", "column2"]
plot_arg_bool = True
subst = "default_column"
# Parse columns for plotting
cols_from_str = parse_cols_for_plotting(plot_arg=plot_arg_str)
cols_from_list = parse_cols_for_plotting(plot_arg=plot_arg_list)
cols_from_bool = parse_cols_for_plotting(plot_arg=plot_arg_bool, subst=subst)
print(cols_from_str) # Output: ['column1']
print(cols_from_list) # Output: ['column1', 'column2']
print(cols_from_bool) # Output: ['default_column']
Further Details
Edge Cases: The function handles different input types gracefully, ensuring that a list of strings is always returned.
Customizability: Users can provide a substitute string to use if the input argument is a boolean set to True.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other functions that require column names for plotting.
This function is intended for researchers and developers who need to dynamically determine columns for plotting based on various input formats.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
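The dispatch on input type described for parse_cols_for_plotting can be sketched as below. The helper parse_cols is hypothetical and follows only the four behaviors listed in the description; how the real function treats other values, such as False, is not covered here.

```python
def parse_cols(plot_arg, subst=None):
    """Normalize a column specification to a list of column names."""
    if isinstance(plot_arg, str):
        return [plot_arg]
    if isinstance(plot_arg, list):
        if not all(isinstance(col, str) for col in plot_arg):
            raise TypeError("All entries of plot_arg must be strings.")
        return plot_arg
    if plot_arg is True:
        return [subst]  # fall back to the substitute column name
    raise TypeError(f"Unsupported type for plot_arg: {type(plot_arg).__name__}")

cols_from_str = parse_cols("column1")
cols_from_list = parse_cols(["column1", "column2"])
cols_from_bool = parse_cols(True, subst="default_column")
print(cols_from_str, cols_from_list, cols_from_bool)
```

The string check comes first because a string is itself iterable; testing for it before the list case keeps "column1" from being treated as a sequence of characters.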
- protflow.utils.plotting.scatterplot(dataframe, x_column, y_column, color_column=None, size_column=None, labels=None, title=None, show_corr=False, out_path=None, show_fig=False)[source]
Create a Scatter Plot from a DataFrame
This function generates a scatter plot from specified columns of a Pandas DataFrame. It allows for optional customization of point colors and sizes, as well as plot labels and title. The resulting plot can be saved to a specified file path or displayed.
- Parameters:
dataframe (pd.DataFrame) – A Pandas DataFrame containing the data to be visualized.
x_column (str) – The column name to be used for the x-axis data.
y_column (str) – The column name to be used for the y-axis data.
color_column (str, optional) – The column name to be used for point colors. Defaults to None.
size_column (str, optional) – The column name to be used for point sizes. Defaults to None.
labels (list[str], optional) – A list of labels for the axes and optional color and size legends. Defaults to None.
title (str, optional) – The title of the plot. Defaults to None.
out_path (str, optional) – The file path where the plot should be saved. If not provided, the plot will be displayed without saving.
show_fig (bool, optional) – Whether to display the plot. Defaults to False.
Detailed Description
The scatterplot function creates a scatter plot to visualize relationships between two variables in a Pandas DataFrame. The function supports optional customization of point colors and sizes, allowing for additional dimensions of data to be represented visually. Custom axis labels, plot title, and legends can be provided to enhance the clarity and informativeness of the plot.
The function:
- Validates the presence of specified columns in the DataFrame.
- Generates a scatter plot using the specified x and y columns.
- Optionally colors points based on a specified column.
- Optionally sizes points based on a specified column.
- Sets plot title and axis labels according to the provided parameters.
- Optionally saves the plot to the specified file path.
Examples
Here is an example of how to use the scatterplot function:
from plotting import scatterplot
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'x_data': [1, 2, 3, 4, 5],
    'y_data': [5, 4, 3, 2, 1],
    'color_data': [10, 20, 30, 40, 50],
    'size_data': [100, 200, 300, 400, 500]
})
# Define parameters
x_column = 'x_data'
y_column = 'y_data'
color_column = 'color_data'
size_column = 'size_data'
labels = ["X Axis", "Y Axis", "Color Legend", "Size Legend"]
title = "Sample Scatter Plot"
# Create and save the scatter plot
scatterplot(dataframe=df, x_column=x_column, y_column=y_column, color_column=color_column, size_column=size_column, labels=labels, title=title, out_path="output/scatter_plot.png", show_fig=True)
Further Details
Edge Cases: The function handles missing columns by raising a KeyError and suggesting similar column names.
Customizability: Users can customize the axis labels, plot title, color and size columns, and output path for saving the plot.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other plotting functions and data processing steps.
This function is intended for researchers and developers who need to create detailed and customizable scatter plots for visualizing relationships between variables in a DataFrame.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
- protflow.utils.plotting.sequence_logo(dataframe, input_col, out_path, refseq=None, title=None, resnums=None, units='probability')[source]
Generate a Sequence Logo
This function generates a sequence logo from a column of sequences in a Pandas DataFrame. It allows for customization of the reference sequence, plot title, residue numbers, and units used in the logo. The resulting plot is saved to a specified file path.
- Parameters:
dataframe (pd.DataFrame) – A Pandas DataFrame containing the sequences to be visualized.
input_col (str) – The column name containing the sequences or paths to fasta files.
out_path (str) – The file path where the sequence logo should be saved.
refseq (str, optional) – A reference sequence or a path to a fasta file containing the reference sequence. Defaults to None.
title (str, optional) – The title of the sequence logo. Defaults to None.
resnums (list, optional) – A list of integers specifying residue positions to include in the logo. Defaults to None.
units (str, optional) – The units used in the sequence logo, either "probability" or "bits". Defaults to "probability".
Detailed Description
The sequence_logo function creates a sequence logo to visualize the conservation and variability of sequences. The function supports extensive customization options, including specifying a reference sequence, selecting specific residue positions, and choosing the units for the logo. It ensures that the resulting logo is clearly labeled and informative, providing insights into sequence conservation.
The function:
- Validates the presence of the specified input column in the DataFrame.
- Prepares input sequences by handling both direct sequences and paths to fasta files.
- Generates a sequence logo using the weblogo library.
- Customizes the logo with the specified title, reference sequence, residue positions, and units.
- Saves the sequence logo to the specified file path.
Examples
Here is an example of how to use the sequence_logo function:
from plotting import sequence_logo
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'sequences': ['ATGCGT', 'ATGCGC', 'ATGCGG']
})
# Define parameters
input_col = 'sequences'
out_path = 'output/sequence_logo.eps'
refseq = 'ATGCGT'
title = 'Sample Sequence Logo'
resnums = [1, 2, 3, 4, 5, 6]
units = 'bits'
# Generate and save the sequence logo
sequence_logo(dataframe=df, input_col=input_col, out_path=out_path, refseq=refseq, title=title, resnums=resnums, units=units)
Further Details
Edge Cases: The function handles different input formats for sequences, including direct sequences and paths to fasta files. It also checks for consistent residue lengths.
Customizability: Users can customize the reference sequence, residue positions, plot title, and units for the sequence logo.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other functions for sequence analysis and visualization.
This function is intended for researchers and developers who need to create detailed and customizable sequence logos for visualizing sequence conservation and variability.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
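To make the two unit options concrete, the sketch below computes per-position residue probabilities and a simple information content in bits for a set of equal-length sequences. It is an illustration only: the helpers position_probabilities and information_bits are hypothetical, the logo itself is drawn by the weblogo library, and the information content here is the plain log2(alphabet size) minus the column entropy, without the small-sample correction that logo tools commonly apply.

```python
import math
from collections import Counter

def position_probabilities(seqs):
    """Per-position residue frequencies for equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("All sequences must have the same length.")
    probs = []
    for column in zip(*seqs):  # iterate over alignment columns
        counts = Counter(column)
        probs.append({res: n / len(seqs) for res, n in counts.items()})
    return probs

def information_bits(col_probs, alphabet_size):
    """log2(alphabet size) minus the Shannon entropy of one column."""
    entropy = sum(-p * math.log2(p) for p in col_probs.values())
    return math.log2(alphabet_size) - entropy

seqs = ["ATGCGT", "ATGCGC", "ATGCGG"]
probs = position_probabilities(seqs)
conserved_bits = information_bits(probs[0], alphabet_size=4)  # fully conserved column
variable_bits = information_bits(probs[5], alphabet_size=4)   # three residues, one each
print(probs[0], conserved_bits, variable_bits)
```

A fully conserved nucleotide column reaches the maximum of 2 bits, while a column split evenly among three residues carries much less information.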
- protflow.utils.plotting.singular_violinplot(data, y_label, title, out_path=None, show_fig=False)[source]
Create a Singular Violin Plot
This function generates a singular violin plot from a provided list of data points. It allows for the customization of the y-axis label and plot title, and optionally saves the resulting plot to a specified file path.
- Parameters:
data (list) – A list of numerical data points to be visualized in the violin plot.
y_label (str) – The label for the y-axis of the plot.
title (str) – The title of the plot.
out_path (str, optional) – The file path where the plot should be saved. If not provided, the plot will be displayed without saving.
Detailed Description
The singular_violinplot function creates a violin plot to visualize the distribution of a single set of data points. The plot includes median, quartiles, and range indicators, providing a comprehensive view of the data distribution. The function leverages Matplotlib for plot creation and supports customization of several plot attributes to enhance the visual representation of the data.
The function:
- Generates a violin plot for the provided data.
- Sets the plot title and y-axis label according to the provided parameters.
- Highlights the median, quartiles, and range of the data.
- Optionally saves the plot to the specified file path.
Examples
Here is an example of how to use the singular_violinplot function:
from plotting import singular_violinplot
# Define data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Create a violin plot
singular_violinplot(data=data, y_label="Value", title="Sample Violin Plot", out_path="output/violin_plot.png")
Further Details
Edge Cases: The function handles empty data lists by not attempting to plot and logging a message.
Customizability: Users can customize the y-axis label, plot title, and output path for saving the plot.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other plotting functions and data processing steps.
This function is intended for researchers and developers who need to create detailed and customizable violin plots for their data analysis and presentation needs.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
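The median, quartile, and range markers mentioned in the description correspond to a five-number summary of the data. The sketch below computes one with the standard library; the helper violin_summary is hypothetical, and the "inclusive" quartile method is an assumption, since the interpolation used by the plotting code is not stated.

```python
import statistics

def violin_summary(data):
    """Five-number summary: minimum, quartiles, and maximum of the data."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}

summary = violin_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(summary)
```

Different quartile conventions give slightly different values on small samples, so the markers on a plot may not match this summary exactly.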
- protflow.utils.plotting.violinplot_multiple_cols(dataframe, cols, y_labels, titles=None, dims=None, out_path=None, show_fig=True)[source]
Create Multiple Violin Plots from DataFrame Columns
This function generates multiple violin plots from specified columns of a single Pandas DataFrame. It allows for customization of axis labels, plot titles, and plot dimensions, and optionally saves the resulting plots to a specified file path.
- Parameters:
dataframe (pd.DataFrame) – A Pandas DataFrame containing the data to be visualized.
cols (list[str]) – A list of column names from the DataFrame to be visualized in the violin plots.
y_labels (list[str]) – A list of labels for the y-axes of the plots.
titles (list[str], optional) – A list of titles for each plot. If not provided, plots will not have titles.
dims (list[tuple[int, int]], optional) – A list of tuples specifying the y-axis limits for each plot. If not provided, the y-axis limits will be determined automatically.
out_path (str, optional) – The file path where the plots should be saved. If not provided, the plots will be displayed without saving.
show_fig (bool, optional) – Whether to display the plot. Defaults to True.
Detailed Description
The violinplot_multiple_cols function creates multiple violin plots to visualize the distributions of specified columns from a single Pandas DataFrame. The function supports extensive customization options, including axis labels, plot titles, and plot dimensions. It ensures that each plot is clearly labeled and informative, providing a comprehensive view of the data distribution.
The function:
- Validates the presence of specified columns in the DataFrame.
- Generates violin plots for each specified column.
- Sets plot titles, y-axis labels, and y-axis limits according to the provided parameters.
- Optionally saves the plots to the specified file path.
Examples
Here is an example of how to use the violinplot_multiple_cols function:
from plotting import violinplot_multiple_cols
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
# Define parameters
cols = ["col1", "col2", "col3"]
y_labels = ["Value 1", "Value 2", "Value 3"]
titles = ["Plot 1", "Plot 2", "Plot 3"]
# Create and save the violin plots
violinplot_multiple_cols(dataframe=df, cols=cols, y_labels=y_labels, titles=titles, out_path="output/violin_plots.png")
Further Details
Edge Cases: The function handles missing columns by raising a KeyError and suggesting similar column names.
Customizability: Users can customize the y-axis labels, plot titles, and output path for saving the plots.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other plotting functions and data processing steps.
This function is intended for researchers and developers who need to create detailed and customizable violin plots for visualizing data from a single DataFrame.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
- protflow.utils.plotting.violinplot_multiple_cols_dfs(dfs, df_names, cols, y_labels, titles=None, dims=None, out_path=None, colormap='tab20', show_fig=True)[source]
Create Multiple Violin Plots from DataFrame Columns
This function generates multiple violin plots from specified columns of multiple Pandas DataFrames. It allows for customization of axis labels, plot titles, and plot dimensions, and optionally saves the resulting plots to a specified file path.
- param dfs (list[pd.DataFrame]):
A list of Pandas DataFrames containing the data to be visualized.
- param df_names (list[str]):
A list of names corresponding to each DataFrame, used for labeling in the legend.
- param cols (list[str]):
A list of column names from the DataFrames to be visualized in the violin plots.
- param y_labels (list[str]):
A list of labels for the y-axes of the plots.
- param titles (list[str], optional):
A list of titles for each plot. If not provided, plots will not have titles.
- param dims (list[tuple[float, float]], optional):
A list of tuples specifying the y-axis limits for each plot. If not provided, the y-axis limits will be determined automatically.
- param out_path (str, optional):
The file path where the plots should be saved. If not provided, the plots will be displayed without saving.
- param colormap (str, optional):
The colormap to be used for coloring the plots. Defaults to "tab20".
- param show_fig (bool, optional):
Whether to display the plot. Defaults to True.
Detailed Description
The violinplot_multiple_cols_dfs function creates multiple violin plots to visualize the distributions of specified columns from multiple Pandas DataFrames. The function supports extensive customization options, including axis labels, plot titles, colormap selection, and plot dimensions. It is designed to handle the intricacies of plotting data from different DataFrames, ensuring that each plot is clearly labeled and informative.
The function:
- Validates the presence of specified columns in the DataFrames.
- Generates violin plots for each specified column across the DataFrames.
- Sets plot titles, y-axis labels, and y-axis limits according to the provided parameters.
- Applies a consistent colormap to enhance visual distinction between different DataFrames.
- Optionally saves the plots to the specified file path.
Examples
Here is an example of how to use the violinplot_multiple_cols_dfs function:
from protflow.utils.plotting import violinplot_multiple_cols_dfs
import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df2 = pd.DataFrame({'col1': [3, 4, 5], 'col2': [6, 7, 8]})

# Define parameters
dfs = [df1, df2]
df_names = ["DataFrame 1", "DataFrame 2"]
cols = ["col1", "col2"]
y_labels = ["Value 1", "Value 2"]
titles = ["Plot 1", "Plot 2"]

# Create and save the violin plots
violinplot_multiple_cols_dfs(dfs=dfs, df_names=df_names, cols=cols, y_labels=y_labels, titles=titles, out_path="output/violin_plots.png")
Further Details
Edge Cases: The function handles missing columns by raising a KeyError and suggesting similar column names.
Customizability: Users can customize the y-axis labels, plot titles, colormap, and output path for saving the plots.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other plotting functions and data processing steps.
This function is intended for researchers and developers who need to create detailed and customizable violin plots for comparing data across multiple DataFrames.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
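The missing-column behaviour noted under Edge Cases (a KeyError that suggests similar column names) could be approximated as in the sketch below. It uses the standard-library difflib module. The helper name check_columns and the wording of the error message are assumptions for illustration, not ProtFlow's actual implementation.

```python
import difflib


def check_columns(available, requested):
    """Raise KeyError naming close matches for the first missing column.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    available = list(available)
    for col in requested:
        if col not in available:
            # Look up to three column names that resemble the missing one.
            similar = difflib.get_close_matches(col, available, n=3, cutoff=0.6)
            hint = f" Did you mean: {', '.join(similar)}?" if similar else ""
            raise KeyError(f"Column '{col}' not found.{hint}")


# Check the column names of every DataFrame before plotting.
frames = {"DataFrame 1": ["col1", "col2"], "DataFrame 2": ["col1", "col2"]}
for name, columns in frames.items():
    check_columns(columns, ["col1", "col2"])  # passes silently

# A misspelled column triggers the error with suggestions.
try:
    check_columns(["col1", "col2"], ["col_1"])
except KeyError as err:
    message = str(err)
```

With real DataFrames, the first argument would be df.columns for each DataFrame in dfs.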
- protflow.utils.plotting.violinplot_multiple_lists(lists, titles, y_labels, dims=None, out_path=None, show_fig=True)[source]
Create Multiple Violin Plots from Lists of Data
This function generates multiple violin plots from specified lists of numerical data. It allows for customization of axis labels, plot titles, and plot dimensions, and optionally saves the resulting plots to a specified file path.
- param lists (list[list[float]]):
A list of lists, where each inner list contains numerical data points to be visualized.
- param titles (list[str]):
A list of titles for each plot.
- param y_labels (list[str]):
A list of labels for the y-axes of the plots.
- param dims (list[tuple[float, float]], optional):
A list of tuples specifying the y-axis limits for each plot. If not provided, the y-axis limits will be determined automatically.
- param out_path (str, optional):
The file path where the plots should be saved. If not provided, the plots will be displayed without saving.
- param show_fig (bool, optional):
Whether to display the plot. Defaults to True.
Detailed Description
The violinplot_multiple_lists function creates multiple violin plots to visualize the distributions of specified lists of numerical data. The function supports extensive customization options, including axis labels, plot titles, and plot dimensions. It ensures that each plot is clearly labeled and informative, providing a comprehensive view of the data distribution.
The function:
- Generates violin plots for each specified list of data.
- Sets plot titles, y-axis labels, and y-axis limits according to the provided parameters.
- Optionally saves the plots to the specified file path.
Examples
Here is an example of how to use the violinplot_multiple_lists function:
from protflow.utils.plotting import violinplot_multiple_lists

# Define data
data1 = [1, 2, 3, 4, 5]
data2 = [2, 3, 4, 5, 6]
data3 = [3, 4, 5, 6, 7]

# Define parameters
lists = [data1, data2, data3]
titles = ["Plot 1", "Plot 2", "Plot 3"]
y_labels = ["Value 1", "Value 2", "Value 3"]

# Create and save the violin plots
violinplot_multiple_lists(lists=lists, titles=titles, y_labels=y_labels, out_path="output/violin_plots.png")
Further Details
Edge Cases: The function handles empty data lists by not attempting to plot and logging a message.
Customizability: Users can customize the y-axis labels, plot titles, and output path for saving the plots.
Integration: The function can be used as part of larger data analysis workflows, integrating seamlessly with other plotting functions and data processing steps.
This function is intended for researchers and developers who need to create detailed and customizable violin plots for visualizing multiple lists of numerical data.
Notes
This function is part of the Plotting module and is designed to work in tandem with other plotting functions provided in the module.
- protflow.utils.plotting.write_fasta(seq_dict, fasta)[source]
Write Sequences to a Fasta File
This function writes a dictionary of sequences to a fasta file. Each key-value pair in the dictionary represents a sequence identifier and its corresponding sequence.
- param seq_dict (dict):
A dictionary where keys are sequence identifiers and values are sequences.
- param fasta (str):
The file path where the fasta file should be saved.
Detailed Description
The write_fasta function is designed to facilitate the creation of fasta files from a dictionary of sequences. This function iterates over the dictionary and writes each sequence to the specified file in the standard fasta format.
The function:
- Opens the specified file for writing.
- Iterates over the dictionary, writing each sequence identifier and sequence to the file.
- Ensures that the file is saved in the correct fasta format.
Examples
Here is an example of how to use the write_fasta function:
from protflow.utils.plotting import write_fasta

# Define a sequence dictionary
seq_dict = {
    'seq1': 'ATGCGT',
    'seq2': 'ATGCGC',
    'seq3': 'ATGCGG'
}

# Define the output path
fasta = 'output/sequences.fasta'

# Write the sequences to the fasta file
write_fasta(seq_dict=seq_dict, fasta=fasta)
Further Details
Edge Cases: The function handles dictionaries with varying sequence lengths, ensuring each sequence is written correctly.
Customizability: Users can specify any valid file path for the output fasta file.
Integration: The function can be used as part of larger workflows involving sequence analysis and data export.
This function is intended for researchers and developers who need to export sequences to a fasta file for further analysis or sharing.
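The write loop outlined in the Detailed Description (open the file, iterate over the dictionary, emit one header line and one sequence line per entry) can be sketched roughly as follows. The name write_fasta_sketch is hypothetical, and writing each sequence on a single unwrapped line is an assumption; the real function may differ.

```python
import os
import tempfile


def write_fasta_sketch(seq_dict, fasta):
    """Write {identifier: sequence} pairs to `fasta` in FASTA format.

    Illustrative only; assumes one unwrapped sequence line per record.
    """
    with open(fasta, "w", encoding="UTF-8") as handle:
        for name, seq in seq_dict.items():
            # A FASTA record is a '>' header line followed by the sequence.
            handle.write(f">{name}\n{seq}\n")


# Write two records to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sequences.fasta")
    write_fasta_sketch({"seq1": "ATGCGT", "seq2": "ATGCGC"}, path)
    with open(path, encoding="UTF-8") as handle:
        written = handle.read()
```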
protflow.utils.pymol_tools module
Module containing functions for working with PyMOL and writing PyMOL scripts.
protflow.utils.utils module
General Utility Functions for ProtFlow
This module provides a collection of general utility functions designed to support various operations within the ProtFlow package. These utilities include functions for parsing data files, calculating molecular interactions, and other common tasks needed in bioinformatics and structural biology workflows.
Examples
Here is an example of how to use the parse_fasta_to_dict function:
from protflow.utils.utils import parse_fasta_to_dict

# Parse a FASTA file
fasta_dict = parse_fasta_to_dict('example.fasta')
for desc, seq in fasta_dict.items():
    print(f"{desc}: {seq}")
This module is designed to provide essential utilities for common tasks encountered in bioinformatics and structural biology, facilitating the development of more complex workflows within the ProtFlow package.
Authors
Markus Braun, Adrian Tripp
- protflow.utils.utils.add_group_statistics(df, group_col, prefix, statistics=('min', 'mean', 'median', 'max', 'std'))[source]
Add group-based statistical features to the DataFrame for numeric columns only.
This function groups the DataFrame by the specified group_col and computes the specified statistics (default: min, mean, median, max, std) for all numeric columns whose names start with the given prefix. The computed statistics exclude NaN values and are merged back into the original DataFrame, with new columns named <original_column>_<statistic>.
- Parameters:
df (pandas.DataFrame) – The input DataFrame to process.
group_col (str) – The name of the column to group by.
prefix (str) – Only numeric columns whose names start with this prefix will be considered.
statistics (list of str, optional) – Statistical functions to apply. Supported values include min, mean, median, max, std, sum, count. Defaults to [min, mean, median, max, std].
- Returns:
A new DataFrame containing the original data plus one new column per statistic for each selected column. New columns use the format <original_column>_<statistic>.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If group_col is not in df.
ValueError – If no numeric columns match prefix.
ValueError – If any entry of statistics is not supported by pandas.
Examples
>>> import pandas as pd
>>> data = {
...     "group_col": ["A", "A", "B", "B", "B"],
...     "start_str1": [10, 20, 30, 40, 50],
...     "start_str2": [5, 15, 25, 35, 45],
...     "start_str3": ["x", "y", "z", "w", "v"],
...     "other_col": [100, 200, 300, 400, 500],
... }
>>> df = pd.DataFrame(data)
>>> result = add_group_statistics(
...     df,
...     group_col="group_col",
...     prefix="start_str",
...     statistics=["min", "mean", "max"]
... )
>>> print(result)
   group_col  start_str1  start_str2  other_col  start_str1_min  start_str1_mean  start_str1_max  start_str2_min  start_str2_mean  start_str2_max
0          A          10           5        100              10             15.0              20               5             10.0              15
1          A          20          15        200              10             15.0              20               5             10.0              15
2          B          30          25        300              30             40.0              50              25             35.0              45
3          B          40          35        400              30             40.0              50              25             35.0              45
4          B          50          45        500              30             40.0              50              25             35.0              45
Notes
Only numeric columns beginning with prefix are included; others are ignored.
NaN values are dropped before computing each statistic.
If any new column name collides with an existing one, it will overwrite it.
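One way to obtain the behaviour described above is a pandas groupby followed by transform, which broadcasts each per-group value back onto every row of that group. The sketch below is an assumption about how such a function could be written, not ProtFlow's actual code. The name group_statistics_sketch is hypothetical, and this version keeps non-numeric columns in the result unchanged.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def group_statistics_sketch(df, group_col, prefix, statistics=("min", "mean", "max")):
    """Add <column>_<statistic> columns computed per group (illustrative)."""
    if group_col not in df.columns:
        raise ValueError(f"{group_col!r} is not a column of df")
    # Select numeric columns whose names start with the prefix.
    cols = [
        c for c in df.columns
        if c != group_col and str(c).startswith(prefix) and is_numeric_dtype(df[c])
    ]
    if not cols:
        raise ValueError(f"no numeric columns start with {prefix!r}")
    out = df.copy()
    grouped = df.groupby(group_col)
    for col in cols:
        for stat in statistics:
            # transform returns a Series aligned with the original index;
            # pandas skips NaN values for these reductions by default.
            out[f"{col}_{stat}"] = grouped[col].transform(stat)
    return out


example = pd.DataFrame({
    "group_col": ["A", "A", "B"],
    "start_a": [1, 3, 10],
    "start_b": ["x", "y", "z"],  # non-numeric, so no statistics are added
})
result = group_statistics_sketch(example, "group_col", "start_", ["min", "mean"])
```

Because transform preserves the row index, no explicit merge step is needed in this variant.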
- protflow.utils.utils.parse_fasta_to_dict(fasta_path, encoding='UTF-8')[source]
Parses a FASTA file, converting it into a dictionary mapping sequence descriptions to sequences.
This function opens and reads a FASTA file from the given path, then parses the contents to create a dictionary. Each entry in the FASTA file should start with a ‘>’ character, followed by the description line. The subsequent lines until the next ‘>’ character are considered as the sequence associated with that description. The sequence is concatenated into a single string if it spans multiple lines.
- Parameters:
fasta_path (str) – The file path to the FASTA file that needs to be parsed. The path should be a valid path to a file that exists and is readable. If the file cannot be found or opened, a FileNotFoundError will be raised.
encoding (str, optional) – The character encoding of the FASTA file. This is useful for files that might have been created in non-UTF-8 encoding. Defaults to "UTF-8".
- Returns:
A dictionary where the keys are the descriptions of sequences (without the ‘>’ character), and the values are the sequences themselves. Sequences that span multiple lines in the FASTA file are concatenated into a single string.
- Return type:
dict[str,str]
Examples
Assuming we have a FASTA file example.fasta with the following content:
>seq1
AGTCAGTC
>seq2
GTCAACGT
Parsing this file:
>>> fasta_dict = parse_fasta_to_dict('example.fasta')
>>> fasta_dict['seq1']
'AGTCAGTC'
>>> fasta_dict['seq2']
'GTCAACGT'
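The parsing rule described above (a line starting with '>' opens a new record, and the following lines are concatenated into its sequence) can be illustrated with a minimal standard-library sketch. The name parse_fasta_sketch is hypothetical. Skipping blank lines and ignoring text before the first header are assumptions, not documented behaviour of parse_fasta_to_dict.

```python
import os
import tempfile


def parse_fasta_sketch(fasta_path, encoding="UTF-8"):
    """Map each FASTA description (without '>') to its full sequence."""
    records = {}
    description = None
    with open(fasta_path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            if line.startswith(">"):
                # Start a new record keyed by the description line.
                description = line[1:]
                records[description] = ""
            elif description is not None:
                # Join sequence lines that belong to the current record.
                records[description] += line
    return records


# seq1 spans two lines in the file and is joined into one string.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w", encoding="UTF-8") as handle:
        handle.write(">seq1\nAGTC\nAGTC\n>seq2\nGTCAACGT\n")
    parsed = parse_fasta_sketch(path)
```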
- protflow.utils.utils.sequence_dict_to_fasta(seq_dict, out_path, combined_filename=None)[source]
Writes protein sequences stored in seq_dict ({'description': seq, ...}) to .fa files. If combined_filename is specified, all sequences are written into one file.
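The two output modes can be sketched as below. Several details are assumptions because the documentation does not state them: that out_path is a directory, that per-sequence files are named <description>.fa, that the combined file is placed inside out_path, and that the function returns the written paths. The name sequence_dict_to_fasta_sketch is hypothetical.

```python
import os
import tempfile


def sequence_dict_to_fasta_sketch(seq_dict, out_path, combined_filename=None):
    """Write one .fa file per sequence, or a single combined file."""
    os.makedirs(out_path, exist_ok=True)
    if combined_filename:
        # Combined mode: all records go into one file.
        target = os.path.join(out_path, combined_filename)
        with open(target, "w", encoding="UTF-8") as handle:
            for description, seq in seq_dict.items():
                handle.write(f">{description}\n{seq}\n")
        return [target]
    written = []
    for description, seq in seq_dict.items():
        # Separate mode: assumed naming scheme <description>.fa
        target = os.path.join(out_path, f"{description}.fa")
        with open(target, "w", encoding="UTF-8") as handle:
            handle.write(f">{description}\n{seq}\n")
        written.append(target)
    return written


seqs = {"a": "MKT", "b": "MVL"}
with tempfile.TemporaryDirectory() as tmp:
    separate = sequence_dict_to_fasta_sketch(seqs, tmp)
    separate_names = sorted(os.path.basename(p) for p in separate)
    combined = sequence_dict_to_fasta_sketch(seqs, tmp, "all.fa")
    with open(combined[0], encoding="UTF-8") as handle:
        combined_text = handle.read()
```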
- protflow.utils.utils.vdw_radii()[source]
Data taken from https://en.wikipedia.org/wiki/Atomic_radii_of_the_elements_(data_page), accessed 30.1.2023.
Module contents
Init of protflow.utils