protflow.tools package

Submodules

protflow.tools.boltz module

ProtFlow runner for Boltz.

This module provides a high-level Boltz runner that: (1) prepares Boltz-compatible YAML inputs from sequences or structures, (2) composes command lines from global and pose-specific options, (3) distributes inference across available cores via a JobStarter, and (4) aggregates Boltz outputs (confidence, affinity, NPZ artifacts) into a single score table for downstream orchestration.

The typical workflow is:

  1. Ensure paths and environment hooks for Boltz are configured (see Notes on BOLTZ_PATH, BOLTZ_PYTHON, BOLTZ_PRE_CMD).

  2. Provide inputs as a Poses collection (FASTA, PDB/CIF, or already Boltz-formatted YAML). If needed, convert to YAML with convert_poses_to_boltz_yaml.

  3. Call Boltz.run(…) with command-line options and optional pose_options to fan-out runs.

  4. Consume the returned Poses object whose .df is augmented with a per-model score table and file locations of produced artifacts.

Notes

  • Configuration keys The runner reads its defaults from ProtFlow’s config via: BOLTZ_PATH (path to the boltz CLI entry point or module), BOLTZ_PYTHON (interpreter used to invoke Boltz), and BOLTZ_PRE_CMD (shell prefix such as environment activation). Use protflow.config utilities to set these once per environment.

  • MSA handling Boltz can run with an empty MSA or fetch MSAs from a server. The runner exposes msa_setting to steer YAML content ("empty" vs "server"), while the CLI switch --use_msa_server remains the source of truth for server fetching. See Boltz._parse_msa_setting and convert_chain_seq_dict_to_yaml_dict.
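
The interplay between msa_setting and the CLI switch can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function name resolve_msa_setting is hypothetical, and falling back to "empty" when --use_msa_server is absent is an assumption based on the description of msa_setting.

```python
import shlex


def resolve_msa_setting(options, msa_setting=""):
    """Hypothetical helper: decide which MSA mode the YAML should encode.

    Assumption: an explicit "server"/"empty" wins; otherwise the presence
    of --use_msa_server in the global options decides.
    """
    if msa_setting in ("server", "empty"):
        return msa_setting
    if msa_setting not in ("", None):
        raise ValueError(f"unsupported msa_setting: {msa_setting!r}")
    # Tokenize the option string so that substrings of other flags do not match.
    tokens = shlex.split(options or "")
    return "server" if "--use_msa_server" in tokens else "empty"


print(resolve_msa_setting("--num_samples 4 --use_msa_server"))
print(resolve_msa_setting("--num_samples 4"))
```

Tokenizing the option string instead of doing a plain substring search avoids false positives from flags that merely contain the same text.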

Examples

Run Boltz on a batch of structures, writing outputs to a fresh work directory and collecting scores:

>>> from protflow.tools.boltz import Boltz
>>> from protflow.poses import Poses
>>> poses = Poses(
...     files=["A.pdb", "B.pdb", "C.pdb"],
...     work_dir="work/boltz_demo"
... )
>>> runner = Boltz()  # uses config defaults (BOLTZ_PATH/PYTHON/PRE_CMD)
>>> poses = runner.run(
...     poses=poses,
...     prefix="boltz_run",
...     options="--num_samples 4 --use_msa_server",
...     overwrite=False,
... )
>>> poses.df.columns[:8]  # score columns will include confidence & file paths
...
class protflow.tools.boltz.Boltz(boltz_path=None, boltz_python=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

The Boltz runner prepares inputs (optionally batching by core), assembles Boltz commands, dispatches them via a JobStarter, and aggregates results into a unified score file stored in the run directory.

Parameters:
  • boltz_path (str, optional) – Executable or module path used with predict subcommand. If not provided, loaded from BOLTZ_PATH in the ProtFlow config.

  • boltz_python (str, optional) – Python interpreter used to execute Boltz. Defaults to BOLTZ_PYTHON from the ProtFlow config.

  • pre_cmd (str, optional) – Shell prefix prepended to each command. Use this to activate environments or modules (e.g., conda activate boltz). If omitted, taken from BOLTZ_PRE_CMD in the ProtFlow config.

  • jobstarter (JobStarter, optional) – Default jobstarter to use if none is provided to run().

name

Fixed runner name: “Boltz”.

Type:

str

index_layers

Number of index layers used when merging outputs (defaults to 2).

Type:

int

jobstarter

Optional default jobstarter stored on the runner instance.

Type:

JobStarter or None

boltz_path

Resolved Boltz executable/module path.

Type:

str

boltz_python

Resolved interpreter path.

Type:

str

pre_cmd

Resolved shell prefix (may be empty).

Type:

str

Notes

  • Score caching If a score file already exists for the given prefix and format and overwrite is False (and --override is not present in options), existing results are returned without re-running Boltz.

  • Batching behavior If pose_options are not provided, inputs are automatically split into at most jobstarter.max_cores batches to improve throughput.
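
The batching rule can be illustrated with a small, self-contained sketch. The helper name split_into_batches is hypothetical, and the near-equal contiguous split is an assumption; the runner may distribute inputs differently as long as it produces at most jobstarter.max_cores batches.

```python
def split_into_batches(items, max_batches):
    """Hypothetical helper: split items into at most max_batches contiguous,
    near-equal batches (assumed strategy, not necessarily the runner's)."""
    if max_batches < 1:
        raise ValueError("max_batches must be >= 1")
    n_batches = min(max_batches, len(items))
    if n_batches == 0:
        return []
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


yamls = [f"pose_{i}.yaml" for i in range(7)]
for batch in split_into_batches(yamls, max_batches=3):
    print(batch)
```

With fewer inputs than cores, the sketch creates one batch per input, so no batch is ever empty.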

Examples

Minimal run with default configuration, batched across cores:

>>> runner = Boltz()
>>> poses = runner.run(
...     poses, prefix="demo",
...     options="--num_samples 2 --use_msa_server"
... )
__init__(boltz_path=None, boltz_python=None, pre_cmd=None, jobstarter=None)[source]

Initialize the Boltz runner and resolve configuration.

Parameters:
  • boltz_path (str, optional) – Path to the Boltz program or module (with predict subcommand). Defaults to BOLTZ_PATH from ProtFlow config.

  • boltz_python (str, optional) – Interpreter to call Boltz with. Defaults to BOLTZ_PYTHON.

  • pre_cmd (str, optional) – Shell prefix (e.g., environment activation). Defaults to BOLTZ_PRE_CMD.

  • jobstarter (JobStarter, optional) – Default jobstarter to use when run(jobstarter=None).

Raises:

KeyError – If required configuration keys are missing from the ProtFlow config.

__str__()[source]

String representation.

Returns:

The literal string "Boltz".

Return type:

str

run(poses, prefix, jobstarter=None, options=None, pose_options=None, params=None, overwrite=False, msa_setting='')[source]

Execute Boltz on the given poses and collect results.

The runner prepares inputs (converting to Boltz YAML if needed), resolves MSA behavior, optionally augments pose YAMLs using a provided BoltzParams object, dispatches the commands via JobStarter, then aggregates prediction confidence/affinity scores and artifact paths into a DataFrame saved as {prefix}/{name}_scores.{storage_format}.

Parameters:
  • poses (Poses) – Input poses. Must be a protflow.poses.Poses instance with poses in FASTA, PDB/CIF, or Boltz YAML format; if not YAML, they are converted with convert_poses_to_boltz_yaml.

  • prefix (str) – Run prefix / subdirectory under poses.work_dir. Boltz outputs will be stored in {poses.work_dir}/{prefix}/output.

  • jobstarter (JobStarter, optional) – Overrides the runner’s default jobstarter. The runner uses, in order of precedence: the value provided here, the instance default, and poses.default_jobstarter.

  • options (str, optional) – Global CLI options for Boltz (e.g., "--num_samples 8", "--use_msa_server").

  • pose_options (str or list of str, optional) – Pose-specific option template(s); if provided, disables batching.

  • params (BoltzParams, optional) – If given, used to modify or extend per-pose YAMLs (e.g., sequences, ligands, constraints, templates, properties) before running. Files are emitted under {prefix}/boltz_inputs/.

  • overwrite (bool, optional) – If True (or if --override is present in options), re-run even if a scorefile already exists.

  • msa_setting (str, optional) – One of {"server", "empty", ""}. Empty/None means auto-resolve based on options (presence of --use_msa_server).

Returns:

The original Poses with results merged and indices layered. Artifacts (models, NPZs) are recorded as path columns.

Return type:

Poses

Raises:
  • RuntimeError – If Boltz finishes without producing any scores.

  • TypeError – If inputs cannot be converted to Boltz YAML (unsupported formats).

Examples

Convert PDBs to YAML, add a ligand, and run with 4 samples per pose:

>>> from protflow.tools.boltz import Boltz
>>> from protflow.tools.boltz import BoltzParams
>>> params = BoltzParams()
>>> params.add_ligand(ligand="CC(=O)O", id="LIG", ligand_type="smiles")
>>> runner = Boltz()
>>> poses = runner.run(
...     poses=poses,
...     prefix="boltz_with_ligand",
...     params=params,
...     options="--num_samples 4",
...     overwrite=True
... )

Notes

  • Score caching: if a prior score file exists and neither overwrite nor --override is set, the runner returns cached results to save time.

  • Batching: when pose_options is absent, inputs are partitioned into at most jobstarter.max_cores batch folders to parallelize runs.

  • Artifacts: columns like plddt_location, pae_location, and pde_location point to NPZ files produced by Boltz for each model.

  • Override behavior: the Boltz runner sets overwrite=True if --override is specified in options. The flag is not detected when it is passed via pose_options.
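
The caching and override rules in these notes amount to a small decision function. The sketch below is an illustration under stated assumptions: should_run_boltz is a hypothetical name, and the check only inspects the global options string, mirroring the note that pose_options are not considered.

```python
import shlex
from pathlib import Path


def should_run_boltz(scorefile, overwrite=False, options=None):
    """Hypothetical helper: return True if Boltz must be (re-)run.

    Assumption: cached results are reused only when the score file exists
    and neither overwrite nor --override requests a fresh run.
    """
    tokens = shlex.split(options or "")
    if overwrite or "--override" in tokens:
        return True
    return not Path(scorefile).is_file()


print(should_run_boltz("work/boltz_demo/does_not_exist_scores.json"))
```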

class protflow.tools.boltz.BoltzParams[source]

Bases: object

Builder for per-pose Boltz YAML content.

Collects entries for proteins, nucleic acids, ligands, constraints, templates, and arbitrary properties. Each field value can be provided either as a literal or as a reference to a column in poses.df. Column-referenced values are marked by passing their keys via poses_cols and are resolved at YAML generation time.

Notes

  • Each added entity is stored internally and later rendered into the final YAML structure via generate_yaml_files().

  • For sequence modifications, use a list of dicts with at least {"position": <int>, "ccd": <str>}.
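
As an illustration of the modification format described above, the following sketch validates a list of modification dicts. The function check_modifications is hypothetical and does not claim to reproduce the module's own format check; it only enforces the documented minimum of an integer "position" and a string "ccd".

```python
def check_modifications(modifications):
    """Hypothetical validator for the documented modification format."""
    if modifications is None:
        return
    if not isinstance(modifications, list):
        raise TypeError("modifications must be a list of dicts or None")
    for i, mod in enumerate(modifications):
        if not isinstance(mod, dict):
            raise TypeError(f"modification {i} is not a dict")
        missing = {"position", "ccd"} - mod.keys()
        if missing:
            raise ValueError(f"modification {i} lacks keys: {sorted(missing)}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(mod["position"], bool) or not isinstance(mod["position"], int):
            raise TypeError(f"modification {i}: 'position' must be an int")
        if not isinstance(mod["ccd"], str):
            raise TypeError(f"modification {i}: 'ccd' must be a str")


# A single modified residue at position 3, given by its CCD code.
check_modifications([{"position": 3, "ccd": "SEP"}])
```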

__init__()[source]

Initialize an empty parameter collection.

The instance accumulates lists: proteins, dna, rna, ligands, constraints, templates, and properties—all of which are reflected into the resulting YAML during generate_yaml_files().

add_constraint(constraint_type, poses_cols=None, **kwargs)[source]

Add a geometric or pocket constraint.

Parameters:
  • constraint_type (str) – One of typical types such as "bond", "angle", "dihedral", "contact", or "pocket" (see Notes for expected fields).

  • poses_cols (list[str], optional) – Keys in kwargs that should be read from poses.df.

  • **kwargs – Constraint parameters (literal values or column names if listed in poses_cols).

Return type:

None

Examples

Contact constraint between two tokens:

>>> bp.add_constraint(
...     "contact",
...     token1=["A", 42], token2=["B", "CA"], max_distance=6.0
... )

Notes

  • bond/angle/dihedral expect standard token lists like ["CHAIN", RES_IDX/ATOM_NAME].

  • pocket typically expects a binder (chain) and a list of pocket contacts plus an optional max_distance.

add_dna(sequence, id, modifications=None, cyclic=False, poses_cols=None)[source]

Add a DNA entry.

Parameters:
  • sequence (str) – Nucleotide sequence (literal or column name).

  • id (str or list[str]) – Identifier(s) for the DNA entry.

  • modifications (list[dict] or None, optional) – Residue-level modifications for DNA.

  • cyclic (bool, optional) – Whether the polymer is cyclic.

  • poses_cols (list[str], optional) – Keys to interpret as column names in poses.df.

Return type:

None

add_ligand(ligand, id, ligand_type='smiles', poses_cols=None)[source]

Add a ligand entry.

Parameters:
  • ligand (str) – The ligand specification. For ligand_type="smiles", provide a SMILES; for "ccd", provide an RCSB CCD ID.

  • id (str or list[str]) – Ligand ID(s) in the output YAML.

  • ligand_type ({"smiles", "ccd"}) – How to interpret ligand.

  • poses_cols (list[str], optional) – Keys (e.g., ["ligand", "id"]) to read from poses.df. "ligand_type" is not supported as a pose-column.

Return type:

None

Raises:

ValueError – If "ligand_type" is included in poses_cols.

add_property(property_type, poses_cols=None, **kwargs)[source]

Attach arbitrary key–value properties to the YAML.

Parameters:
  • property_type (str) – A top-level property category (e.g., "inference").

  • poses_cols (list[str], optional) – Keys in kwargs that should be read from poses.df.

  • **kwargs – Property payload (literal values or column names if listed in poses_cols).

Return type:

None

Examples

>>> bp = BoltzParams()
>>> bp.add_property('affinity', binder="binder_chain_col", poses_cols=["binder"])
>>> bp.add_property('affinity', binder="B")
add_protein(sequence, id, msa=False, modifications=None, cyclic=False, poses_cols=None)[source]

Helper to add protein entry.

Parameters:
  • sequence (str) – Amino-acid sequence; may be a literal or a column name (see Notes).

  • id (str or list[str]) – Chain ID(s) to use in the YAML; may be literal or a column name.

  • modifications (list[dict] or None, optional) – Per-residue modifications (see _check_modifications_format()), e.g. [{"position": RES_IDX, "ccd": CCD}, …]. Can also be a string pointing to a column in poses.df that contains the modification dicts.

  • cyclic (bool, optional) – Whether the peptide is cyclic.

  • poses_cols (list[str], optional) – Keys that should be read from poses.df instead of used literally, e.g. ["sequence", "id", "modifications"].

  • msa (str | bool)

Return type:

None

Examples

>>> bp.add_protein(sequence="ACDE...", id="A")
>>> bp.add_protein(sequence="seq_col", id="chain_id_col", poses_cols=["sequence", "id"])

Notes

Any key named in poses_cols is treated as a reference to a column in the current pose row when rendering YAML.

add_rna(sequence, id, modifications=None, cyclic=False, poses_cols=None)[source]

Add an RNA entry.

Parameters:
  • sequence (str) – Nucleotide sequence (literal or column name).

  • id (str or list[str]) – Identifier(s) for the RNA entry.

  • modifications (list[dict] or None, optional) – Residue-level modifications for RNA.

  • cyclic (bool, optional) – Whether the polymer is cyclic.

  • poses_cols (list[str], optional) – Keys to interpret as column names in poses.df.

Return type:

None

add_template(template, template_type, poses_cols=None, **kwargs)[source]

Add a structural template. In **kwargs, add the parameters of the given template that you want to use.

Parameters:
  • template (str) – Path or identifier of the template (literal or column name).

  • template_type ({"pdb", "cif"}) – Template format.

  • poses_cols (list[str], optional) – Keys (including any in kwargs) to be read from poses.df.

  • **kwargs – Additional template parameters supported by Boltz (e.g., chain selection, residue ranges).

Returns:

None. See the original Boltz documentation for details on template parameters (https://github.com/jwohlwend/boltz/blob/main/docs/prediction.md).

Return type:

None

generate_yaml_files(poses, out_dir, reset_poses=True)[source]

Render the accumulated parameters into new per-pose .yaml files.

Resolves all values that were marked as pose-columns against poses.df and writes one YAML per pose into out_dir. Optionally updates poses.df["poses"] to point to the new files.

Parameters:
  • poses (Poses) – Poses whose table provides column values for pose-bound fields.

  • out_dir (str) – Output directory where YAML files are written.

  • reset_poses (bool, optional) – If True, replace the poses column with the new YAML paths.

Return type:

None

Raises:

KeyError – If a requested pose-column is missing from poses.df.
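
The resolution of pose-bound fields can be sketched with plain dictionaries. This is an assumption-laden illustration rather than the library's implementation: resolve_entry is a hypothetical helper, and a dict stands in for one row of poses.df.

```python
def resolve_entry(entry, poses_cols, row):
    """Hypothetical helper: replace column references with row values.

    entry      -- keyword arguments as stored by an add_* call
    poses_cols -- keys of entry whose values name columns in the pose table
    row        -- mapping of column name to value for a single pose
    """
    resolved = {}
    for key, value in entry.items():
        if key in poses_cols:
            if value not in row:
                raise KeyError(f"column {value!r} (for {key!r}) not found in pose row")
            resolved[key] = row[value]
        else:
            # Literal values are passed through unchanged.
            resolved[key] = value
    return resolved


row = {"seq_col": "ACDE", "chain_id_col": "A"}
entry = {"sequence": "seq_col", "id": "chain_id_col", "cyclic": False}
print(resolve_entry(entry, poses_cols=["sequence", "id"], row=row))
```

Keeping literals and column references in the same dictionary, with poses_cols marking which keys to look up, means one parameter set can be reused for every pose.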

class protflow.tools.boltz.FlowSeq(iterable=(), /)[source]

Bases: list

Marker list that forces YAML flow style.

When dumped with MyDumper, lists of this type are emitted as [a, b, c] on one line rather than block style. Used to keep compact representations for IDs and token tuples in Boltz YAMLs.

class protflow.tools.boltz.MyDumper(stream, default_style=None, default_flow_style=False, canonical=None, indent=None, width=None, allow_unicode=None, line_break=None, encoding=None, explicit_start=None, explicit_end=None, version=None, tags=None, sort_keys=True)[source]

Bases: SafeDumper

YAML dumper enabling flow-style emission for FlowSeq.
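
The marker-list technique can be reproduced with PyYAML as shown below. The names FlowList and FlowDumper are stand-ins chosen for this sketch (they are not the module's classes), and the example assumes PyYAML is installed.

```python
import yaml


class FlowList(list):
    """Stand-in marker type: lists of this class are dumped in flow style."""


class FlowDumper(yaml.SafeDumper):
    """Stand-in dumper that knows how to represent FlowList."""


def _represent_flow_list(dumper, data):
    # Emit the sequence inline as [a, b, c] instead of one item per line.
    return dumper.represent_sequence("tag:yaml.org,2002:seq", data, flow_style=True)


FlowDumper.add_representer(FlowList, _represent_flow_list)

doc = {"sequences": [{"protein": {"id": FlowList(["A", "B"]), "sequence": "ACDE"}}]}
text = yaml.dump(doc, Dumper=FlowDumper, default_flow_style=False, sort_keys=False)
print(text)
```

Only values wrapped in the marker type are written inline; ordinary lists and mappings keep block style, so the rest of the file stays readable.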

protflow.tools.boltz.boltz_yaml_reader(in_path)[source]

Read a Boltz YAML file into a Python dictionary.

Parameters:

in_path (str) – Path to a .yaml file.

Returns:

Parsed YAML document.

Return type:

dict

protflow.tools.boltz.boltz_yaml_writer(out_path, boltz_yaml)[source]

Write a Boltz YAML document to disk (pretty, stable layout).

Parameters:
  • out_path (str) – Output .yaml path.

  • boltz_yaml (dict) – YAML document to write (will be processed for flow-style lists).

Return type:

None

protflow.tools.boltz.collect_boltz_scores(boltz_output_dir)[source]

Aggregate per-model Boltz outputs into a Pandas DataFrame.

Expects the Boltz output layout {boltz_output_dir}/{input}/predictions/{pose}/ containing:

  • structure files: {pose}_model_*.cif or .pdb

  • confidence JSONs: confidence_{pose}_model_{i}.json

  • optional affinity JSON: affinity_{pose}.json

  • NPZ artifacts per model: plddt_*, pae_*, pde_*

Parameters:

boltz_output_dir (str) – Top-level directory passed to Boltz via --out_dir.

Returns:

One row per model with at least: description, location, and paths for plddt_location, pae_location, pde_location; plus all JSON keys.

Return type:

pandas.DataFrame

Notes

The description column is {pose}_model_{rank} and location points to the corresponding .pdb/.cif model file.
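
A rough, standard-library sketch of walking this layout is given below. It is not the module's implementation: collect_scores_sketch is a hypothetical name, it returns a list of dicts instead of a DataFrame, it ignores the optional affinity JSON, and the NPZ file names ({kind}_{pose}_model_{i}.npz) are an assumption that goes beyond the plddt_*, pae_*, pde_* patterns stated above.

```python
import json
from pathlib import Path


def collect_scores_sketch(boltz_output_dir):
    """Hypothetical collector: one dict per model found in the output tree."""
    rows = []
    for pose_dir in sorted(Path(boltz_output_dir).glob("*/predictions/*")):
        if not pose_dir.is_dir():
            continue
        pose = pose_dir.name
        for conf_json in sorted(pose_dir.glob(f"confidence_{pose}_model_*.json")):
            rank = conf_json.stem.rsplit("_", 1)[-1]
            description = f"{pose}_model_{rank}"
            # Prefer whichever structure file exists for this model.
            candidates = [pose_dir / f"{description}{ext}" for ext in (".cif", ".pdb")]
            location = next((str(c) for c in candidates if c.exists()), None)
            row = {"description": description, "location": location}
            for kind in ("plddt", "pae", "pde"):
                # Assumed naming scheme for the per-model NPZ artifacts.
                row[f"{kind}_location"] = str(pose_dir / f"{kind}_{description}.npz")
            # Merge all keys of the confidence JSON into the row.
            row.update(json.loads(conf_json.read_text()))
            rows.append(row)
    return rows
```

The resulting list can be turned into a table with pandas.DataFrame(rows) if pandas is available.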

protflow.tools.boltz.convert_chain_seq_dict_to_yaml_dict(chain_seq_dict, msa=None, ignore_nonexistent_msa_file=False)[source]

Convert a chain→sequence mapping into Boltz YAML "protein" entries. When msa is set to "server", the function writes msa: empty for each chain; pass --use_msa_server in the Boltz options so that the MSAs are fetched from the server.

Parameters:
  • chain_seq_dict (dict[str, str]) – Mapping from chain ID to amino-acid sequence.

  • msa ({"server", "empty", "auto"} or str or None, optional) – If "server"/"empty"/"auto"/None → write "msa": "empty" per chain. If a string path → use it as the MSA file for all chains (the file must exist unless ignore_nonexistent_msa_file=True).

  • ignore_nonexistent_msa_file (bool, optional) – If True, skip the existence check for the path given in msa.

Returns:

One dict per chain with keys id, sequence, and msa.

Return type:

list of dict

Raises:
  • FileNotFoundError – If msa is a path that does not exist and ignore_nonexistent_msa_file is False.

  • ValueError – If msa is not one of the accepted values.

Examples

>>> convert_chain_seq_dict_to_yaml_dict({"A": "ACDE", "B": "FGHI"}, msa="empty")
[{'id': 'A', 'sequence': 'ACDE', 'msa': 'empty'}, {'id': 'B', 'sequence': 'FGHI', 'msa': 'empty'}]
protflow.tools.boltz.convert_poses_to_boltz_yaml(poses, prefix, msa=None, overwrite=True, reset_poses=True)[source]

Convert input poses to Boltz-compatible YAMLs.

Currently only the protein sequence is read from each pose; ligands and other entities are not converted.

Creates one YAML per pose under {poses.work_dir}/{prefix}, encoding chain sequences (and MSA choice) for Boltz. Optionally updates poses.df["poses"] to point to the newly created YAMLs.

Parameters:
  • poses (Poses) – Input poses (protflow.poses.Poses class); the poses must be in FASTA, PDB, or CIF format.

  • prefix (str) – Subdirectory name under poses.work_dir where YAMLs are written.

  • msa (str or None) – One of "server", "empty", or a path to a custom .a3m file. "server" writes empty MSA entries and expects Boltz to fetch MSAs.

  • overwrite (bool, optional) – If True, existing YAMLs for the same prefix are replaced.

  • reset_poses (bool, optional) – If True, replace the poses column with YAML paths.

Return type:

None

Raises:
  • KeyError – If the output columns for this prefix already exist in poses.df.

  • ValueError – If msa is neither "server", "empty", a valid path, nor None.

Examples

>>> convert_poses_to_boltz_yaml(poses, prefix="boltz_inputs", msa="empty")
>>> convert_poses_to_boltz_yaml(poses, prefix="boltz_inputs_srv", msa="server", reset_poses=False)

Notes

  • The function is sequence-centric (ligands/templates/properties are handled later via BoltzParams).

protflow.tools.boltz.edit_boltz_yaml(*args, **kwargs)[source]

Placeholder for future YAML editing utilities.

Raises:

NotImplementedError – Always raised; function is a stub.

Return type:

None

protflow.tools.boltz.idx_to_char(idx)[source]

Convert a 0-based index to letters like Excel columns: 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', …

Parameters:

idx (int)

Return type:

str
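
The Excel-style mapping can be implemented with a bijective base-26 conversion. The sketch below is one way to obtain the documented values; idx_to_char_sketch is a hypothetical name and the module's own implementation may differ.

```python
def idx_to_char_sketch(idx):
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ... (bijective base 26)."""
    if idx < 0:
        raise ValueError("idx must be non-negative")
    letters = []
    n = idx + 1  # switch to 1-based numbering, as used for spreadsheet columns
    while n > 0:
        # Subtracting 1 before dividing makes 'Z' a valid last digit.
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))


print([idx_to_char_sketch(i) for i in (0, 25, 26, 27)])  # ['A', 'Z', 'AA', 'AB']
```

Such a mapping is useful for generating chain IDs when more than 26 chains are needed.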

protflow.tools.attnpacker module

AttnPacker Module

This module provides the functionality to integrate AttnPacker within the ProtFlow framework. It offers tools to run AttnPacker, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The AttnPacker class encapsulates the functionality necessary to execute AttnPacker runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of packing processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem. The module is designed to streamline the integration of AttnPacker into larger computational workflows. It supports the automatic setup of job parameters, execution of AttnPacker commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the AttnPacker class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the packing process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the AttnPacker class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.attnpacker import AttnPacker

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the AttnPacker class
attnpacker = AttnPacker()

# Run the packing process
results = attnpacker.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the packing process.

  • Customizability: Users can customize the packing process through multiple parameters, including the number of packings, specific options for the AttnPacker script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate AttnPacker into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.attnpacker.AttnPacker(attnpacker_dir=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

AttnPacker Class

The AttnPacker class is a specialized class designed to facilitate the execution of AttnPacker within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with packing processes.

Detailed Description

The AttnPacker class manages all aspects of running AttnPacker simulations. It handles the configuration of necessary scripts and executables, prepares the environment for packing processes, and executes the packing commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to AttnPacker scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of AttnPacker commands with support for multiple packings.

  • Collecting and processing output data into a pandas DataFrame.

  • Ensuring robust error handling and logging for easier debugging and verification.

Returns:

An instance of the AttnPacker class, configured to run AttnPacker processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • KeyError

Examples

Here is an example of how to initialize and use the AttnPacker class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.attnpacker import AttnPacker

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the AttnPacker class
attnpacker = AttnPacker()

# Run the packing process
results = attnpacker.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the packing process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The AttnPacker class is intended for researchers and developers who need to perform packing simulations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(attnpacker_dir=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

sbatch_options are set automatically, but can also be manually set. Manual setting is not recommended.

Parameters:
  • attnpacker_dir (str | None)

  • python_path (str | None)

  • pre_cmd (str | None)

  • jobstarter (str)

Return type:

None

run(poses, prefix, jobstarter=None, overwrite=False)[source]

Execute the AttnPacker process with given poses and jobstarter configuration.

This method sets up and runs the AttnPacker process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An updated Poses object containing the processed poses and results of the AttnPacker process.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • KeyError – If forbidden options are included in the command parameters.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.attnpacker import AttnPacker

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the AttnPacker class
attnpacker = AttnPacker()

# Run the packing process
results = attnpacker.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It sets up specific directories for output PDBs and checks for existing score files.

  • Output Management: The method handles the collection and processing of output data, reading scores from a CSV file and organizing results into a structured DataFrame. It ensures that results are accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the packing process to their specific needs. Users can specify additional options and pose-specific parameters for the AttnPacker script.

This method is designed to streamline the execution of AttnPacker processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze packing simulations.

write_cmd(json_path, output_dir)[source]

Write the command to run the AttnPacker script for a given pose.

This method constructs the command line string necessary to execute the AttnPacker script for a given pose. It incorporates the specified options and pose-specific parameters, ensuring that the command is correctly formatted and includes all required arguments. It also checks for forbidden options to prevent conflicts.

Parameters:
  • pose_path (str) – The path to the input PDB file for the pose.

  • output_dir (str) – The directory where output files will be stored.

  • json_path (str)

Returns:

The command line string to execute the AttnPacker script with the specified parameters.

Return type:

str

Raises:

KeyError – If forbidden options are included in the command parameters.

Examples

Here is an example of how to use the write_cmd method:

from protflow.tools.attnpacker import AttnPacker

# Initialize the AttnPacker class
attnpacker = AttnPacker()

# Define the input JSON path and output directory (placeholder paths)
json_path = "input.json"
output_dir = "/path/to/output"

# Write the command for this input
cmd = attnpacker.write_cmd(
    json_path=json_path,
    output_dir=output_dir,
)

# Print the command
print(cmd)
Further Details:
  • Command Construction: This method ensures that the command string is correctly constructed with all necessary arguments. It includes paths to the script directory, output directory, input PDB file, and score file, as well as any additional options.

  • Validation: The method checks for forbidden options that could conflict with the required arguments, raising a KeyError if any are found. This helps ensure that the command is valid and will run correctly.

This method is designed to facilitate the execution of AttnPacker processes within the ProtFlow framework, providing a flexible and robust way to construct and run commands for packing simulations.


protflow.tools.attnpacker.collect_scores(scores_dir)[source]
Parameters:

scores_dir (str)

protflow.tools.colabfold module

ColabFold Module

This module provides functionality to integrate ColabFold within the ProtFlow framework, enabling the execution of AlphaFold2 runs on ColabFold. It includes tools to handle inputs, execute runs, and process outputs in a structured and automated manner.

Detailed Description

The ColabFold class encapsulates all necessary functionalities to run AlphaFold2 through ColabFold. It manages the configuration of essential scripts and paths, sets up the environment, and handles the execution of prediction processes. The class also includes methods for collecting and processing output data, ensuring the results are organized and accessible for further analysis within the ProtFlow ecosystem.

This module streamlines the integration of ColabFold into larger computational workflows by supporting the automatic setup of job parameters, execution of ColabFold commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the ColabFold class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the prediction process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the ColabFold class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.colabfold import Colabfold

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the Colabfold class
colabfold = Colabfold()

# Run the prediction process
results = colabfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--msa-mode single_sequence",
    pose_options=None,
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the prediction process.

  • Customizability: Users can customize the prediction process through multiple parameters, including global options for the ColabFold script, pose-specific options, and the number of top-ranked poses to return.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate ColabFold into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.colabfold.Colabfold(script_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

ColabFold Class

The ColabFold class is a specialized class designed to facilitate the execution of AlphaFold2 within the ColabFold environment as part of the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with AlphaFold2 prediction processes.

Detailed Description

The ColabFold class manages all aspects of running AlphaFold2 predictions through ColabFold. It handles the configuration of necessary scripts and executables, prepares the environment for the prediction processes, and executes the prediction commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to ColabFold scripts and necessary directories.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of AlphaFold2 prediction commands with support for batch processing.

  • Collecting and processing output data into a pandas DataFrame.

  • Managing input FASTA files and preparing them for prediction.

  • Overwriting previous results if specified.

rtype:

An instance of the `ColabFold class`, configured to run AlphaFold2 prediction processes and handle outputs efficiently.

raises FileNotFoundError:

raises ValueError:

raises TypeError:

Examples

Here is an example of how to initialize and use the ColabFold class:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.colabfold import Colabfold

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the Colabfold class
colabfold = Colabfold()

# Run the prediction process
results = colabfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--msa-mode single_sequence",
    pose_options=None,
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the prediction process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ColabFold class is intended for researchers and developers who need to perform AlphaFold2 predictions as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(script_path=None, pre_cmd=None, jobstarter=None)[source]
__init__ Method

The __init__ method initializes an instance of the ColabFold class, setting up necessary configurations for running AlphaFold2 predictions through ColabFold within the ProtFlow framework.

Detailed Description

This method sets up the paths to the ColabFold script and initializes default values for various attributes required for running predictions. It also allows for the optional configuration of a job starter.

param script_path:

The path to the ColabFold script. Defaults to protflow.config.COLABFOLD_SCRIPT_PATH.

type script_path:

str, optional

param jobstarter:

An instance of the JobStarter class for managing job execution. Defaults to None.

type jobstarter:

JobStarter, optional

returns:

None

raises ValueError:

If the script_path is not provided.

Examples

Here is an example of how to initialize the ColabFold class:

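from protflow.jobstarters import LocalJobStarter
from protflow.tools.colabfold import Colabfold

# Initialize the Colabfold class with an explicit script path and jobstarter
jobstarter = LocalJobStarter(max_cores=4)
colabfold = Colabfold(script_path='/path/to/colabfold.py', jobstarter=jobstarter)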
Return type:

None

prep_a3m_for_prediction(poses, fasta_dir, max_filenum)[source]

TODO: Write Docstring.

Return type:

list[str]

prep_fastas_for_prediction(poses, fasta_dir, max_filenum)[source]
prep_fastas_for_prediction Method

The prep_fastas_for_prediction method prepares input FASTA files for AlphaFold2 predictions by splitting the input sequences into batches and writing them to files.

Detailed Description

This method divides the input protein sequences into batches, which can help in managing computational resources effectively. It writes the batches into FASTA files stored in the specified directory.

param poses:

List of paths to input FASTA files.

type poses:

list[str]

param fasta_dir:

Directory where the new FASTA files will be written.

type fasta_dir:

str

param max_filenum:

Maximum number of FASTA files to write.

type max_filenum:

int

returns:

List of paths to the prepared FASTA files.

rtype:

list[str]

raises None:

Examples

Here is an example of how to use the prep_fastas_for_prediction method:

# Prepare input FASTA files for prediction
fastas = colabfold.prep_fastas_for_prediction(poses=poses_list, fasta_dir='/path/to/fasta_dir', max_filenum=10)
Return type:

list[str]

run(poses, prefix, jobstarter=None, options=None, pose_options=None, overwrite=False, return_top_n_poses=1)[source]
run Method

The run method of the ColabFold class executes AlphaFold2 predictions using ColabFold within the ProtFlow framework. It manages the setup, execution, and result collection processes, providing a streamlined way to integrate AlphaFold2 predictions into larger computational workflows.

Detailed Description

This method orchestrates the entire prediction process, from preparing input data and configuring the environment to running the prediction commands and collecting the results. The method supports batch processing of input FASTA files and handles various edge cases, such as overwriting existing results and managing job starter options.

param poses:

The Poses object containing the protein structures.

type poses:

Poses

param prefix:

A prefix used to name and organize the output files.

type prefix:

str

param jobstarter:

An instance of the JobStarter class, which manages job execution. Defaults to None.

type jobstarter:

JobStarter, optional

param options:

Additional options for the AlphaFold2 prediction commands. Defaults to None.

type options:

str, optional

param pose_options:

Specific options for handling pose-related parameters during prediction. Defaults to None.

type pose_options:

str, optional

param overwrite:

If True, existing results will be overwritten. Defaults to False.

type overwrite:

bool, optional

param return_top_n_poses:

The number of top poses to return based on the prediction scores. Defaults to 1.

type return_top_n_poses:

int, optional

returns:

A Poses object whose DataFrame is augmented with the results of the AlphaFold2 predictions.

rtype:

Poses

raises FileNotFoundError:

If required files or directories are not found during the execution process.

raises ValueError:

If invalid arguments are provided to the methods.

raises TypeError:

If pose options are not of the expected type.

Examples

Here is an example of how to use the run method of the ColabFold class:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.colabfold import Colabfold

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the Colabfold class
colabfold = Colabfold()

# Run the prediction process
results = colabfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--msa-mode single_sequence",
    pose_options=None,
    overwrite=True
)

# Access and process the results
print(results)
Further Details
  • Batch Processing: The method can handle large sets of input sequences by batching them into smaller groups, which helps in managing computational resources effectively.

  • Overwrite Handling: If overwrite is set to True, the method will clean up previous results, ensuring that the new predictions do not get mixed up with old data.

  • Job Starter Configuration: The method allows for flexible job management by accepting a JobStarter instance. If not provided, it uses the default job starter associated with the poses.

  • Score Collection: The method gathers the prediction scores and relevant data into a pandas DataFrame, facilitating easy analysis and integration with other ProtFlow components.

  • Error Handling: Robust error handling is incorporated to manage issues such as missing files or incorrect configurations, ensuring that the process can be debugged and verified efficiently.

Return type:

Poses

write_cmd(pose_path, output_dir, options=None, pose_options=None)[source]
write_cmd Method

The write_cmd method constructs the command string necessary to run the ColabFold script with the specified options and input files.

Detailed Description

This method generates the command string used to execute the ColabFold script. It incorporates various options and pose-specific parameters provided by the user.

param pose_path:

Path to the input FASTA file.

type pose_path:

str

param output_dir:

Directory where the prediction outputs will be stored.

type output_dir:

str

param options:

Additional options for the ColabFold script. Defaults to None.

type options:

str, optional

param pose_options:

Specific options for handling pose-related parameters. Defaults to None.

type pose_options:

str, optional

returns:

The constructed command string.

rtype:

str

raises None:

Examples

Here is an example of how to use the write_cmd method:

# Write the command to run ColabFold
cmd = colabfold.write_cmd(pose_path='/path/to/pose.fa', output_dir='/path/to/output_dir', options='--msa-mode single_sequence', pose_options='--num-recycle 3')
Parameters:
  • pose_path (str)

  • output_dir (str)

  • options (str)

  • pose_options (str)
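As a rough picture of how such a command string might be assembled, here is a minimal sketch. The helpers parse_options and compose_cmd are hypothetical, and the rule that pose-specific options override global ones is an assumption of this sketch rather than documented ProtFlow behavior. The flags in the usage line are regular colabfold_batch options.

```python
import shlex


def parse_options(options):
    """Split an option string into {flag: value}; bare flags map to None."""
    parsed = {}
    tokens = shlex.split(options or "")
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("--"):
            raise ValueError(f"Unexpected token: {token!r}")
        if "=" in token:
            # "--flag=value" form
            key, value = token[2:].split("=", 1)
            i += 1
        elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            # "--flag value" form
            key, value = token[2:], tokens[i + 1]
            i += 2
        else:
            # bare "--flag"
            key, value = token[2:], None
            i += 1
        parsed[key] = value
    return parsed


def compose_cmd(script_path, pose_path, output_dir, options=None, pose_options=None):
    """Build a command string; pose options win over global options (assumption)."""
    merged = {**parse_options(options), **parse_options(pose_options)}
    parts = [script_path]
    for key, value in merged.items():
        parts.append(f"--{key}" if value is None else f"--{key} {shlex.quote(value)}")
    # Positional arguments go last: input file, then results directory.
    parts += [shlex.quote(pose_path), shlex.quote(output_dir)]
    return " ".join(parts)


cmd = compose_cmd(
    "colabfold_batch", "batch_0.fa", "out",
    options="--msa-mode single_sequence --amber",
    pose_options="--num-recycle 3",
)
print(cmd)
```

Parsing both strings into dictionaries before joining them keeps duplicate flags out of the final command.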

protflow.tools.colabfold.calculate_poses_interaction_pae(prefix, poses, pae_list_col, binder_start, binder_end, target_start, target_end)[source]
Parameters:
  • prefix (str)

  • poses (Poses)

  • pae_list_col (str)

  • binder_start (int)

  • binder_end (int)

  • target_start (int)

  • target_end (int)

Return type:

Poses
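This function carries no description here. One common way to define an interaction PAE is the mean predicted aligned error over all binder-target residue pairs, taken in both directions. The sketch below illustrates that definition only; whether calculate_poses_interaction_pae computes exactly this, and whether its indices are 0-based with exclusive ends as assumed here, is not confirmed by this documentation. The matrix values are made up.

```python
from statistics import mean


def interaction_pae(pae, binder_start, binder_end, target_start, target_end):
    """Mean PAE over binder->target and target->binder residue pairs.

    Illustrative helper: indices are 0-based with exclusive end positions,
    which is an assumption of this sketch.
    """
    binder = range(binder_start, binder_end)
    target = range(target_start, target_end)
    values = [pae[i][j] for i in binder for j in target]
    values += [pae[j][i] for i in binder for j in target]
    return mean(values)


# Toy 3x3 PAE matrix: residues 0-1 form the binder, residue 2 the target.
pae = [
    [0.5, 1.0, 8.0],
    [1.0, 0.5, 6.0],
    [7.0, 5.0, 0.5],
]
score = interaction_pae(pae, 0, 2, 2, 3)
```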

protflow.tools.colabfold.collect_scores(work_dir, num_return_poses=1)[source]

collect_scores Function

The collect_scores function collects and processes the prediction scores from the ColabFold output, organizing them into a pandas DataFrame for further analysis.

Detailed Description

This method gathers the prediction scores from the output files generated by ColabFold. It processes these scores and organizes them into a structured DataFrame, which includes various statistical measures.

param work_dir:

The working directory where the ColabFold outputs are stored.

type work_dir:

str

param num_return_poses:

The number of top poses to return based on the prediction scores. Defaults to 1.

type num_return_poses:

int, optional

returns:

A DataFrame containing the collected and processed scores.

rtype:

pd.DataFrame

raises FileNotFoundError:

If no output files are found in the specified directory.

Examples

Here is an example of how to use the collect_scores method:

# Collect and process the prediction scores
scores = collect_scores(work_dir='/path/to/work_dir', num_return_poses=5)

Further Details

  • JSON and PDB File Parsing: The method identifies and parses JSON and PDB files generated by ColabFold, extracting relevant score information from these files.

  • Statistical Measures: For each set of predictions, the method calculates various statistics, including mean pLDDT, max PAE, and PTM scores, organizing these measures into the DataFrame.

  • Rank and Description: The scores are ranked and annotated with descriptions to help identify the top poses based on prediction quality.

  • File Handling: The method includes robust file handling to ensure that only the relevant files are processed, and any existing files are correctly identified or overwritten as needed.

  • Pose Location: The final DataFrame includes the file paths to the predicted PDB files, facilitating easy access for further analysis or visualization.
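The statistics and ranking steps listed above can be sketched with the standard library alone. The function names, the dictionary keys, and the choice to rank by mean pLDDT are assumptions made for illustration; the actual function returns a pandas DataFrame and may rank differently. All score values are made up.

```python
from statistics import mean


def summarize_prediction(description, plddt, pae, ptm):
    """Reduce per-residue pLDDT and a PAE matrix to summary statistics."""
    return {
        "description": description,
        "mean_plddt": mean(plddt),
        "max_pae": max(max(row) for row in pae),
        "ptm": ptm,
    }


def top_n(predictions, n=1):
    """Rank by mean pLDDT (highest first) and keep the best n entries."""
    ranked = sorted(predictions, key=lambda p: p["mean_plddt"], reverse=True)
    return [{**p, "rank": i} for i, p in enumerate(ranked[:n], start=1)]


# Toy inputs standing in for values parsed from ColabFold score files.
models = [
    summarize_prediction("model_1", [80.0, 90.0], [[0.5, 4.0], [3.0, 0.5]], 0.71),
    summarize_prediction("model_2", [92.0, 94.0], [[0.4, 2.0], [2.5, 0.4]], 0.83),
]
best = top_n(models, n=1)
```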

Parameters:
  • work_dir (str)

  • num_return_poses (int)

Return type:

DataFrame

protflow.tools.esmfold module

ESMFold Module

This module provides the functionality to integrate ESMFold within the ProtFlow framework. It offers tools to run ESMFold, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The ESMFold class encapsulates the functionality necessary to execute ESMFold runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of folding processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of ESMFold into larger computational workflows. It supports the automatic setup of job parameters, execution of ESMFold commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the ESMFold class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the folding process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the ESMFold class within a ProtFlow pipeline on a SLURM based queueing system:

from protflow.poses import Poses
from protflow.jobstarters import SbatchArrayJobStarter
from protflow.tools.esmfold import ESMFold

# Create instances of necessary classes
poses = Poses()
jobstarter = SbatchArrayJobStarter(max_cores=10, gpus=1) # 1 gpu per node, 10 nodes at once.

# Initialize the ESMFold class
esmfold = ESMFold()

# Run the folding process
results = esmfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_batches=5,
    options="--additional_option=value",
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the folding process.

  • Customizability: Users can customize the folding process through multiple parameters, including the number of batches, specific options for the ESMFold script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate ESMFold into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.esmfold.ESMFold(python_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

ESMFold Class

The ESMFold class is a specialized class designed to facilitate the execution of ESMFold within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with ESMFold processes.

Detailed Description

The ESMFold class manages all aspects of running ESMFold predictions. It handles the configuration of necessary scripts and executables, prepares the environment for folding processes, and executes the folding commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to ESMFold scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of ESMFold commands with support for multiple batches.

  • Collecting and processing output data into a pandas DataFrame.

  • Preparing input FASTA files for ESMFold predictions and managing output directories.

rtype:

An instance of the `ESMFold class`, configured to run ESMFold processes and handle outputs efficiently.

raises FileNotFoundError:

raises ValueError:

raises KeyError:

Examples

Here is an example of how to initialize and use the ESMFold class:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.esmfold import ESMFold

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the ESMFold class
esmfold = ESMFold()

# Run the folding process
results = esmfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_batches=5,
    options="--additional_option=value",
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the folding process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ESMFold class is intended for researchers and developers who need to run ESMFold predictions as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(python_path=None, pre_cmd=None, jobstarter=None)[source]

Initialize the ESMFold class with necessary configurations.

This method sets up the ESMFold class, configuring paths to essential scripts and Python executables, and setting up the environment for executing ESMFold processes.

Parameters:
  • python_path (str, optional) – The path to the Python executable used for running ESMFold. Defaults to the value specified in protflow.config.ESMFOLD_PYTHON_PATH.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • pre_cmd (str | None)

Raises:

ValueError – If no path is set for the ESMFold scripts or Python executable.

Return type:

None

Examples

Here is an example of how to initialize the ESMFold class:

from protflow.jobstarters import LocalJobStarter
from protflow.tools.esmfold import ESMFold

# Initialize the ESMFold class with default configurations
esmfold = ESMFold()

# Initialize the ESMFold class with a custom Python path and jobstarter
custom_python_path = "/path/to/custom/python"
jobstarter = LocalJobStarter(max_cores=4)
esmfold = ESMFold(python_path=custom_python_path, jobstarter=jobstarter)
Further Details:
  • Configuration: This method sets the script path for ESMFold inference and the Python path. If these are not correctly set in the configuration, a ValueError is raised.

  • Initialization: The method initializes necessary attributes, including the script path, Python path, jobstarter, and other configurations needed for running ESMFold processes.

This method prepares the ESMFold class for running folding simulations, ensuring that all necessary configurations and paths are correctly set up.

prep_fastas_for_prediction(poses, fasta_dir, max_filenum)[source]

Prepare input FASTA files for ESMFold predictions.

This method splits the input poses into the specified number of batches, prepares the FASTA files, and writes them to the specified directory for ESMFold predictions.

Parameters:
  • poses (list[str]) – List of paths to FASTA files.

  • fasta_dir (str) – Directory to which the new FASTA files should be written.

  • max_filenum (int) – Maximum number of FASTA files to be written.

Returns:

List of paths to the prepared FASTA files.

Return type:

list[str]

Examples

Here is an example of how to use the prep_fastas_for_prediction method:

from protflow.tools.esmfold import ESMFold

# Initialize the ESMFold class
esmfold = ESMFold()

# Prepare FASTA files for prediction
fasta_paths = esmfold.prep_fastas_for_prediction(
    poses=["pose1.fa", "pose2.fa", "pose3.fa"],
    fasta_dir="/path/to/fasta_dir",
    max_filenum=2
)

# Access the prepared FASTA files
print(fasta_paths)
Further Details:
  • Input Preparation: The method merges and splits the input FASTA files into the specified number of batches. It ensures that the FASTA files are correctly formatted and written to the specified directory.

  • Customization: Users can specify the maximum number of FASTA files to be created, allowing for flexibility in managing input data for parallel processing.

  • Output Management: The method returns a list of paths to the newly created FASTA files, which are ready for ESMFold predictions.

This method is designed to facilitate the preparation of input data for ESMFold, ensuring that the input FASTA files are organized and ready for processing.
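The merge-and-split step can be illustrated with a minimal, dependency-free sketch. The helper names are hypothetical, and the strided assignment of records to batches is a choice made for this sketch; the actual method may split contiguously and also writes the batches to fasta_dir.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records, max_filenum):
    """Distribute records over at most max_filenum batches of near-equal size."""
    if max_filenum < 1:
        raise ValueError("max_filenum must be at least 1")
    n_batches = min(max_filenum, len(records))
    # Strided slicing keeps batch sizes within one record of each other.
    return [records[i::n_batches] for i in range(n_batches)]


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Three made-up sequences merged into one list, then split into two batches.
records = read_fasta(">a\nMKT\n>b\nGGS\n>c\nAAA\n")
batches = split_into_batches(records, max_filenum=2)
```

Capping the batch count at the number of records avoids writing empty FASTA files when there are fewer inputs than max_filenum.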

run(poses, prefix, jobstarter=None, options=None, overwrite=False, num_batches=None)[source]

Execute the ESMFold process with given poses and jobstarter configuration.

This method sets up and runs the ESMFold process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the input poses to be predicted.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • options (str, optional) – Additional options for the ESMFold script. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • num_batches (int, optional) – The number of batches to split the input poses into for parallel processing. Defaults to None.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the ESMFold process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • KeyError – If forbidden options are included in the command options.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.esmfold import ESMFold

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the ESMFold class
esmfold = ESMFold()

# Run the folding process
results = esmfold.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_batches=5,
    options="--additional_option=value",
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Input Preparation: The method prepares input FASTA files by splitting the input poses into the specified number of batches, optimizing the use of parallel computing resources.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized into a pandas DataFrame and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the folding process to their specific needs.

This method is designed to streamline the execution of ESMFold processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze folding simulations.

write_cmd(pose_path, output_dir, options)[source]

Write the command to run ESMFold with the given parameters.

This method constructs the command line instruction needed to execute the ESMFold script with the specified pose path, output directory, and additional options.

Parameters:
  • pose_path (str) – The path to the input FASTA file.

  • output_dir (str) – The directory where the ESMFold outputs will be stored.

  • options (str) – Additional command-line options for the ESMFold script.

Returns:

The constructed command line instruction for running ESMFold.

Return type:

str

Examples

Here is an example of how to use the write_cmd method:

from protflow.tools.esmfold import ESMFold

# Initialize the ESMFold class
esmfold = ESMFold()

# Write the ESMFold command
cmd = esmfold.write_cmd(
    pose_path="/path/to/pose.fa",
    output_dir="/path/to/output_dir",
    options="--additional_option=value"
)

# Access the command
print(cmd)
Further Details:
  • Command Construction: The method parses additional options and constructs the full command line instruction needed to run the ESMFold script. It ensures that the mandatory arguments, such as the FASTA file and output directory, are correctly included.

  • Customization: Users can specify additional command-line options to customize the execution of the ESMFold script, allowing for flexibility in configuring the prediction process.

This method is designed to facilitate the execution of ESMFold by constructing the necessary command line instructions, ensuring that all required parameters and options are included.

protflow.tools.esmfold.collect_esmfold_scores(work_dir)[source]

Collect and process the scores from ESMFold output.

This method collects the JSON and PDB output files from ESMFold predictions, processes the data, and organizes it into a pandas DataFrame.

Parameters:
  • work_dir (str) – The working directory where ESMFold output files are stored.

Returns:

A DataFrame containing the collected scores and corresponding file locations.

Return type:

pd.DataFrame

Examples

Here is an example of how to use the collect_esmfold_scores function:

from protflow.tools.esmfold import collect_esmfold_scores

# Collect scores from ESMFold output
scores_df = collect_esmfold_scores(work_dir="/path/to/work_dir")

# Access the collected scores
print(scores_df)
Further Details:
  • Output Collection: The method scans the working directory for JSON and PDB output files, reads the JSON files into a DataFrame, and merges it with the locations of the PDB files.

  • Data Organization: The method organizes the collected data into a structured DataFrame, making it accessible for further analysis.

  • Output Management: The method also ensures that the temporary directories and files used during the prediction process are cleaned up, maintaining a tidy working environment.

This method is designed to streamline the collection and processing of ESMFold output data, ensuring that the results are organized and accessible for further analysis.
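The collection step described above, pairing score files with structure files, can be sketched as follows. This is a standard-library illustration, not the ProtFlow implementation: it assumes one JSON file per prediction that shares its file stem with the PDB, and it returns a list of dictionaries where the real function returns a pandas DataFrame. The helper name, the column names, and the file contents are all made up.

```python
import json
import tempfile
from pathlib import Path


def collect_outputs(work_dir):
    """Return one row per prediction: JSON scores plus the matching PDB path."""
    work_dir = Path(work_dir)
    # Index structure files by stem so score files can be matched to them.
    pdbs = {path.stem: path for path in work_dir.rglob("*.pdb")}
    rows = []
    for json_path in sorted(work_dir.rglob("*.json")):
        pdb = pdbs.get(json_path.stem)
        if pdb is None:
            continue  # skip score files without a matching structure
        scores = json.loads(json_path.read_text())
        rows.append({**scores, "description": json_path.stem, "location": str(pdb)})
    return rows


# Demo on a throwaway directory containing made-up output files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "design_a.pdb").write_text("END\n")
    Path(tmp, "design_a.json").write_text(json.dumps({"plddt": 88.5}))
    Path(tmp, "design_b.json").write_text(json.dumps({"plddt": 70.0}))  # no PDB
    rows = collect_outputs(tmp)
```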

protflow.tools.ligandmpnn module

LigandMPNN Module

This module provides the functionality to integrate LigandMPNN within the ProtFlow framework. It offers tools to run LigandMPNN, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The LigandMPNN class encapsulates the functionality necessary to execute LigandMPNN runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of the sequence design processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of LigandMPNN into larger computational workflows. It supports the automatic setup of job parameters, execution of LigandMPNN commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the LigandMPNN class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the LigandMPNN class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.tools.ligandmpnn import LigandMPNN

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Run the sequence design process
results = ligandmpnn.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    nseq=10,
    model_type="ligand_mpnn",
    options="some_option=some_value",
    pose_options=["pose_option=pose_value"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the process.

  • Customizability: Users can customize the process through multiple parameters, including the number of sequences, specific options for the LigandMPNN script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate LigandMPNN into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.ligandmpnn.LigandMPNN(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

LigandMPNN Class

The LigandMPNN class provides the necessary methods to execute LigandMPNN runs within the ProtFlow framework. This class is responsible for managing the configuration, execution, and output processing of LigandMPNN tasks.

Detailed Description

The LigandMPNN class integrates LigandMPNN into the ProtFlow pipeline by setting up the environment, running the sequence design process, and collecting the results. It ensures that the inputs and outputs are handled efficiently, making the data readily available for further analysis.

Key Features:
  • Manages paths to essential scripts and executables.

  • Configures and executes LigandMPNN processes.

  • Collects and processes output data into a structured DataFrame format.

  • Handles various edge cases and supports custom configurations through multiple parameters.

Usage

To use this class, initialize it with the appropriate script and Python paths, along with an optional job starter. The main functionality is provided through the run method, which requires parameters such as poses, prefix, and additional options for customization.

Example

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.ligandmpnn import LigandMPNN

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Run LigandMPNN sequence design
results = ligandmpnn.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    nseq=10,
    model_type="ligand_mpnn",
    options="some_option=some_value",
    pose_options=["pose_option=pose_value"],
    overwrite=True
)

# Access and process the results
print(results)

Notes

This class is designed to work within the ProtFlow framework and assumes that the necessary configurations and dependencies are properly set up. It leverages shared data structures and configurations from ProtFlow to provide a seamless integration experience.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

__init__(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Initializes the LigandMPNN class.

Parameters:
  • script_path (str, optional) – The path to the LigandMPNN script. Defaults to the configured script path in ProtFlow.

  • python_path (str, optional) – The path to the Python executable to run the LigandMPNN script. Defaults to the configured Python path in ProtFlow.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job submissions. If not provided, it will use the default job starter configuration.

  • pre_cmd (str | None)

Return type:

None

Detailed Description

The __init__ method sets up the necessary paths and configurations for running LigandMPNN. It searches for the provided script and Python paths to ensure they are correct and sets them as instance attributes. Additionally, it initializes the job starter, which manages the execution of jobs in high-performance computing (HPC) environments. This method ensures that all configurations are correctly set up before running any LigandMPNN tasks.

check_for_batch_run(pose_options, pose_opt_cols)[source]

Checks if LigandMPNN can be run in batch mode.

This method determines whether the LigandMPNN process can be executed in batch mode. It does this by checking if pose-specific options are not provided and if only multi-residue columns are specified in the pose options.

Parameters:
  • pose_options (str) – Pose-specific options for the LigandMPNN script.

  • pose_opt_cols (dict) – Dictionary of pose-specific options for the LigandMPNN script.

Returns:

True if LigandMPNN can be run in batch mode, False otherwise.

Return type:

bool

Examples

Here is an example of how to use the check_for_batch_run method:

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Check for batch run
can_batch_run = ligandmpnn.check_for_batch_run(
    pose_options=None,
    pose_opt_cols={"fixed_residues": "fixed_res_col"}
)

print(can_batch_run)  # Outputs: True or False

Further Details:
  • Batch Mode Check: The method checks if the pose_options is None and if the pose_opt_cols contains only multi-residue columns, which are necessary for batch processing.

multi_cols_only(pose_opt_cols)[source]

Checks whether pose_opt_cols contains only multi-residue (_multi) columns. Only _multi arguments can be used for ligandmpnn_batch runs.

Parameters:

pose_opt_cols (dict)

Return type:

bool
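
The batch-mode rule described for check_for_batch_run and multi_cols_only can be sketched in plain Python. This is a minimal illustration, not ProtFlow's implementation: the helper names are hypothetical, and it assumes that multi-residue options are recognised by a "_multi" suffix on the option key.

```python
def multi_cols_only_sketch(pose_opt_cols: dict) -> bool:
    """Return True if every option key looks like a multi-residue option.

    Assumption: multi-residue options carry a "_multi" suffix.
    """
    return all(key.endswith("_multi") for key in pose_opt_cols)


def can_batch_run_sketch(pose_options, pose_opt_cols) -> bool:
    """Batch mode requires no pose_options and only _multi option columns."""
    return pose_options is None and multi_cols_only_sketch(pose_opt_cols or {})


# Only _multi columns and no pose_options: batch mode is possible.
print(can_batch_run_sketch(None, {"fixed_residues_multi": "fixed_res_col"}))
# A per-pose (non-_multi) column rules batch mode out.
print(can_batch_run_sketch(None, {"fixed_residues": "fixed_res_col"}))
# Explicit pose_options also rule batch mode out.
print(can_batch_run_sketch(["pose_option=pose_value"], {}))
```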

parse_pose_opt_cols(poses, output_dir, pose_opt_cols=None)[source]

Parses pose-specific options columns into pose options formatted strings.

This method processes the pose_opt_cols dictionary and converts its contents into a format that can be used as part of the LigandMPNN pose options. It ensures that the options are properly structured and, if necessary, writes specific arguments into JSON files.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • output_dir (str) – The directory where JSON files for multi-residue options will be saved.

  • pose_opt_cols (dict, optional) – Dictionary of pose-specific options for the LigandMPNN script. Defaults to None.

Returns:

A list of dictionaries containing the parsed pose options formatted as strings.

Return type:

list[dict]

Raises:

ValueError – If both fixed_residues and redesigned_residues are defined in pose_opt_cols, or if specified columns do not exist in poses.df.

Examples

Here is an example of how to use the parse_pose_opt_cols method:

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Example Poses object and pose_opt_cols
poses = Poses()
pose_opt_cols = {
    "bias_AA_per_residue": "bias_col",
    "fixed_residues": "fixed_res_col"
}

# Parse pose options
parsed_opts = ligandmpnn.parse_pose_opt_cols(
    poses=poses,
    output_dir="/path/to/output",
    pose_opt_cols=pose_opt_cols
)

print(parsed_opts)  # Outputs the parsed pose options

Further Details:
  • Option Parsing: The method converts the pose_opt_cols dictionary into a list of strings formatted as pose options. It handles various types of options, including those that need to be written into JSON files and those that can be parsed directly from residue selections.

  • JSON Directory Setup: If necessary, the method sets up a directory for storing JSON files that contain mappings for multi-residue options.

  • Error Handling: The method includes checks to ensure that incompatible options are not specified simultaneously and that all specified columns exist in the poses DataFrame.
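
The two documented error conditions can be illustrated with a small stand-alone check. This is a sketch under stated assumptions: validate_pose_opt_cols is a hypothetical helper, and the DataFrame columns are passed as a plain list so the example runs without pandas or ProtFlow.

```python
def validate_pose_opt_cols(pose_opt_cols: dict, df_columns: list) -> None:
    """Raise ValueError for the conflicts described above (hypothetical helper)."""
    # fixed_residues and redesigned_residues are mutually exclusive.
    if "fixed_residues" in pose_opt_cols and "redesigned_residues" in pose_opt_cols:
        raise ValueError("Define either fixed_residues or redesigned_residues, not both.")

    # Every referenced column must exist in the poses DataFrame.
    missing = [col for col in pose_opt_cols.values() if col not in df_columns]
    if missing:
        raise ValueError(f"Columns not found in poses.df: {missing}")


columns = ["poses", "bias_col", "fixed_res_col"]

# Passes silently: no conflict and all columns exist.
validate_pose_opt_cols(
    {"bias_AA_per_residue": "bias_col", "fixed_residues": "fixed_res_col"},
    columns,
)

try:
    validate_pose_opt_cols({"fixed_residues": "not_a_column"}, columns)
except ValueError as err:
    print(err)
```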

run(poses, prefix, jobstarter=None, nseq=1, model_type=None, options=None, pose_options=None, fixed_res_col=None, design_res_col=None, pose_opt_cols=None, return_seq_threaded_pdbs_as_pose=False, preserve_original_output=False, overwrite=False)[source]

Execute the LigandMPNN process with given poses and jobstarter configuration.

This method sets up and runs the LigandMPNN process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • nseq (int, optional) – The number of sequences to generate for each input pose. Defaults to 1.

  • model_type (str, optional) – The type of model to use. Defaults to ‘ligand_mpnn’.

  • options (str, optional) – Additional options for the LigandMPNN script. Defaults to None.

  • pose_options (object, optional) – Pose-specific options for the LigandMPNN script. Defaults to None.

  • fixed_res_col (str, optional) – Column name in the poses DataFrame specifying fixed residues. Defaults to None.

  • design_res_col (str, optional) – Column name in the poses DataFrame specifying residues to be redesigned. Defaults to None.

  • pose_opt_cols (dict, optional) – Dictionary of pose-specific options for the LigandMPNN script. Defaults to None.

  • return_seq_threaded_pdbs_as_pose (bool, optional) – If True, return sequence-threaded PDBs as poses. Defaults to False.

  • preserve_original_output (bool, optional) – If True, preserve the original output files. Defaults to False.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

The updated Poses object containing the results of the LigandMPNN process.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.ligandmpnn import LigandMPNN

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Run LigandMPNN sequence design
results = ligandmpnn.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    nseq=10,
    model_type="ligand_mpnn",
    options="some_option=some_value",
    pose_options=["pose_option=pose_value"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the process to their specific needs.

This method is designed to streamline the execution of LigandMPNN processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein design simulations.

setup_batch_run(cmds, num_batches, output_dir)[source]

Concatenates commands for MPNN into batches so that MPNN does not have to be loaded individually for each PDB file.

This method prepares the LigandMPNN commands for batch execution. It concatenates the commands into batches to optimize the running process by reducing the overhead of loading the MPNN model multiple times.

Parameters:
  • cmds (list[str]) – A list of commands to run LigandMPNN.

  • num_batches (int) – The number of batches to split the commands into.

  • output_dir (str) – The directory where the batch input JSON files will be saved.

Returns:

A list of concatenated batch commands.

Return type:

list[str]

Examples

Here is an example of how to use the setup_batch_run method:

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Example commands
cmds = [
    "/path/to/python /path/to/run.py --option1=value1 --pdb_path=path1.pdb",
    "/path/to/python /path/to/run.py --option2=value2 --pdb_path=path2.pdb",
    # More commands...
]

# Setup batch run
batch_cmds = ligandmpnn.setup_batch_run(
    cmds=cmds,
    num_batches=2,
    output_dir="/path/to/output"
)

print(batch_cmds)  # Outputs the batch commands

Further Details:
  • Batch Command Setup: The method splits the provided commands into sublists based on the number of batches. It then processes each sublist to handle multi-residue options and generate corresponding JSON files.

  • JSON Directory: The method sets up a directory for storing JSON files that contain mappings for multi-residue options.

  • Command Concatenation: Each command sublist is processed to extract and convert multi-residue options into JSON files, which are then referenced in the batch commands.
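
The first step, splitting the command list into sublists, might look like the sketch below. The helper name and the round-robin distribution are assumptions made for illustration; ProtFlow may split the commands differently, and the JSON-writing step is not shown.

```python
def split_into_batches(cmds: list, num_batches: int) -> list:
    """Distribute commands round-robin over at most num_batches sublists."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    batches = [cmds[i::num_batches] for i in range(num_batches)]
    # Drop empty sublists when there are fewer commands than batches.
    return [batch for batch in batches if batch]


cmds = [f"run.py --pdb_path=path{i}.pdb" for i in range(1, 6)]
for batch in split_into_batches(cmds, num_batches=2):
    print(batch)
```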

write_cmd(pose_path, output_dir, model, nseq, options, pose_options)[source]

Writes the command to run ligandmpnn.py.

This method constructs the command necessary to run the LigandMPNN script, incorporating various options and parameters. It ensures that the command is correctly formatted and includes all required arguments.

Parameters:
  • pose_path (str) – The path to the input PDB file for the pose.

  • output_dir (str) – The directory where the output files will be saved.

  • model (str) – The type of model to use (e.g., “ligand_mpnn”).

  • nseq (int) – The number of sequences to generate for each input pose.

  • options (str) – Additional options for the LigandMPNN script.

  • pose_options (str) – Pose-specific options for the LigandMPNN script.

Returns:

The constructed command string to run LigandMPNN.

Return type:

str

Raises:

ValueError – If the specified model is not one of the available models.

Examples

Here is an example of how to use the write_cmd method:

# Initialize the LigandMPNN class
ligandmpnn = LigandMPNN()

# Write the command
cmd = ligandmpnn.write_cmd(
    pose_path="path/to/input.pdb",
    output_dir="path/to/output",
    model="ligand_mpnn",
    nseq=10,
    options="some_option=some_value",
    pose_options="pose_option=pose_value"
)

print(cmd)  # Outputs the constructed command string

Further Details:
  • Model Validation: The method checks if the specified model is among the available models and raises an error if it is not.

  • Option Parsing: The method parses generic options and pose-specific options, ensuring that necessary safety checks and defaults are applied.

  • Command Construction: The method assembles the final command string, including paths, model checkpoints, options, and other necessary parameters.
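
A rough sketch of the validation and assembly steps follows. It is not the actual write_cmd implementation: the function name, the set of model names, and the --model_type, --out_folder, and --number_of_batches flags are assumptions used for illustration. Only --pdb_path appears in the examples in this documentation.

```python
# Assumed subset of model names, for illustration only.
AVAILABLE_MODELS = {"protein_mpnn", "ligand_mpnn", "soluble_mpnn"}


def write_cmd_sketch(python_path, script_path, pose_path, output_dir,
                     model, nseq, options=""):
    """Validate the model and join all parts into one command string."""
    if model not in AVAILABLE_MODELS:
        raise ValueError(
            f"Unknown model {model!r}. Available: {sorted(AVAILABLE_MODELS)}"
        )
    parts = [
        python_path,
        script_path,
        f"--model_type {model}",         # assumed flag name
        f"--pdb_path {pose_path}",
        f"--out_folder {output_dir}",    # assumed flag name
        f"--number_of_batches {nseq}",   # assumed mapping of nseq
    ]
    if options:
        parts.append(options)
    return " ".join(parts)


print(write_cmd_sketch(
    "/path/to/python", "/path/to/run.py",
    "path/to/input.pdb", "path/to/output",
    model="ligand_mpnn", nseq=10,
))
```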

Parameters:
  • script_path (str | None)

  • python_path (str | None)

  • pre_cmd (str | None)

  • jobstarter (JobStarter)

protflow.tools.ligandmpnn.collect_scores(work_dir, return_seq_threaded_pdbs_as_pose, preserve_original_output=True, pack_sidechains=False)[source]

Collects scores from the LigandMPNN output.

This method processes the output files generated by LigandMPNN, including multi-sequence FASTA files and PDB files. It reads, renames, and organizes these files into a structured DataFrame.

Parameters:
  • work_dir (str) – The directory where LigandMPNN output files are located.

  • return_seq_threaded_pdbs_as_pose (bool) – If True, replaces FASTA files with sequence-threaded PDB files as poses.

  • preserve_original_output (bool, optional) – If True, preserves the original output files. Defaults to True.

  • pack_sidechains (bool)

Returns:

A DataFrame containing the collected scores and relevant data from the LigandMPNN output.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If required output files are not found in the specified directory.

Examples

Here is an example of how to use the collect_scores function:

from protflow.tools.ligandmpnn import collect_scores

# Collect scores from the output directory
scores = collect_scores(
    work_dir="/path/to/output",
    return_seq_threaded_pdbs_as_pose=True,
    preserve_original_output=False
)

print(scores)  # Outputs the collected scores DataFrame

Further Details:
  • Output Processing: The method reads and parses multi-sequence FASTA files, converts sequences into a structured dictionary, and writes new FASTA files if necessary.

  • File Management: Original output files are copied to dedicated directories, and new files are generated and organized for easy access. Optionally, original files can be preserved or deleted based on the preserve_original_output parameter.

  • Error Handling: The method includes checks to ensure that required output files are present, raising errors if files are missing or paths are incorrect.
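
The FASTA-parsing step can be pictured with a small stand-alone parser. This is only a sketch: parse_multi_fasta is a hypothetical helper, the header text is made up, and the real collect_scores additionally extracts scores, renames files, and builds a DataFrame.

```python
def parse_multi_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them under the last header.
            records[header] += line
    return records


example = """>design_0001
MKTAYIAKQR
QISFVKSHFS
>design_0002
MKTAYLAKQR
"""
for name, seq in parse_multi_fasta(example).items():
    print(name, seq)
```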

protflow.tools.ligandmpnn.create_distance_conservation_bias_cmds(poses, prefix, center, shell_distances=[10, 15, 20, 1000], shell_biases=[0, 0.25, 0.5, 1], center_atoms=None, noncenter_atoms=['CA'], jobstarter=None, overwrite=False)[source]

Creates distance-based conservation bias commands for LigandMPNN runs and saves them in a poses DataFrame column.

This function creates commands for conservation bias based on shells with a distance from a given ResidueSelection.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used as output folder and column name in the poses DataFrame to save the commands.

  • center (str or ResidueSelection) – The center of the shells. Can be either a single ResidueSelection or a poses DataFrame column containing ResidueSelections.

  • shell_distances (list, optional) – The shells for creating conservation bias. The numbers represent the distance from the center. Defaults to [10, 15, 20, 1000].

  • shell_biases (list, optional) – The strength of the bias for each shell. Defaults to [0, 0.25, 0.5, 1].

  • center_atoms (list, optional) – The atom names of the center ResidueSelection which should be used for shell distance calculations. None means all atoms are selected. Defaults to None.

  • noncenter_atoms (list, optional) – The atom names of noncenter residues which should be used for shell distance calculations. None means all atoms are selected. Defaults to [“CA”].

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

The updated Poses object containing the commands for conservation bias in a poses DataFrame column.

Return type:

Poses

Raises:

KeyError – If shell_distances are not sorted in ascending order.

Examples

Here is an example of how to use the create_distance_conservation_bias_cmds function:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.residues import ResidueSelection
from protflow.tools.ligandmpnn import create_distance_conservation_bias_cmds

# Create instances of necessary classes
poses = Poses(poses=".", glob_suffix="*.pdb")
jobstarter = LocalJobStarter()
central_selection = ResidueSelection("A23")

# Create the conservation bias commands
poses = create_distance_conservation_bias_cmds(
    poses=poses,
    prefix="conservation_bias_cmd",
    jobstarter=jobstarter,
    center=central_selection,
)

# Access the commands stored in the column named after the prefix
print(poses.df["conservation_bias_cmd"])

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the process to their specific needs.

This method is designed to streamline the creation of distance-based conservation bias commands for LigandMPNN within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein design simulations.
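
How shell_distances and shell_biases pair up can be sketched as a lookup. This is an interpretation, not ProtFlow's code: shell_bias is a hypothetical helper, and it assumes that a residue receives the bias of the first shell whose distance is at least its own distance from the center, with anything beyond the last shell taking the last bias.

```python
from bisect import bisect_left


def shell_bias(distance, shell_distances=(10, 15, 20, 1000),
               shell_biases=(0, 0.25, 0.5, 1)):
    """Return the bias of the innermost shell that contains the distance."""
    if list(shell_distances) != sorted(shell_distances):
        # Mirrors the documented KeyError for unsorted shell distances.
        raise KeyError("shell_distances must be sorted in ascending order")
    if len(shell_distances) != len(shell_biases):
        raise ValueError("shell_distances and shell_biases must have equal length")
    index = bisect_left(shell_distances, distance)
    # Distances beyond the outermost shell fall back to the last bias.
    index = min(index, len(shell_biases) - 1)
    return shell_biases[index]


for d in (8, 12, 17, 50):
    print(d, shell_bias(d))
```

With the default values, residues close to the center get no conservation bias and the bias grows with distance, which matches the intent of keeping distant positions near their original identity.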

protflow.tools.ligandmpnn.parse_residues(residues)[source]

Parses residues from either ResidueSelection object, list, or MPNN-formatted string into MPNN-formatted string.

This function converts the input residues into a format compatible with MPNN. It supports conversion from ResidueSelection objects, comma-separated strings, and lists of residues.

Parameters:

residues (object) – The input residues to be parsed. This can be a ResidueSelection object, a comma-separated string, or a list of residues.

Returns:

The residues formatted as a string compatible with MPNN.

Return type:

str

Raises:

ValueError – If the input type is not supported (i.e., not a str or ResidueSelection).

Examples

Here is an example of how to use the parse_residues function:

from protflow.residues import ResidueSelection
from protflow.tools.ligandmpnn import parse_residues

# Example ResidueSelection object
residues = ResidueSelection(["A:10", "A:20"])

# Parse residues
parsed_residues = parse_residues(residues)
print(parsed_residues)  # Outputs: "A:10 A:20"

# Example string input
residues_str = "A:10,A:20"
parsed_residues = parse_residues(residues_str)
print(parsed_residues)  # Outputs: "A:10 A:20"

Further Details:
  • ResidueSelection Object: The function calls the to_string method of the ResidueSelection object to get the MPNN-formatted string.

  • String Input: For comma-separated string inputs, the function splits the string by commas and joins the parts with spaces.
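
The string branch described above amounts to a split and a join. The helper name below is hypothetical, and stripping whitespace around each part is an added assumption.

```python
def comma_to_space_separated(residues: str) -> str:
    """Split on commas and join the non-empty parts with single spaces."""
    parts = (part.strip() for part in residues.split(","))
    return " ".join(part for part in parts if part)


print(comma_to_space_separated("A:10,A:20"))
```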

protflow.tools.ligandmpnn.write_to_json(input_dict, output_path)[source]

Writes the JSON-serializable input_dict to the file at output_path and returns the path to that file.

Parameters:
  • input_dict (dict)

  • output_path (str)

Return type:

str
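
A function with this behavior can be written with the standard json module. The sketch below is an equivalent stand-in under that assumption, not the ProtFlow source, and the dictionary contents are made up for the demonstration.

```python
import json
import os
import tempfile


def write_to_json_sketch(input_dict: dict, output_path: str) -> str:
    """Write input_dict as JSON to output_path and return output_path."""
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(input_dict, handle)
    return output_path


with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_to_json_sketch(
        {"pose_1.pdb": "A10 A20"},
        os.path.join(tmp_dir, "fixed_residues.json"),
    )
    with open(path, encoding="utf-8") as handle:
        print(json.load(handle))
```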

protflow.tools.protein_edits module

ProteinEdits Module

This module provides the functionality to handle various protein editing tasks within the ProtFlow framework. It offers tools to add and remove protein chains, add sequences to proteins, and multimerize sequences in a structured and automated manner.

Detailed Description

The protein_edits module contains classes and methods designed to perform common protein editing operations. The ChainAdder class provides methods for adding chains to protein structures, including functionality for superimposing chains based on motifs or existing chains. The ChainRemover class allows for the removal of specified chains from protein structures. Additionally, methods for adding sequences to proteins and creating multimers from sequences are included, streamlining the process of preparing protein structures for further analysis.

The module integrates seamlessly with the ProtFlow ecosystem, leveraging shared configurations, job management capabilities, and data structures to provide a cohesive user experience. It supports automatic setup and execution of jobs, handling of input and output files, and robust error handling and logging.

Usage

To use this module, create instances of the ChainAdder or ChainRemover classes and invoke their respective methods with appropriate parameters. The module handles the configuration, execution, and result collection processes, allowing users to focus on interpreting the results.

Examples

Here is an example of how to initialize and use the ChainAdder and ChainRemover classes within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainAdder, ChainRemover

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainAdder class
chain_adder = ChainAdder(jobstarter=jobstarter)

# Add a chain to the poses
added_chains = chain_adder.add_chain(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_column",
    copy_chain="A",
    jobstarter=jobstarter,
    overwrite=True
)

# Initialize the ChainRemover class
chain_remover = ChainRemover(jobstarter=jobstarter)

# Remove a chain from the poses
removed_chains = chain_remover.remove_chains(
    poses=poses,
    prefix="experiment_2",
    chains=["A"],
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(added_chains)
print(removed_chains)

Further Details

  • Edge Cases: The module handles various edge cases, such as missing chain specifications and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the process.

  • Customizability: Users can customize the processes through multiple parameters, including the chain to add or remove, sequence details for adding sequences, and the number of protomers for multimerization.

  • Integration: The module integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate protein editing tasks into their computational workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

class protflow.tools.protein_edits.ChainAdder(python=None, jobstarter=None)[source]

Bases: Runner

ChainAdder Class

The ChainAdder class is a specialized class designed to facilitate the addition of chains to protein structures within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with chain addition processes.

Detailed Description

The ChainAdder class manages all aspects of adding chains to protein structures. It configures necessary scripts and executables, prepares the environment for the addition processes, and executes the required commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to chain addition scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of chain addition commands with support for superimposition on motifs or existing chains.

  • Collecting and processing output data into a structured format.

  • Providing methods for adding sequences to proteins and creating multimers from sequences.

Returns:

An instance of the ChainAdder class, configured to add chains to protein structures and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the ChainAdder class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainAdder

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainAdder class
chain_adder = ChainAdder(jobstarter=jobstarter)

# Add a chain to the poses
added_chains = chain_adder.add_chain(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_column",
    copy_chain="A",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(added_chains)

Further Details

  • Edge Cases: The class handles various edge cases, such as missing chain specifications and the need to overwrite previous results.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the chain addition process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ChainAdder class is intended for researchers and developers who need to add chains to protein structures as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(python=None, jobstarter=None)[source]

Initialize the ChainAdder class.

This method sets up the ChainAdder class by configuring the path to the default Python executable and initializing the job starter. The ChainAdder class is used to add chains to protein structures within the ProtFlow framework.

Parameters:
  • python (str, optional) – The path to the default Python executable, by default os.path.join(PROTFLOW_ENV, “python3”).

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job execution, by default None.

python

Path to the Python executable used for running scripts.

Type:

str

jobstarter

An instance of the JobStarter class to manage job execution.

Type:

JobStarter

Examples

Here is an example of how to initialize the ChainAdder class:

from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainAdder

# Initialize the ChainAdder class
jobstarter = JobStarter()
chain_adder = ChainAdder(jobstarter=jobstarter)

Notes

The ChainAdder class depends on the ProtFlow environment being properly configured. Ensure that the PROTFLOW_ENV and necessary scripts are correctly set up before using this class.

Raises:

FileNotFoundError – If the specified Python executable is not found.

add_chain(poses, prefix, ref_col, copy_chain, jobstarter=None, overwrite=False)[source]

Add a chain to the poses.

This method adds a specified chain to the protein structures in poses by using the superimpose_add_chain method without any superimposition, effectively copying the chain as-is.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • ref_col (str) – The column in the poses DataFrame that references the structures to be used.

  • copy_chain (str) – The chain identifier to copy.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job execution. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing outputs. Defaults to False.

Returns:

An updated Poses object with the new chain added.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the methods.

  • TypeError – If invalid argument types are provided to the methods.

Examples

Here is an example of how to initialize and use the add_chain method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainAdder

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainAdder class
chain_adder = ChainAdder(jobstarter=jobstarter)

# Add a chain to the poses
added_chains = chain_adder.add_chain(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_column",
    copy_chain="A",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(added_chains)

Further Details
  • Method Simplicity: This method uses superimpose_add_chain without specifying any superimposition parameters, making it a straightforward way to add chains without the complexity of superimposition.

  • Path Configuration: Ensure the paths to the scripts and executables are correctly configured as per ProtFlow setup. Using default paths is recommended unless customization is necessary.

  • JobStarter Integration: The JobStarter object is used to manage job execution, ensuring processes are handled efficiently. The method cannot run without a JobStarter, so make sure one is available.

add_sequence(prefix, poses, seq=None, seq_col=None, sep=':')[source]

Add a sequence to the poses in .fa format.

This method appends a specified sequence to the protein sequences in the poses object. The sequence can be provided directly or specified through a column in the poses DataFrame. The updated sequences are saved in .fa format in a specified directory.

Parameters:
  • prefix (str) – A prefix used to name and organize the output files.

  • poses (Poses) – The Poses object containing the protein structures.

  • seq (str, optional) – The sequence to be added. If specified, seq_col must be None. Defaults to None.

  • seq_col (str, optional) – The column in the poses DataFrame that contains the sequences to be added. If specified, seq must be None. Defaults to None.

  • sep (str, optional) – The separator to be used between the original and new sequences. Defaults to “:”.

Raises:

ValueError – If poses are not in .fa or .fasta format, if both seq and seq_col are specified, or if neither seq nor seq_col is specified.

Return type:

None

Examples

Here is an example of how to use the add_sequence method:

from protflow.poses import Poses
from protflow.tools.protein_edits import ChainAdder

# Create instances of necessary classes
poses = Poses()

# Initialize the ChainAdder class
chain_adder = ChainAdder()

# Add a sequence to the poses
chain_adder.add_sequence(
    prefix="experiment_1",
    poses=poses,
    seq="ATCGATCGATCG",
    sep=":"
)
Further Details
  • File Format: The method checks that all poses are in .fa or .fasta format and raises an error if not.

  • Sequence Input: Either seq or seq_col must be specified to provide the sequence to be added. The method ensures that both are not specified simultaneously.

  • Output Directory: The method creates an output directory if it does not exist and saves the updated sequences in this directory.

  • DataFrame Update: The poses DataFrame is updated to reflect the new locations of the modified sequences.
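
The argument check and the append step described above can be sketched in plain Python. This is a minimal illustration and not ProtFlow's implementation: the helper name append_sequence is hypothetical, it works on a FASTA string instead of files, and it assumes each record stores its sequence on a single line.

```python
def append_sequence(fasta_text, seq=None, seq_col_value=None, sep=":"):
    """Append a sequence to every record of a FASTA string (illustrative only)."""
    # Exactly one of the two sequence sources must be given.
    if seq is not None and seq_col_value is not None:
        raise ValueError("Specify either seq or seq_col, not both.")
    if seq is None and seq_col_value is None:
        raise ValueError("Specify one of seq or seq_col.")
    new_seq = seq if seq is not None else seq_col_value

    out_lines = []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            # Header lines are kept unchanged.
            out_lines.append(line)
        else:
            # Original sequence + separator + added sequence.
            out_lines.append(f"{line}{sep}{new_seq}")
    return "\n".join(out_lines) + "\n"


original = ">pose_0001\nMKTAYIAKQR\n"
print(append_sequence(original, seq="GSHMLEDPVA"))
```

With the default separator, the two sequences end up on one line joined by ":", which is the convention many structure predictors use to mark chain breaks.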

multimerize(prefix, poses, n_protomers, sep=':')[source]

Create multimers from the sequences in .fa files.

This method takes .fa files from the poses object and creates multimers by repeating the sequence a specified number of times. The updated sequences are saved in .fa format in a specified directory.

Parameters:
  • prefix (str) – A prefix used to name and organize the output files.

  • poses (Poses) – The Poses object containing the protein structures.

  • n_protomers (int) – The number of protomers in the final .fa file.

  • sep (str, optional) – The separator to be used between the original and new sequences. Defaults to “:”.

Raises:

ValueError – If poses are not in .fa or .fasta format.

Return type:

None

Examples

Here is an example of how to use the multimerize method:

from protflow.poses import Poses
from protflow.tools.protein_edits import ChainAdder

# Create instances of necessary classes
poses = Poses()

# Initialize the ChainAdder class
chain_adder = ChainAdder()

# Multimerize the sequences in the poses
chain_adder.multimerize(
    prefix="experiment_1",
    poses=poses,
    n_protomers=3,
    sep=":"
)
Further Details
  • File Format: The method checks that all poses are in .fa or .fasta format and raises an error if not.

  • Protomers Specification: The n_protomers parameter specifies the number of times the sequence should be repeated to form a multimer.

  • Output Directory: The method creates an output directory if it does not exist and saves the updated sequences in this directory.

  • DataFrame Update: The poses DataFrame is updated to reflect the new locations of the modified sequences.
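
The repetition step itself is small enough to show directly. The sketch below assumes a hypothetical helper multimerize_sequence that operates on a bare sequence string; the lower bound on n_protomers is an assumption added for the example and is not documented behavior.

```python
def multimerize_sequence(sequence, n_protomers, sep=":"):
    """Repeat a sequence n_protomers times, joined by sep (illustrative only)."""
    # Assumption for this sketch: at least one protomer is required.
    if n_protomers < 1:
        raise ValueError("n_protomers must be at least 1.")
    return sep.join([sequence] * n_protomers)


# A trimer of a ten-residue sequence.
print(multimerize_sequence("MKTAYIAKQR", n_protomers=3))
```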

parse_motif(motif, pose)[source]

Set up motif from target_motif input.

This method converts a given motif, either a ResidueSelection object or a string, into a string format suitable for further processing. If the motif is a string, it checks if it is a column in the pose DataFrame and assumes it points to a ResidueSelection object.

Parameters:
  • motif (ResidueSelection | str) – The motif to be parsed. It can be either a ResidueSelection object or a string.

  • pose (pd.Series) – A row from the poses DataFrame that contains information about the protein structure.

Returns:

The motif in string format.

Return type:

str

Raises:
  • ValueError – If the motif is a string but not a column in the poses.df DataFrame.

  • TypeError – If the motif is neither a ResidueSelection object nor a string.

Examples

Here is an example of how to use the parse_motif method:

from protflow.residues import ResidueSelection
from protflow.tools.protein_edits import ChainAdder
import pandas as pd

# Initialize the ChainAdder class
chain_adder = ChainAdder()

# Example pose DataFrame row
pose = pd.Series({'motif_column': ResidueSelection(...)})

# Parse a ResidueSelection object
motif = ResidueSelection(...)
motif_str = chain_adder.parse_motif(motif, pose)

# Parse a string that is a column in the pose DataFrame
motif_str = chain_adder.parse_motif('motif_column', pose)

# Access the result
print(motif_str)
Further Details
  • ResidueSelection Handling: The method directly converts a ResidueSelection object to its string representation using its to_string method.

  • String Handling: If a string is provided, the method checks if it is a column in the pose DataFrame that points to a ResidueSelection object, converting it to a string.

  • Error Handling: The method raises appropriate errors if the input is not of the expected type or if the string does not correspond to a valid column in the DataFrame.
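
The type dispatch described above could look roughly like the following. Everything here is a stand-in: FakeResidueSelection is a hypothetical replacement for ResidueSelection (only the to_string method is taken from the description, and its output format is invented), and a plain dict plays the role of the pd.Series row.

```python
class FakeResidueSelection:
    """Hypothetical stand-in for protflow.residues.ResidueSelection."""

    def __init__(self, residues):
        self.residues = list(residues)

    def to_string(self):
        # The real string format may differ; this one is made up for the sketch.
        return ",".join(self.residues)


def parse_motif_sketch(motif, pose):
    """Return the motif as a string, resolving column names through the pose row."""
    if isinstance(motif, FakeResidueSelection):
        # A selection object is converted directly.
        return motif.to_string()
    if isinstance(motif, str):
        # A string must name a column that holds a selection object.
        if motif not in pose:
            raise ValueError(f"{motif!r} is not a column in the poses DataFrame.")
        return pose[motif].to_string()
    raise TypeError(f"Unsupported motif type: {type(motif).__name__}")


selection = FakeResidueSelection(["A10", "A11", "A12"])
pose_row = {"motif_column": selection}
print(parse_motif_sketch(selection, pose_row))
print(parse_motif_sketch("motif_column", pose_row))
```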

run(poses, prefix, jobstarter)[source]

run() is not implemented for the ChainAdder class. Use add_chain() or superimpose_add_chain() instead.

superimpose_add_chain(poses, prefix, ref_col, copy_chain, jobstarter=None, target_motif=None, reference_motif=None, target_chains=None, reference_chains=None, translate_x=None, overwrite=False)[source]

Add a protein chain after superimposition on a motif or chain.

This method adds a chain to the protein structures in poses by superimposing it on a specified motif or chain. It sets up and executes the necessary scripts, handles the environment configuration, and processes the output.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • ref_col (str) – The column in the poses DataFrame that references the structures to be used.

  • copy_chain (str) – The chain identifier to copy.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job execution. Defaults to None.

  • target_motif (ResidueSelection, optional) – The target motif for superimposition. Defaults to None.

  • reference_motif (ResidueSelection, optional) – The reference motif for superimposition. Defaults to None.

  • target_chains (list, optional) – A list of target chains for superimposition. Defaults to None.

  • reference_chains (list, optional) – A list of reference chains for superimposition. Defaults to None.

  • translate_x (float, optional) – Distance in Angstrom by which to translate the copied chain along the x-axis. This can be used, for example, to set up multi-state design with LigandMPNN. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing outputs. Defaults to False.

Returns:

An updated Poses object with the new chain added.

Return type:

Poses

Raises:
  • ValueError – If both motifs and chains are specified for superimposition.

  • FileNotFoundError – If required files or directories are not found during the execution process.

  • TypeError – If invalid argument types are provided to the methods.

Examples

Here is an example of how to initialize and use the superimpose_add_chain method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainAdder

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainAdder class
chain_adder = ChainAdder(jobstarter=jobstarter)

# Add a chain to the poses
added_chains = chain_adder.superimpose_add_chain(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_column",
    copy_chain="A",
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(added_chains)
Further Details
  • Path Configuration: Ensure the paths to the scripts and executables are correctly configured as per ProtFlow setup. Using default paths is recommended unless customization is necessary.

  • JobStarter Integration: The JobStarter object manages job execution. The method cannot run without one, so make sure a JobStarter is available, either passed to this call or supplied when constructing the ChainAdder.

Notes

This method ensures robust error handling and logging for easier debugging and verification of the process.
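
The rule that motifs and chains cannot both be used for superimposition can be expressed as a small argument check. The sketch below is an assumption about how such a check might be written; the function name and the returned mode labels are hypothetical and are not part of ProtFlow.

```python
def check_superimposition_args(target_motif=None, reference_motif=None,
                               target_chains=None, reference_chains=None):
    """Decide which superimposition mode the arguments request (illustrative only)."""
    motifs_given = target_motif is not None or reference_motif is not None
    chains_given = target_chains is not None or reference_chains is not None
    # Motif-based and chain-based superimposition are mutually exclusive.
    if motifs_given and chains_given:
        raise ValueError("Specify either motifs or chains for superimposition, not both.")
    if motifs_given:
        return "motif"
    if chains_given:
        return "chains"
    # Nothing specified: the chain is copied without superimposition.
    return "none"


print(check_superimposition_args(target_chains=["A"], reference_chains=["B"]))
print(check_superimposition_args())
```

The final branch corresponds to how add_chain is described: it calls superimpose_add_chain without any superimposition parameters.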

class protflow.tools.protein_edits.ChainRemover(python=None, jobstarter=None)[source]

Bases: Runner

ChainRemover Class

The ChainRemover class is a specialized class designed to facilitate the removal of chains from protein structures within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with chain removal processes.

Detailed Description

The ChainRemover class manages all aspects of removing chains from protein structures. It configures necessary scripts and executables, prepares the environment for the removal processes, and executes the required commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to chain removal scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of chain removal commands with support for batch processing.

  • Collecting and processing output data into a structured format.

Returns:

An instance of the ChainRemover class, configured to remove chains from protein structures and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

Examples

Here is an example of how to initialize and use the ChainRemover class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainRemover

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainRemover class
chain_remover = ChainRemover(jobstarter=jobstarter)

# Remove a chain from the poses
removed_chains = chain_remover.run(
    poses=poses,
    prefix="experiment_2",
    chains=["A"],
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(removed_chains)

Further Details

  • Edge Cases: The class handles various edge cases, such as missing chain specifications and the need to overwrite previous results.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the chain removal process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ChainRemover class is intended for researchers and developers who need to remove chains from protein structures as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(python=None, jobstarter=None)[source]

Initialize the ChainRemover class.

This method sets up the ChainRemover class by configuring the path to the default Python executable and initializing the job starter. The ChainRemover class is used to remove chains from protein structures within the ProtFlow framework.

Parameters:
  • python (str, optional) – The path to the default Python executable. Defaults to PROTFLOW_ENV.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job execution. Defaults to None.

python

Path to the Python executable used for running scripts.

Type:

str

jobstarter

An instance of the JobStarter class to manage job execution.

Type:

JobStarter

Examples

Here is an example of how to initialize the ChainRemover class:

from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainRemover

# Initialize the ChainRemover class
jobstarter = JobStarter()
chain_remover = ChainRemover(jobstarter=jobstarter)

Notes

The ChainRemover class depends on the ProtFlow environment being properly configured. Ensure that the PROTFLOW_ENV and necessary scripts are correctly set up before using this class.

Raises:

FileNotFoundError – If the specified Python executable is not found.

run(poses, prefix, jobstarter=None, chains=None, preserve_chains=None, overwrite=False)[source]

Remove chains from the poses.

This method removes specified chains from the protein structures in the poses object. It sets up and executes the necessary scripts, handles the environment configuration, and processes the output.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • chains (list, optional) – A list of chains to be removed. If specified, each chain in the list will be removed from the poses. Defaults to None.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class to manage job execution. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing outputs. Defaults to False.

  • preserve_chains (list)

Returns:

An updated Poses object with the specified chains removed.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the methods.

Examples

Here is an example of how to initialize the ChainRemover class and use its run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_edits import ChainRemover

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ChainRemover class
chain_remover = ChainRemover(jobstarter=jobstarter)

# Remove chains from the poses
removed_chains = chain_remover.run(
    poses=poses,
    prefix="experiment_2",
    chains=["A"],
    jobstarter=jobstarter,
    overwrite=True
)

# Access and process the results
print(removed_chains)
Further Details
  • Output Checking: The method checks if the output already exists and whether it should be overwritten, ensuring no redundant processing.

  • Chain Setup: Chains can be specified as a list, a column in the poses DataFrame, or as a single chain identifier for all poses.

  • Batch Processing: The method supports batch processing, splitting the inputs into sublists to optimize resource usage during execution.

  • Path Configuration: Ensure the paths to the scripts and executables are correctly configured as per ProtFlow setup. Using default paths is recommended unless customization is necessary.

  • JobStarter Integration: The JobStarter object is used to manage job execution, ensuring processes are handled efficiently. If a JobStarter is not provided, the method will operate without it, but using one is recommended for better job management.
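
The batch-processing idea, splitting the inputs into sublists so each job handles several poses, can be sketched as follows. The helper split_into_sublists is hypothetical; the way ProtFlow actually partitions inputs is not shown in this documentation.

```python
def split_into_sublists(items, n_sublists):
    """Distribute items across at most n_sublists roughly equal sublists."""
    if n_sublists < 1:
        raise ValueError("n_sublists must be at least 1.")
    # Never create more sublists than there are items (but always at least one).
    n = min(n_sublists, len(items)) or 1
    size, remainder = divmod(len(items), n)
    sublists, start = [], 0
    for i in range(n):
        # The first `remainder` sublists each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        sublists.append(items[start:end])
        start = end
    return sublists


pose_paths = [f"pose_{i:04d}.pdb" for i in range(7)]
for batch in split_into_sublists(pose_paths, n_sublists=3):
    print(batch)
```

In a setup like this, the number of sublists would typically follow the number of cores or jobs the JobStarter can run in parallel.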

class protflow.tools.protein_edits.SequenceAdder(sequence=None, sequence_col=None, python=None, jobstarter=None)[source]

Bases: Runner

ProtFlow Runner that adds sequences to .fasta files. This is useful, for example, when predicting complexes.

__init__(sequence=None, sequence_col=None, python=None, jobstarter=None)[source]

Parameters:
  • sequence (str, optional) – The sequence that should be added.

  • sequence_col (str, optional) – The column in poses.df that contains the sequences to be added. sequence and sequence_col are mutually exclusive.

run(poses, prefix, jobstarter=None, sequence=None, sequence_col=None, insert_idx=-1, overwrite=False)[source]

Parameters:
  • sequence (str, optional) – The sequence that should be added.

  • sequence_col (str, optional) – The column in poses.df that contains the sequences to be added. sequence and sequence_col are mutually exclusive.

Return type:

Poses

class protflow.tools.protein_edits.SequenceRemover(chains=None, sep=None, python=None, jobstarter=None)[source]

Bases: Runner

Runner class that removes sequences from .fasta files.

__init__(chains=None, sep=None, python=None, jobstarter=None)[source]

Parameters:
  • chains (list, optional) – List of chain indices to remove.

run(poses, prefix, jobstarter=None, chains=None, sep=None, overwrite=False)[source]

Parameters:
  • chains (list | str, optional) – Either a list of chain indices to drop, or the name of a column in poses.df that contains this list for every pose.

Return type:

Poses

protflow.tools.protein_edits.parse_chain(chain, pose)[source]

Sets up chain for add_chains_batch.py

Parameters:

pose (Series)

Return type:

str

protflow.tools.protein_edits.setup_chain_list(chain_arg, poses)[source]

Set up chains for add_chains_batch.py.

This function configures the list of chains to be used in the add_chains_batch.py script based on the provided chain_arg. It supports specifying a single chain, a column in the poses DataFrame, or a list of chains.

Parameters:
  • chain_arg (str or list[str]) – The chain specification. It can be a single chain identifier (e.g., ‘A’), the name of a column in the poses DataFrame where the chains are listed, or a list of chain identifiers.

  • poses (Poses) – The Poses object containing the protein structures.

Returns:

A list of chain identifiers to be used in add_chains_batch.py.

Return type:

list[str]

Raises:

ValueError – If the chain_arg value is inappropriate, such as when the specified column does not exist in the DataFrame or the length of the list does not match the number of poses.

Examples

Here is an example of how to use the setup_chain_list function:

from protflow.poses import Poses
from protflow.tools.protein_edits import setup_chain_list

# Create instances of necessary classes
poses = Poses()

# Set up a single chain
chain_list = setup_chain_list('A', poses)
print(chain_list)

# Set up chains from a column in the poses DataFrame
chain_list = setup_chain_list('chain_col', poses)
print(chain_list)

# Set up chains from a list
chain_list = setup_chain_list(['A', 'B', 'C'], poses)
print(chain_list)

Further Details

  • Single Chain Identifier: If a single chain identifier (e.g., ‘A’) is provided, it is used for all poses.

  • DataFrame Column: If the name of a column in the poses DataFrame is provided, the function extracts the chain identifiers from that column for each pose.

  • List of Chains: If a list of chain identifiers is provided, it must match the length of the poses DataFrame. The function raises an error if this condition is not met.
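
The three input cases can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not the library's code: a dict of lists stands in for poses.df, the function name is hypothetical, and the way a chain identifier is told apart from a column name (column lookup first, then a single-character check) is a guess made for the example.

```python
def setup_chain_list_sketch(chain_arg, columns, n_poses):
    """Resolve chain_arg into one chain identifier per pose (illustrative only).

    columns: dict mapping column name -> list of per-pose values (stand-in for poses.df).
    """
    if isinstance(chain_arg, list):
        # A list must provide exactly one entry per pose.
        if len(chain_arg) != n_poses:
            raise ValueError("Length of chain list does not match the number of poses.")
        return list(chain_arg)
    if isinstance(chain_arg, str):
        if chain_arg in columns:
            # A column name: take the per-pose values from that column.
            return list(columns[chain_arg])
        if len(chain_arg) == 1:
            # A single chain identifier is reused for every pose.
            return [chain_arg] * n_poses
        raise ValueError(f"{chain_arg!r} is neither a column nor a chain identifier.")
    raise ValueError(f"Unsupported chain_arg type: {type(chain_arg).__name__}")


df_columns = {"chain_col": ["A", "B", "A"]}
print(setup_chain_list_sketch("A", df_columns, n_poses=3))
print(setup_chain_list_sketch("chain_col", df_columns, n_poses=3))
print(setup_chain_list_sketch(["A", "B", "C"], df_columns, n_poses=3))
```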

protflow.tools.protein_generator module

ProteinGenerator Module

This module provides the functionality to integrate the protein generation process within the ProtFlow framework. It offers tools to run the protein_generator script, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The ProteinGenerator class encapsulates the functionality necessary to run protein_generator as described in the publication. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of protein generation processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of protein generation into larger computational workflows. It supports the automatic setup of job parameters, execution of protein generator commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the ProteinGenerator class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the protein generation process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the ProteinGenerator class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_generator import ProteinGenerator

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator()

# Run the protein generation process
results = protein_generator.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="generation.num_proteins=10",
    pose_options=["generation.input_pdb='input.pdb'"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the protein generation process.

  • Customizability: Users can customize the protein generation process through multiple parameters, including the number of generated proteins, specific options for the protein generator script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate protein generation into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.protein_generator.ProteinGenerator(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

ProteinGenerator Class

The ProteinGenerator class is a specialized class designed to facilitate the execution of protein generation within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with protein generation processes.

Detailed Description

The ProteinGenerator class manages all aspects of running protein generation simulations. It handles the configuration of necessary scripts and executables, prepares the environment for protein generation processes, and executes the generation commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to protein generation scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of protein generation commands with support for multiple generation runs.

  • Collecting and processing output data into a pandas DataFrame.

  • Ensuring robust error handling and logging for easier debugging and verification of the generation process.

Returns:

An instance of the ProteinGenerator class, configured to run protein generation processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the ProteinGenerator class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_generator import ProteinGenerator

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator()

# Run the protein generation process
results = protein_generator.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="generation.num_proteins=10",
    pose_options=["generation.input_pdb='input.pdb'"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the protein generation process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ProteinGenerator class is intended for researchers and developers who need to perform protein generation as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Initialize the ProteinGenerator class with paths to the necessary scripts and Python executable.

This constructor sets up the ProteinGenerator class, configuring the paths to the protein generator script and the Python executable. It also sets the jobstarter and initializes essential attributes for the class.

Parameters:
  • script_path (str, optional) – The path to the protein generator script. Defaults to the value set in config.PROTEIN_GENERATOR_SCRIPT_PATH.

  • python_path (str, optional) – The path to the Python executable. Defaults to the value set in config.PROTEIN_GENERATOR_PYTHON_PATH.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • pre_cmd (str | None)

Raises:

ValueError – If no script_path is provided or set in the configuration.

Return type:

None

Examples

Here is an example of how to initialize the ProteinGenerator class:

from protflow.jobstarters import JobStarter
from protflow.tools.protein_generator import ProteinGenerator

# Initialize the JobStarter class
jobstarter = JobStarter()

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator(
    script_path="/path/to/protein_generator.py",
    python_path="/path/to/python",
    jobstarter=jobstarter
)

print(protein_generator)
Further Details:
  • Script Path: The path to the protein generator script is a critical configuration that needs to be set either through the parameter or the configuration file.

  • Python Path: The path to the Python executable is necessary for running the script and should be set accordingly.

  • JobStarter: If provided, the JobStarter instance is used to manage job execution, otherwise it can be set later.

collect_scores(scores_dir)[source]

Collect scores from the protein_generator output directory.

This method reads the output .pdb files generated by the protein generator and parses the corresponding .trb files into a pandas DataFrame. It consolidates the scores from multiple files into a single DataFrame for further analysis.

Parameters:

scores_dir (str) – The directory where the output files from the protein generator are stored.

Returns:

A DataFrame containing the collected scores from the protein generator output files.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If no .pdb files are found in the specified output directory, indicating a possible issue with the protein generator execution or an incorrect path.

Examples

Here is an example of how to use the collect_scores method:

from protflow.tools.protein_generator import ProteinGenerator

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator()

# Collect scores from the output directory
scores_df = protein_generator.collect_scores(scores_dir="output_directory")

print(scores_df)
Further Details:
  • File Reading: The method reads all .pdb files from the specified directory. If no .pdb files are found, a FileNotFoundError is raised.

  • Data Parsing: The method parses the corresponding .trb files for each .pdb file, converting them into pandas DataFrames and concatenating them into a single DataFrame.

  • Output Organization: The resulting DataFrame is organized and returned for further analysis, with all scores consolidated from the multiple output files.
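
The file-discovery step can be sketched with the standard library alone. The helper below is hypothetical and makes one assumption that this documentation does not state explicitly: each .trb file shares its file stem with the matching .pdb file. Parsing of the .trb contents is left out.

```python
from pathlib import Path
import tempfile


def find_pdb_trb_pairs(scores_dir):
    """Return (pdb_path, trb_path) pairs for all .pdb files in scores_dir."""
    pdb_files = sorted(Path(scores_dir).glob("*.pdb"))
    if not pdb_files:
        # Mirrors the documented behavior: no outputs means something went wrong upstream.
        raise FileNotFoundError(f"No .pdb files found in {scores_dir}")
    # Assumption: the .trb file has the same stem as its .pdb file.
    return [(pdb, pdb.with_suffix(".trb")) for pdb in pdb_files]


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("design_0001.pdb", "design_0001.trb", "design_0002.pdb", "design_0002.trb"):
        (Path(tmp) / name).touch()
    for pdb, trb in find_pdb_trb_pairs(tmp):
        print(pdb.name, "->", trb.name)
```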

parse_trbfile(trbfile)[source]

Read a protein_generator output .trb file and parse the scores into a pandas DataFrame.

This method reads the specified .trb file, extracts relevant data, and organizes it into a pandas DataFrame. The data includes scores and various attributes related to the protein generation process.

Parameters:

trbfile (str) – The file path to the .trb file generated by the protein generator.

Returns:

A DataFrame containing parsed scores and attributes from the .trb file.

Return type:

pd.DataFrame

Examples

Here is an example of how to use the parse_trbfile method:

from protflow.tools.protein_generator import ProteinGenerator

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator()

# Parse the .trb file
df = protein_generator.parse_trbfile(trbfile="output_directory/sample.trb")

print(df)
Further Details:
  • File Reading: The method uses numpy to load the .trb file, which is expected to be in a specific format.

  • Data Extraction: The method extracts various pieces of information from the .trb file, including description, location, lddt scores, sequence, and contigs.

  • Data Formatting: The extracted data is organized into a dictionary and converted into a pandas DataFrame for ease of use in further analysis.

run(poses, prefix, jobstarter, options=None, pose_options=None, overwrite=False)[source]

Execute protein_generator with given poses and jobstarter configuration.

This method sets up and runs protein_generator using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter) – An instance of the JobStarter class, which manages job execution.

  • options (str, optional) – Additional options for the protein generation script. Defaults to None.

  • pose_options (list[str], optional) – A list of pose-specific options for the protein generation script. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the protein generation process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If pose_options are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.protein_generator import ProteinGenerator

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator()

# Run the protein generation process
results = protein_generator.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="generation.num_proteins=10",
    pose_options=["generation.input_pdb='input.pdb'"],
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor protein_generator to their specific needs.

This method is designed to streamline the execution of protein generation processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein generation simulations.

write_cmd(pose_path, output_dir, options, pose_options)[source]

Write the command to run the protein_generator.py script with specified options and pose configurations.

This method constructs the command string necessary to execute the protein generator script, incorporating specified options and pose-specific parameters. The generated command can be used to run the protein generation process in a computational environment.

Parameters:
  • pose_path (str) – The file path to the input pose.

  • output_dir (str) – The directory where output files will be stored.

  • options (str) – Additional options for the protein generator script.

  • pose_options (str) – Specific options for the protein generator script related to the pose.

Returns:

The constructed command string to execute the protein generator script.

Return type:

str

Examples

Here is an example of how to use the write_cmd method:

from protflow.tools.protein_generator import ProteinGenerator

# Initialize the ProteinGenerator class
protein_generator = ProteinGenerator(
    script_path="/path/to/protein_generator.py",
    python_path="/path/to/python"
)

# Construct the command string
cmd = protein_generator.write_cmd(
    pose_path="input_poses/pose1.pdb",
    output_dir="output_directory",
    options="generation.num_proteins=10",
    pose_options="generation.input_pdb='input.pdb'"
)

print(cmd)
Further Details:
  • Command Construction: The method parses the input pose path to derive a description and combines it with provided options and flags to construct a command string.

  • Options Parsing: The options and pose_options parameters are parsed into a format compatible with the protein generator script.

  • Output Directory: The output directory is specified to ensure that generated files are stored in the appropriate location.
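The command construction described above can be sketched roughly as follows. This is a minimal illustration, not the ProtFlow implementation: the helper name build_command, the --out flag, and the way options are concatenated are all assumptions made for the example.

```python
import os


def build_command(python_path, script_path, pose_path, output_dir,
                  options="", pose_options=""):
    """Hypothetical helper that composes a shell command string."""
    # Derive a description (the file stem) from the input pose path.
    description = os.path.splitext(os.path.basename(pose_path))[0]
    # Output prefix inside the requested output directory.
    out_prefix = os.path.join(output_dir, description)
    # Concatenate global and pose-specific options, skipping empty ones.
    opts = " ".join(part for part in (options, pose_options) if part)
    # '--out' is an illustrative flag name, not a documented one.
    return f"{python_path} {script_path} --out {out_prefix} {opts}".strip()


cmd = build_command(
    python_path="/path/to/python",
    script_path="/path/to/protein_generator.py",
    pose_path="input_poses/pose1.pdb",
    output_dir="output_directory",
    options="generation.num_proteins=10",
)
print(cmd)
```

Keeping the description derived from the file stem means each pose writes to its own predictable output prefix.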

Parameters:
  • script_path (str | None)

  • python_path (str | None)

  • pre_cmd (str | None)

  • jobstarter (JobStarter)

protflow.tools.residue_selectors module

Module to select residues in a Poses class and add the resulting motif into Poses.df.

This module provides the functionality to select specific residues from protein structures represented as Poses objects. It includes various selector classes that allow for different criteria of residue selection, such as by chain or based on existing residue selections.

Classes:
  • ResidueSelector: Abstract base class for all residue selectors.

  • ChainSelector: Selects all residues of specified chains.

  • TrueSelector: Selects all residues of a pose.

  • NotSelector: Selects all residues except those specified by a residue selection.

Dependencies:
  • protflow.residues

  • protflow.poses

  • protflow.utils.biopython_tools

Examples:

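from protflow.poses import Poses
from protflow.residues import ResidueSelection
from protflow.tools.residue_selectors import ChainSelector, NotSelector, TrueSelector

# Example usage of ChainSelector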
poses = Poses()
chain_selector = ChainSelector(poses=poses, chain='A')
chain_selector.select(prefix='selected_chain_A')

# Example usage of TrueSelector
true_selector = TrueSelector(poses=poses)
true_selector.select(prefix='all_residues')

# Example usage of NotSelector
residue_selection = ResidueSelection(['A1', 'A2'])
not_selector = NotSelector(poses=poses, residue_selection=residue_selection)
not_selector.select(prefix='not_selected')

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.residue_selectors.ChainSelector(poses=None, chain=None, chains=None)[source]

Bases: ResidueSelector

Selects all residues of a given chain in Poses.

This class extends ResidueSelector to allow selection of residues based on specific chains from the protein structures contained in a Poses object.

Parameters:
poses

The Poses object containing the protein structures.

Type:

Poses

chains

A list of chain identifiers to select residues from.

Type:

list[str]

chain

A single chain identifier to select residues from.

Type:

str

__init__(poses=None, chain=None, chains=None)[source]

Initialize the ChainSelector with optional Poses object and chain(s).

Parameters:
  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • chain (str, optional) – A single chain identifier. Defaults to None.

  • chains (list[str], optional) – A list of chain identifiers. Defaults to None.

Raises:

ValueError – If both chain and chains are provided.

Examples

>>> poses = Poses()
>>> selector = ChainSelector(poses=poses, chain='A')
prep_chain_input(chain=None, chains=None)[source]

Prepares chain input for chain selection.

This method ensures that method parameters take precedence over class attributes. For example, ChainSelector.select(chain="A") takes priority over ChainSelector(chain="C").

Parameters:
  • chain (str, optional) – A single chain identifier. Defaults to None.

  • chains (list[str], optional) – A list of chain identifiers. Defaults to None.

Returns:

The list of chain identifiers to use for selection.

Return type:

list[str]

Raises:
  • ValueError – If both chain and chains are provided.

  • ValueError – If both self.chain and self.chains are set.

  • ValueError – If no chain identifiers are set.

Examples

>>> chains = selector.prep_chain_input(chain='A')
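The precedence rule and the three documented ValueError cases can be illustrated with a small standalone function. This is a sketch under assumptions: resolve_chains and its default_chain / default_chains arguments are hypothetical stand-ins for the instance attributes, not ProtFlow code.

```python
def resolve_chains(chain=None, chains=None, default_chain=None, default_chains=None):
    """Hypothetical stand-in for the precedence rule of prep_chain_input."""
    if chain and chains:
        raise ValueError("Provide either chain or chains, not both.")
    if default_chain and default_chains:
        raise ValueError("Both chain and chains are set on the instance.")
    # Method arguments take precedence over instance-level defaults.
    if chain:
        return [chain]
    if chains:
        return list(chains)
    if default_chain:
        return [default_chain]
    if default_chains:
        return list(default_chains)
    raise ValueError("No chain identifiers set.")


# The method argument "A" wins over the instance default "C".
print(resolve_chains(chain="A", default_chain="C"))
```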
select(prefix, poses=None, chain=None, chains=None)[source]

Selects all residues of a given chain for all poses in a Poses object.

Selected residues are added as ResidueSelection objects under the column prefix in Poses.df.

Parameters:
  • prefix (str) – The name of the column that will be added to Poses.df.

  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • chain (str, optional) – A single chain identifier. Defaults to None.

  • chains (list[str], optional) – A list of chain identifiers. Defaults to None.

Raises:
  • ValueError – If no poses are provided or set in the instance.

  • ValueError – If both chain and chains are provided.

  • KeyError – If specified chains are not present in the pose.

Return type:

None

Examples

>>> selector.select(prefix='selected_chain_A', chain='A')
select_single(pose_path, chains)[source]

Selects residues of a given chain of poses and returns them as a ResidueSelection object.

Parameters:
  • pose_path (str) – The file path to the pose structure.

  • chains (list[str]) – A list of chain identifiers.

Returns:

The selected residues of the pose.

Return type:

ResidueSelection

Raises:

KeyError – If specified chains are not present in the pose.

Examples

>>> selection = selector.select_single(pose_path='path/to/pose.pdb', chains=['A'])
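As a rough picture of what chain-based selection does for a single pose, the sketch below works on a plain dictionary instead of a parsed structure. The function select_chain_residues and the dictionary layout are assumptions for illustration; the real method reads the pose file and returns a ResidueSelection.

```python
def select_chain_residues(residues_by_chain, chains):
    """Hypothetical sketch: collect residue IDs for the requested chains.

    residues_by_chain maps a chain identifier to its residue numbers.
    """
    selected = []
    for chain in chains:
        if chain not in residues_by_chain:
            # Mirrors the documented KeyError for chains missing from the pose.
            raise KeyError(f"Chain {chain!r} is not present in the pose.")
        selected.extend(f"{chain}{number}" for number in residues_by_chain[chain])
    return selected


pose = {"A": [1, 2, 3], "B": [10, 11]}
print(select_chain_residues(pose, ["A"]))
```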
set_chain(chain=None)[source]

Sets a single chain for the select() method.

Parameters:

chain (str, optional) – A single chain identifier. Defaults to None.

Raises:

ValueError – If chain is not a single-character string.

Return type:

None

Examples

>>> selector.set_chain(chain='A')
set_chains(chains=None)[source]

Sets chains for the select() method.

Parameters:

chains (list[str], optional) – A list of chain identifiers to select residues from. Defaults to None.

Raises:

ValueError – If chains is not a list of single-character strings.

Return type:

None

Examples

>>> selector.set_chains(chains=['A', 'B'])
class protflow.tools.residue_selectors.DistanceSelector(center=None, distance=None, operator='<=', poses=None, center_atoms=None, noncenter_atoms=None, include_center=False)[source]

Bases: ResidueSelector

Selects all residues whose distance to a center residue satisfies a given threshold.

This class extends ResidueSelector to allow selection of residues based on distances to other residues from the protein structures contained in a Poses object.

Parameters:
poses

The Poses object containing the protein structures.

Type:

Poses

__init__(center=None, distance=None, operator='<=', poses=None, center_atoms=None, noncenter_atoms=None, include_center=False)[source]

Initialize the DistanceSelector with an optional Poses object, center, distance, operator, center_atoms, and noncenter_atoms.

Parameters:
  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • center ([ResidueSelection, str, list], optional) – A single ResidueSelection, the name of a poses DataFrame column containing ResidueSelections, or a list of ResidueSelections. Defaults to None.

  • distance (float, optional) – A float value indicating the distance for residue selection. Defaults to None.

  • operator (str, optional) – A string indicating the operator that should be used together with distance for residue selection. Defaults to '<='.

  • center_atoms ([list, str], optional) – A string containing a single atom name or a list of atom names from which distances should be calculated. Defaults to None.

  • noncenter_atoms ([list, str], optional) – A string containing a single atom name or a list of atom names to which distances should be calculated. Defaults to None.

  • include_center (bool, optional) – Include the center in the output residue selection. Defaults to False.

Raises:

ValueError – If center, distance, operator, center_atoms, or noncenter_atoms has an invalid value.

Examples

>>> poses = Poses()
>>> selector = DistanceSelector(poses=poses, center='residue_selection_col', distance=8.4)
extract_center(center, poses)[source]

Extracts centers from input.

Parameters:
  • center ([ResidueSelection, str, list]) – A single ResidueSelection, a list of ResidueSelections or the name of a dataframe column containing ResidueSelections.

  • poses (Poses) – A poses object.

Raises:
  • ValueError – If center is not a single ResidueSelection, a list of ResidueSelections or the name of a dataframe column containing ResidueSelections.

  • ValueError – If the length of the input ResidueSelections is different to the number of poses.

Return type:

list

Examples

>>> centers = selector.extract_center(center="residue_selection_col", poses=poses)
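The three accepted forms of center, together with the length check, can be sketched as below. This is an illustration under assumptions: resolve_centers is a hypothetical name, a plain dict stands in for Poses.df, tuples of residue IDs stand in for ResidueSelection objects, and broadcasting a single selection to every pose is assumed rather than confirmed.

```python
def resolve_centers(center, table, n_poses):
    """Hypothetical sketch: normalize center into one entry per pose."""
    if isinstance(center, str):
        # A string is treated as the name of a column holding selections.
        centers = list(table[center])
    elif isinstance(center, list):
        centers = center
    elif isinstance(center, tuple):
        # A single selection is reused for every pose (assumption).
        centers = [center] * n_poses
    else:
        raise ValueError("center must be a selection, a list of selections, or a column name.")
    if len(centers) != n_poses:
        raise ValueError("Number of center selections differs from number of poses.")
    return centers


table = {"residue_selection_col": [("A1", "A2"), ("B5",)]}
print(resolve_centers("residue_selection_col", table, n_poses=2))
```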
select(prefix, poses=None, center=None, distance=None, operator=None, center_atoms=None, noncenter_atoms=None, include_center=False)[source]

Selects all residues with a certain distance from center for all poses in a Poses object.

Selected residues are added as ResidueSelection objects under the column prefix in Poses.df.

Parameters:
  • prefix (str) – The name of the column that will be added to Poses.df.

  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • center ([ResidueSelection, str, list], optional) – A single ResidueSelection, the name of a poses DataFrame column containing ResidueSelections, or a list of ResidueSelections. Defaults to None.

  • distance (float, optional) – A float value indicating the distance for residue selection. Defaults to None.

  • operator (str, optional) – A string indicating the operator that should be used together with distance for residue selection. Defaults to None.

  • center_atoms ([list, str], optional) – A string containing a single atom name or a list of atom names from which distances should be calculated. Defaults to None.

  • noncenter_atoms ([list, str], optional) – A string containing a single atom name or a list of atom names to which distances should be calculated. Defaults to None.

  • include_center (bool, optional) – Include the center in the output residue selection. Defaults to False.

Raises:
  • ValueError – If no poses are provided or set in the instance.

  • ValueError – If no distance is provided or set in the instance.

  • ValueError – If no operator is provided or set in the instance.

  • ValueError – If no center is provided or set in the instance.

  • ValueError – If center_atoms or noncenter_atoms is not a string or a list of strings.

Return type:

None

Examples

>>> selector.select(prefix='near_center', center='residue_selection_col', distance=8.4, operator='<=')
select_single(pose_path, center, distance, operator, center_atoms=None, noncenter_atoms=None, include_center=False)[source]

Selects the residues of a single pose whose distance to center satisfies the given operator and distance, and returns them as a ResidueSelection object.

Parameters:
  • pose_path (str) – The file path to the pose structure.

  • center (ResidueSelection) – A single ResidueSelection indicating the residues from which distances should be calculated.

  • distance (float) – A single float value indicating the distance for residue selection.

  • operator (str) – A single string indicating the operator for residue selection.

  • center_atoms (list, optional) – A list of atom names for center residues which should be considered for distance calculation.

  • noncenter_atoms (list, optional) – A list of atom names for noncenter residues which should be considered for distance calculation.

  • include_center (bool, optional) – Include the center in the output residue selection. Defaults to False.

Returns:

The selected residues of the pose.

Return type:

ResidueSelection

Raises:

KeyError – If specified center ResidueSelection is not found in the pose.

Examples

>>> selection = selector.select_single(pose_path='path/to/pose.pdb', center=center, distance=8.4, operator='<=')
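To make the interplay of distance, operator, and include_center concrete, here is a self-contained sketch on toy coordinates. It is not the ProtFlow implementation: select_by_distance is a hypothetical function, the input is a plain dict of atom coordinates, and using the minimum atom-to-atom distance per residue is an assumption.

```python
import math
import operator as op

# The four operators documented for set_operator().
OPERATORS = {"<": op.lt, ">": op.gt, "<=": op.le, ">=": op.ge}


def select_by_distance(atoms, center, distance, operator="<=", include_center=False):
    """Hypothetical sketch of distance-based residue selection.

    atoms maps residue IDs to lists of (x, y, z) coordinates;
    center is a set of residue IDs acting as the reference.
    """
    if operator not in OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    compare = OPERATORS[operator]
    # Raises KeyError if a center residue is missing from the pose.
    center_coords = [xyz for res in center for xyz in atoms[res]]
    selected = []
    for res, coords in atoms.items():
        if res in center:
            if include_center:
                selected.append(res)
            continue
        # Closest approach between this residue and any center atom (assumption).
        closest = min(math.dist(a, b) for a in coords for b in center_coords)
        if compare(closest, distance):
            selected.append(res)
    return selected


toy = {"A1": [(0.0, 0.0, 0.0)], "A2": [(3.0, 0.0, 0.0)], "A3": [(10.0, 0.0, 0.0)]}
print(select_by_distance(toy, {"A1"}, distance=5.0))
```

Restricting the coordinate lists to specific atom names before calling the function would play the role of center_atoms and noncenter_atoms.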
set_center_atoms(center_atoms=None)[source]

Sets center_atoms for the select() method.

Parameters:

center_atoms ([str, list]) – A single atom name or a list of atom names. Default is None.

Raises:

ValueError – If center_atoms is not a single atom name or a list of atom names.

Return type:

None

Examples

>>> selector.set_center_atoms(center_atoms=["CA"])
set_centers(center=None)[source]

Sets centers for the select() method.

Parameters:

center ([ResidueSelection, str, list]) – A single ResidueSelection, a list of ResidueSelections or the name of a dataframe column containing ResidueSelections.

Raises:

ValueError – If center is not a single ResidueSelection, a list of ResidueSelections or the name of a dataframe column containing ResidueSelections.

Return type:

None

Examples

>>> selector.set_centers(center="residue_selection_col")
set_distance(distance=None)[source]

Sets distance for the select() method.

Parameters:

distance (float, optional) – A float value. Defaults to None.

Raises:

ValueError – If distance is not a single float value.

Return type:

None

Examples

>>> selector.set_distance(distance=8.4)
set_include_center(include_center=False)[source]

Sets include_center for the select() method.

Parameters:

include_center (bool, optional) – Whether to include the center residues in the output residue selection. Defaults to False.

Raises:

ValueError – If include_center is not a boolean.

Return type:

None

Examples

>>> selector.set_include_center(include_center=True)
set_noncenter_atoms(noncenter_atoms=None)[source]

Sets noncenter_atoms for the select() method.

Parameters:

noncenter_atoms ([str, list]) – A single atom name or a list of atom names. Default is None.

Raises:

ValueError – If noncenter_atoms is not a single atom name or a list of atom names.

Return type:

None

Examples

>>> selector.set_noncenter_atoms(noncenter_atoms=["CA", "CB"])
set_operator(operator)[source]

Sets the operator for the select() method.

Parameters:

operator (str) – An operator. Must be one of '<', '>', '<=' or '>='.

Raises:

ValueError – If operator is not one of '<', '>', '<=' or '>='.

Return type:

None

Examples

>>> selector.set_operator(operator="<")
class protflow.tools.residue_selectors.NotSelector(poses=None, residue_selection=None, contig=None)[source]

Bases: ResidueSelector

ResidueSelector that selects all residues except the ones specified by a residue selection.

This class extends ResidueSelector to exclude specified residues from selection. The excluded residues can be provided either as a ResidueSelection object or as a contig string.

__init__(poses=None, residue_selection=None, contig=None)[source]

Initialize the NotSelector with optional Poses object and exclusion criteria.

Parameters:
  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • residue_selection (ResidueSelection|str, optional) – The residues to be excluded from selection. Can be a ResidueSelection object or a string. Defaults to None.

  • contig (str, optional) – A string specifying the residues to be excluded in a contig format. Defaults to None.

Raises:

ValueError – If both residue_selection and contig are provided.

Examples

>>> poses = Poses()
>>> residue_selection = ResidueSelection(['A1', 'A2'])
>>> selector = NotSelector(poses=poses, residue_selection=residue_selection)
prep_residue_selection(residue_selection, poses)[source]

Prepares the residue_selection parameter for the select() function.

Parameters:
  • residue_selection (ResidueSelection|str) – The residues to be excluded from selection. Can be a ResidueSelection object or a string.

  • poses (Poses) – The Poses object containing the protein structures.

Returns:

A list of ResidueSelection objects for each pose.

Return type:

list[ResidueSelection]

Raises:

TypeError – If the residue_selection parameter is not of a supported type.

Examples

>>> residue_selection_list = selector.prep_residue_selection(residue_selection='selected_residues', poses=poses)
select(prefix, poses=None, residue_selection=None, contig=None)[source]

Selects all residues except the ones specified in residue_selection or by contig.

The parameter residue_selection can be either a ResidueSelection object or a string pointing to a column in the poses.df that contains ResidueSelection objects.

Parameters:
  • prefix (str) – The name of the column that will be added to Poses.df.

  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

  • residue_selection (ResidueSelection|str, optional) – The residues to be excluded from selection. Defaults to None.

  • contig (str, optional) – A string specifying the residues to be excluded in a contig format. Defaults to None.

Raises:
  • ValueError – If no poses are provided or set in the instance.

  • ValueError – If both residue_selection and contig are provided.

Return type:

None

Examples

>>> selector.select(prefix='not_selected', residue_selection='selected_residues')
select_single(pose_path, residue_selection)[source]

Selects all residues of a single pose except the ones specified in residue_selection.

Parameters:
  • pose_path (str) – The file path to the pose structure.

  • residue_selection (ResidueSelection) – The residues to be excluded from selection.

Returns:

The selected residues of the pose.

Return type:

ResidueSelection

Examples

>>> selection = selector.select_single(pose_path='path/to/pose.pdb', residue_selection=residue_selection)
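Conceptually, the inversion performed here is a set complement over the residues of one pose. The sketch below shows that idea on plain residue IDs; invert_selection is a hypothetical helper, and plain strings stand in for the residues of a ResidueSelection.

```python
def invert_selection(all_residues, excluded):
    """Hypothetical sketch: keep every residue that is not excluded."""
    excluded_set = set(excluded)
    # Preserve the original residue order of the pose.
    return [res for res in all_residues if res not in excluded_set]


pose_residues = ["A1", "A2", "A3", "B1"]
print(invert_selection(pose_residues, ["A1", "A2"]))
```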
set_contig(contig)[source]

Sets the contig attribute for the NotSelector class.

Parameters:

contig (str) – A string specifying the residues to be excluded in a contig format.

Raises:

ValueError – If contig is not a string.

Return type:

None

Examples

>>> selector.set_contig(contig='A1-7,A25-109,B45-50,C1,C3,C5')
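The contig string in the example above can be read as a comma-separated list of chain-prefixed residues or residue ranges. The parser below is a sketch of that reading only: parse_contig is a hypothetical function, and the format rules (single-letter chain, inclusive ranges) are inferred from the example string rather than taken from ProtFlow.

```python
import re


def parse_contig(contig):
    """Hypothetical sketch: expand a contig string into (chain, number) pairs."""
    residues = []
    for part in contig.split(","):
        match = re.fullmatch(r"([A-Za-z])(\d+)(?:-(\d+))?", part.strip())
        if match is None:
            raise ValueError(f"Cannot parse contig element: {part!r}")
        chain = match.group(1)
        start = int(match.group(2))
        # A missing range end means a single residue.
        end = int(match.group(3)) if match.group(3) is not None else start
        residues.extend((chain, number) for number in range(start, end + 1))
    return residues


print(parse_contig("A1-3,C5"))
```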
set_residue_selection(residue_selection=None)[source]

Sets the residue_selection attribute for the NotSelector class.

Parameters:

residue_selection (ResidueSelection, optional) – The residues to be excluded from selection. Defaults to None.

Return type:

None

Examples

>>> residue_selection = ResidueSelection(['A1', 'A2'])
>>> selector.set_residue_selection(residue_selection=residue_selection)
class protflow.tools.residue_selectors.ResidueSelector(poses=None)[source]

Bases: object

Abstract base class for ResidueSelectors.

All ResidueSelector classes must implement a select() method that selects residues from Poses.

Parameters:

poses (Poses)

poses

The Poses object containing the protein structures.

Type:

Poses

__init__(poses=None)[source]

Initializes the ResidueSelector with an optional Poses object.

Parameters:

poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

Examples

>>> poses = Poses()
>>> selector = ResidueSelector(poses=poses)

select(prefix)[source]

Abstract method to select residues in poses.

This method must be implemented by subclasses. The selected residues will be added as ResidueSelection objects under the column prefix in Poses.df.

Parameters:

prefix (str) – The name of the column that will be added to Poses.df. Poses.df[prefix] holds the selected Residues as a ResidueSelection object.

Raises:

NotImplementedError – If the method is not implemented by a subclass.

Return type:

None

Examples

>>> class MySelector(ResidueSelector):
...     def select(self, prefix):
...         # implementation here
...         pass
>>> selector = MySelector(poses=poses)
>>> selector.select(prefix='selected_residues')
select_single(*args)[source]

Abstract method to select residues for a single pose.

This method must be implemented by subclasses. It returns a ResidueSelection that contains the selected residues of the pose.

Returns:

The selected residues of the pose.

Return type:

ResidueSelection

Raises:

NotImplementedError – If the method is not implemented by a subclass.

Examples

>>> class MySelector(ResidueSelector):
...     def select_single(self, *args):
...         # implementation here
...         return ResidueSelection()
>>> selector = MySelector(poses=poses)
>>> selection = selector.select_single()
set_poses(poses=None)[source]

Sets the poses for the ResidueSelector class.

Parameters:

poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

Raises:

TypeError – If the poses parameter is not of type Poses.

Return type:

None

Examples

>>> poses = Poses()
>>> selector.set_poses(poses=poses)

class protflow.tools.residue_selectors.TrueSelector(poses=None)[source]

Bases: ResidueSelector

ResidueSelector that selects all residues of a pose.

This class extends ResidueSelector to select all residues from each pose in a Poses object. It adds the selected residues as ResidueSelection objects under a specified column in Poses.df.

Parameters:

poses (Poses)

__init__(poses=None)[source]

Initialize the TrueSelector with an optional Poses object.

Parameters:

poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

Examples

>>> poses = Poses()
>>> selector = TrueSelector(poses=poses)
select(prefix, poses=None)[source]

Selects all residues of a given pose for all poses in a Poses object.

Selected residues are added as ResidueSelection objects under the column prefix in Poses.df.

Parameters:
  • prefix (str) – The name of the column that will be added to Poses.df.

  • poses (Poses, optional) – The Poses object containing the protein structures. Defaults to None.

Raises:

ValueError – If no poses are provided or set in the instance.

Examples

>>> selector.select(prefix='all_residues')
select_single(pose_path)[source]

Selects all residues in a single pose and returns them as a ResidueSelection object.

Parameters:

pose_path (str) – The file path to the pose structure.

Returns:

The selected residues of the pose.

Return type:

ResidueSelection

Examples

>>> selection = selector.select_single(pose_path='path/to/pose.pdb')

protflow.tools.rfdiffusion module

RFdiffusion Module

This module provides the functionality to integrate RFdiffusion within the ProtFlow framework. It offers tools to run RFdiffusion, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The RFdiffusion class encapsulates the functionality necessary to execute RFdiffusion runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of diffusion processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of RFdiffusion into larger computational workflows. It supports the automatic setup of job parameters, execution of RFdiffusion commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the RFdiffusion class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the diffusion process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the RFdiffusion class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rfdiffusion import RFdiffusion

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the RFdiffusion class
rfdiffusion = RFdiffusion()

# Run the diffusion process
results = rfdiffusion.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_diffusions=3,
    options="inference.num_designs=10",
    pose_options=["inference.input_pdb='input.pdb'"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the diffusion process.

  • Customizability: Users can customize the diffusion process through multiple parameters, including the number of diffusions, specific options for the RFdiffusion script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate RFdiffusion into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.rfdiffusion.RFdiffusion(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Bases: Runner

RFdiffusion Class

The RFdiffusion class is a specialized class designed to facilitate the execution of RFdiffusion within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with RFdiffusion processes.

Detailed Description

The RFdiffusion class manages all aspects of running RFdiffusion simulations. It handles the configuration of necessary scripts and executables, prepares the environment for diffusion processes, and executes the diffusion commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to RFdiffusion scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of RFdiffusion commands with support for multiple diffusions.

  • Collecting and processing output data into a pandas DataFrame.

  • Updating motifs based on RFdiffusion outputs and remapping residue selections.

Returns:

An instance of the RFdiffusion class, configured to run RFdiffusion processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the RFdiffusion class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rfdiffusion import RFdiffusion

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the RFdiffusion class
rfdiffusion = RFdiffusion()

# Run the diffusion process
results = rfdiffusion.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_diffusions=3,
    options="inference.num_designs=10",
    pose_options=["inference.input_pdb='input.pdb'"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the diffusion process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The RFdiffusion class is intended for researchers and developers who need to perform RFdiffusion simulations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(script_path=None, python_path=None, pre_cmd=None, jobstarter=None)[source]

Initialize the RFdiffusion class.

This constructor sets up the necessary paths to the RFdiffusion script and Python executable, and initializes the job starter. The paths are configured using default values from the ProtFlow configuration, but they can be manually set if required. However, manual setting is generally not recommended due to the potential for misconfiguration.

Detailed Description

The __init__ method initializes the RFdiffusion class by setting up essential paths and configurations. It ensures that the paths to the RFdiffusion script and Python executable are correctly set, and it initializes the job starter object. This setup is crucial for the proper execution of RFdiffusion processes within the ProtFlow framework.

Parameters:
  • script_path (str, optional) – The path to the RFdiffusion script. Defaults to the value specified in the ProtFlow configuration (config.RFDIFFUSION_SCRIPT_PATH).

  • python_path (str, optional) – The path to the Python executable used to run the RFdiffusion script. Defaults to the value specified in the ProtFlow configuration (config.RFDIFFUSION_PYTHON_PATH).

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

Raises:

ValueError – If the provided paths are invalid or if there are issues with the configuration.

Examples

Here is an example of how to initialize the RFdiffusion class:

from protflow.jobstarters import JobStarter
from protflow.tools.rfdiffusion import RFdiffusion

# Initialize the RFdiffusion class
rfdiffusion = RFdiffusion()

# Initialize with custom paths
custom_rfdiffusion = RFdiffusion(
    script_path="/path/to/custom/rfdiffusion_script.py",
    python_path="/path/to/custom/python"
)

# Initialize with a job starter
jobstarter = JobStarter()
rfdiffusion_with_jobstarter = RFdiffusion(jobstarter=jobstarter)
Further Details
  • Path Configuration: The paths to the RFdiffusion script and Python executable are critical for the correct functioning of the class. It is recommended to use the default paths provided by the ProtFlow configuration unless there is a specific need to customize them.

  • JobStarter Integration: The JobStarter object is used to manage job execution, ensuring that RFdiffusion processes are handled efficiently. If a JobStarter is not provided, the class will operate without it, but it is recommended to use one for better job management.

This method is designed for initializing the RFdiffusion class with the necessary configurations, making it ready for executing RFdiffusion processes within the ProtFlow framework.

Parameters:
  • script_path (str | None)

  • python_path (str | None)

  • pre_cmd (str | None)

  • jobstarter (JobStarter)

Return type:

None

parse_rfdiffusion_opts(options, pose_options)[source]

Parse and combine general and pose-specific RFdiffusion options into a dictionary.

This method splits and processes both general options and pose-specific options, combining them into a single dictionary. Pose-specific options will overwrite general options if there are conflicts.

Parameters:
  • options (str) – General options for the RFdiffusion script.

  • pose_options (str) – Pose-specific options for the RFdiffusion script.

Returns:

A dictionary containing the combined options, with pose-specific options taking precedence over general options.

Return type:

dict

Examples

Here is an example of how to use the parse_rfdiffusion_opts method:

options = "inference.num_designs=10 inference.use_gpu=True"
pose_options = "inference.input_pdb='input.pdb' inference.num_designs=5"
parsed_opts = rfdiffusion.parse_rfdiffusion_opts(options, pose_options)
# parsed_opts will be:
# {'inference.num_designs': '5', 'inference.use_gpu': 'True', 'inference.input_pdb': "'input.pdb'"}
Further Details:
  • Option Splitting: The method uses regular expressions to split the options string into individual option entries, ensuring that options within quotes are not split incorrectly.

  • Option Overwriting: By adding pose-specific options after general options, the method ensures that pose-specific options can overwrite general options if necessary.

This method is designed to create a consolidated dictionary of options for the RFdiffusion script, facilitating the construction of command strings with the appropriate parameters.
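
The splitting and overwriting behaviour described above can be illustrated with a minimal sketch. It is not ProtFlow's implementation: the function name parse_opts_sketch and the regular expression are assumptions chosen for illustration, and only the key=value format and the "pose-specific options win" rule are taken from the description.

```python
import re


def parse_opts_sketch(options, pose_options):
    """Merge two 'key=value' option strings; later (pose-specific) entries win."""
    # One entry is a key, '=', then a single-quoted, double-quoted, or bare value.
    # Quoted values may contain spaces and are kept in one piece.
    pattern = re.compile(r"""([^\s=]+)=('[^']*'|"[^"]*"|\S+)""")
    merged = {}
    # General options first, pose-specific options second, so the latter overwrite.
    for opt_str in (options or "", pose_options or ""):
        for key, value in pattern.findall(opt_str):
            merged[key] = value
    return merged


parsed = parse_opts_sketch(
    "inference.num_designs=10 inference.use_gpu=True",
    "inference.input_pdb='input.pdb' inference.num_designs=5",
)
print(parsed)
```

Because the pose-specific string is processed last, its inference.num_designs value replaces the general one, while keys that appear only once are kept unchanged.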

remap_motifs(poses, motifs, prefix)[source]

Update ResidueSelection-type motifs in poses.df, given the prefix of an RFdiffusion run.

This method updates the residue mappings of specified motifs in the poses DataFrame based on the RFdiffusion outputs. It ensures that the motifs are correctly mapped to the new residue indices as generated by the diffusion process.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures and associated data.

  • motifs (list) – A list of motif columns to update in the poses DataFrame.

  • prefix (str) – The prefix used to identify the relevant columns in the poses DataFrame from the RFdiffusion outputs.

Returns:

None

Raises:

TypeError – If the motifs are not of the expected type ResidueSelection.

Return type:

None

Further Details:
  • Motif Update: The method ensures that the specified motifs in the poses DataFrame are updated with the new residue mappings from the RFdiffusion outputs.

  • Residue Mapping: The method uses the reference and hallucinated residue indices generated by RFdiffusion to update the motifs.

  • Integration: This method integrates with the RFdiffusion workflow, ensuring that motifs are correctly remapped after diffusion processes.

This method is designed to update the residue mappings of motifs in the poses DataFrame, facilitating accurate representation of the protein structures after RFdiffusion processes.

run(poses, prefix, jobstarter=None, num_diffusions=1, options=None, pose_options=None, overwrite=False, multiplex_poses=False, update_motifs=None, fail_on_missing_output_poses=False)[source]

Execute the RFdiffusion process with given poses and jobstarter configuration.

This method sets up and runs the RFdiffusion process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the input protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • num_diffusions (int, optional) – The number of diffusions to run for each input pose. Be aware that the number of output poses per input pose is multiplex_poses * num_diffusions! Defaults to 1.

  • options (str, optional) – Additional options for the RFdiffusion script. Defaults to None.

  • pose_options (list[str], optional) – A list of pose-specific options for the RFdiffusion script. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • multiplex_poses (int, optional) – If specified, create multiple copies of poses to fully utilize parallel computing. Be aware that the number of output poses per input pose is multiplex_poses * num_diffusions! Defaults to False.

  • update_motifs (list[str], optional) – A list of motifs to update based on the RFdiffusion outputs. Defaults to None.

  • fail_on_missing_output_poses (bool, optional) – If True, fail when some expected output poses are missing. RFdiffusion runs sometimes crash unexpectedly, which can disrupt longer pipelines. Defaults to False.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the RFdiffusion process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If motifs are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rfdiffusion import RFdiffusion

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the RFdiffusion class
rfdiffusion = RFdiffusion()

# Run the diffusion process
results = rfdiffusion.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    num_diffusions=3,
    options="inference.num_designs=10",
    pose_options=["inference.input_pdb='input.pdb'"],
    overwrite=True,
    fail_on_missing_output_poses=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the diffusion process to their specific needs.

This method is designed to streamline the execution of RFdiffusion processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze diffusion simulations.
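
The parameter notes for num_diffusions and multiplex_poses state that each input pose yields multiplex_poses * num_diffusions output poses. The short calculation below makes that concrete. The helper name is hypothetical, and treating the default multiplex_poses=False as "one copy per pose" is an assumption, not documented behaviour.

```python
def expected_outputs_per_input(num_diffusions=1, multiplex_poses=False):
    """Number of output poses per input pose: multiplex_poses * num_diffusions."""
    # Assumption: a falsy multiplex_poses (the default, False) means no
    # multiplexing, which is treated here as a single copy of each pose.
    copies = int(multiplex_poses) if multiplex_poses else 1
    return copies * num_diffusions


no_multiplex = expected_outputs_per_input(num_diffusions=3)
with_multiplex = expected_outputs_per_input(num_diffusions=3, multiplex_poses=4)
print(no_multiplex, with_multiplex)
```

With ten input poses, num_diffusions=3 and multiplex_poses=4, this would correspond to 120 expected output poses in total.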

write_cmd(pose, options, pose_opts, output_dir, num_diffusions=1)[source]

Construct the command to run the RFdiffusion process.

This method constructs the command string required to execute the RFdiffusion process. It combines the specified options and pose-specific options, ensuring that all necessary parameters are included.

Parameters:
  • pose (str) – The path to the input pose file.

  • options (str) – General options for the RFdiffusion script.

  • pose_opts (str) – Pose-specific options for the RFdiffusion script.

  • output_dir (str) – The directory where output files will be saved.

  • num_diffusions (int, optional) – The number of diffusions to run for each input pose. Defaults to 1.

Returns:

The constructed command string to execute the RFdiffusion process.

Return type:

str

Raises:

ValueError – If the provided options or pose_opts are invalid.

Examples

Construct a command for RFdiffusion:

cmd = rfdiffusion.write_cmd("input.pdb", "inference.num_designs=10", "inference.input_pdb='input.pdb'", "/output", 3)
Further Details:
  • Option Parsing: The method parses both general and pose-specific options, ensuring that they are correctly formatted and included in the command string.

  • Command Construction: The constructed command string includes the path to the RFdiffusion script, the specified options, and the output directory.

  • Default Values: Default values for unspecified options, such as inference.num_designs, are included to ensure the command string is complete.

This method is designed to create a fully-formed command string for running RFdiffusion, making it easier to execute diffusion processes with the desired parameters.
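
As a rough illustration of the command construction described above, the sketch below assembles a command string from a merged options dictionary. The function name build_rfdiffusion_cmd and the way the output prefix is derived from the input file name are assumptions; only the key=value option style and the inference.* option names are taken from RFdiffusion's inference script, and ProtFlow's actual write_cmd may differ.

```python
import os


def build_rfdiffusion_cmd(python_path, script_path, pose, opts, output_dir, num_diffusions=1):
    """Assemble a 'key=value' style command line from a merged options dict."""
    opts = dict(opts)  # copy so the caller's dict is not modified
    # Fill in values the caller did not set explicitly.
    opts.setdefault("inference.num_designs", num_diffusions)
    opts.setdefault("inference.input_pdb", pose)
    # Assumption: outputs are written under output_dir using the input file's stem.
    stem = os.path.splitext(os.path.basename(pose))[0]
    opts["inference.output_prefix"] = os.path.join(output_dir, stem)
    overrides = " ".join(f"{key}={value}" for key, value in opts.items())
    return f"{python_path} {script_path} {overrides}"


cmd = build_rfdiffusion_cmd(
    python_path="/path/to/python",
    script_path="/path/to/run_inference.py",
    pose="input.pdb",
    opts={"diffuser.T": 50},
    output_dir="output",
    num_diffusions=3,
)
print(cmd)
```

Using setdefault means an explicitly passed inference.num_designs takes priority over num_diffusions in this sketch; whether the real method resolves that conflict the same way is not stated in the documentation.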

protflow.tools.rfdiffusion.collect_scores(work_dir, rename_pdbs=True)[source]

Collect scores from RFdiffusion output files.

This function collects scores from the .trb files generated by RFdiffusion into a single pandas DataFrame. Optionally, it also renames the output .pdb files based on their new descriptions.

Parameters:
  • work_dir (str) – The working directory where RFdiffusion output files are stored.

  • rename_pdbs (bool, optional) – If True, rename the .pdb files based on the new descriptions. Defaults to True.

Returns:

A DataFrame containing the collected scores from the RFdiffusion output.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If no .pdb files are found in the specified directory.

Examples

Here is an example of how to use the collect_scores function:

from protflow.tools.rfdiffusion import collect_scores

work_dir = "/path/to/output"
scores_df = collect_scores(work_dir, rename_pdbs=True)
# scores_df will contain the combined scores from the RFdiffusion output files

Further Details:
  • Score Collection: The method iterates over .pdb files in the specified directory, collecting corresponding .trb files and concatenating their scores into a DataFrame.

  • File Renaming: If rename_pdbs is set to True, the method renames the .pdb files based on new descriptions to ensure unique identification.

  • DataFrame Structure: The resulting DataFrame includes relevant score information, and columns are renamed appropriately if files are renamed.

This method is designed to streamline the collection and organization of RFdiffusion output scores, facilitating further analysis and processing.

protflow.tools.rfdiffusion.get_residue_mapping(con_ref_idx, con_hal_idx)[source]

Create a residue mapping dictionary from RFdiffusion outputs.

This function creates a dictionary that maps old residue indices (from con_ref_idx) to new residue indices (from con_hal_idx).

Parameters:
  • con_ref_idx (list) – A list of reference residue indices from the RFdiffusion outputs, where each element is a tuple of (chain, residue_id).

  • con_hal_idx (list) – A list of hallucinated (diffused) residue indices from the RFdiffusion outputs, where each element is a tuple of (chain, residue_id).

Returns:

A dictionary where keys are tuples of (chain, residue_id) from con_ref_idx and values are tuples of (chain, residue_id) from con_hal_idx.

Return type:

dict

Examples

Here is an example of how to use the get_residue_mapping function:

from protflow.tools.rfdiffusion import get_residue_mapping

con_ref_idx = [("A", 10), ("A", 20)]
con_hal_idx = [("A", 11), ("A", 21)]
residue_mapping = get_residue_mapping(con_ref_idx, con_hal_idx)
# residue_mapping will be: {("A", 10): ("A", 11), ("A", 20): ("A", 21)}

Further Details:
  • Mapping Creation: The method pairs each element in con_ref_idx with the corresponding element in con_hal_idx to create the mapping.

  • Usage: This mapping is useful for updating residue selections based on RFdiffusion outputs.

This method is designed to facilitate the creation of residue mappings for updating motifs or other residue-based selections.
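
The pairing described above amounts to zipping the two index lists into a dictionary. The following sketch shows that idea in isolation. The function name residue_mapping_sketch and the explicit length check are assumptions added for illustration and are not confirmed to match ProtFlow's implementation.

```python
def residue_mapping_sketch(con_ref_idx, con_hal_idx):
    """Pair each reference residue with its counterpart in the diffused structure."""
    # Assumption: both lists must have the same length; zip() alone would
    # silently drop the unmatched tail of the longer list.
    if len(con_ref_idx) != len(con_hal_idx):
        raise ValueError("con_ref_idx and con_hal_idx must have the same length")
    return dict(zip(con_ref_idx, con_hal_idx))


residue_mapping = residue_mapping_sketch(
    [("A", 10), ("A", 20)],
    [("A", 11), ("A", 21)],
)
print(residue_mapping)
```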

protflow.tools.rfdiffusion.parse_diffusion_trbfile(path)[source]

Parse a .trb file from RFdiffusion and extract relevant scores into a pandas DataFrame.

This function reads a .trb file generated by RFdiffusion, extracts relevant scoring information, and organizes it into a DataFrame. The extracted information includes pLDDT scores, residue indices, and metadata.

Parameters:

path (str) – The path to the .trb file.

Returns:

A DataFrame containing the extracted scores and metadata from the .trb file.

Return type:

pd.DataFrame

Raises:

ValueError – If the provided file path does not end with .trb.

Examples

Here is an example of how to use the parse_diffusion_trbfile function:

from protflow.tools.rfdiffusion import parse_diffusion_trbfile

path = "/path/to/output.trb"
scores_df = parse_diffusion_trbfile(path)
# scores_df will contain the extracted scores and metadata

Further Details:
  • File Reading: The method uses numpy to load the .trb file and allows for pickled objects.

  • Score Extraction: Extracted scores include mean pLDDT, per-residue pLDDT, and other relevant metrics.

  • Metadata Collection: Metadata such as file location, description, and input PDB are included in the DataFrame.

This method is designed to parse and organize the data from RFdiffusion .trb files, making it easier to analyze the results.
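
The file-reading step can be sketched as follows. The example writes a tiny stand-in .trb file (a pickled dictionary) so that it runs without real RFdiffusion output. The helper name mean_final_plddt, the assumed (steps, residues) layout of the "plddt" entry, and the choice to average only the last step are assumptions, not a description of what parse_diffusion_trbfile computes.

```python
import os
import pickle
import tempfile

import numpy as np


def mean_final_plddt(path):
    """Load a .trb-like pickle and average the pLDDT values of the last step."""
    if not path.endswith(".trb"):
        raise ValueError(f"Expected a .trb file, got: {path}")
    # np.load falls back to unpickling when allow_pickle=True and the file
    # is not in .npy/.npz format; the pickled dict is returned as-is.
    data = np.load(path, allow_pickle=True)
    plddt = np.asarray(data["plddt"])  # assumed shape: (steps, residues)
    return float(plddt[-1].mean())


# Write a small stand-in file with two steps and two residues.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.trb")
with open(demo_path, "wb") as handle:
    pickle.dump({"plddt": np.array([[0.5, 0.7], [0.8, 0.9]])}, handle)

demo_mean = mean_final_plddt(demo_path)
print(demo_mean)
```

Since allow_pickle=True executes arbitrary pickled code on load, files should only be read this way when they come from a trusted run.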

protflow.tools.rfdiffusion.prep_motif_input(motif, df)[source]

Ensure motif input is a list and validate that motifs are present in the DataFrame.

This function checks whether the given motif is a string or a list and always returns it as a list. It also validates that the specified motifs are present as columns in the provided DataFrame.

Parameters:
  • motif (Any) – The motif or list of motifs to validate and process.

  • df (pd.DataFrame) – The DataFrame in which to check for the presence of the motifs.

Returns:

A list of motif column names.

Return type:

list[str]

Raises:

ValueError – If any of the specified motifs are not present in the DataFrame.

Examples

Here is an example of how to use the prep_motif_input function:

import pandas as pd
from protflow.tools.rfdiffusion import prep_motif_input

df = pd.DataFrame({"motif1": [1, 2, 3], "motif2": [4, 5, 6]})
motif = "motif1"
motifs_list = prep_motif_input(motif, df)
# motifs_list will be: ["motif1"]

Further Details:
  • Motif Handling: The method ensures that even a single motif string is converted into a list to standardize processing.

  • Validation: It checks that each motif in the list is a column in the provided DataFrame, raising an error if any are missing.

This function is designed to prepare and validate motif inputs for further processing in the RFdiffusion workflow.
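
The normalization and validation steps can be sketched without pandas by passing the column names as a plain list. The function name prep_motif_sketch and the error message are assumptions for illustration; with a real DataFrame, df.columns would take the place of the columns argument.

```python
def prep_motif_sketch(motif, columns):
    """Return motif names as a list after checking each against known columns."""
    # A single string becomes a one-element list; other iterables are copied.
    motifs = [motif] if isinstance(motif, str) else list(motif)
    missing = [name for name in motifs if name not in columns]
    if missing:
        raise ValueError(f"Motif columns not found in DataFrame: {missing}")
    return motifs


columns = ["motif1", "motif2"]
single = prep_motif_sketch("motif1", columns)
several = prep_motif_sketch(["motif1", "motif2"], columns)
print(single, several)
```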

protflow.tools.rfdiffusion.update_motif_res_mapping(motif_l, con_ref_idx, con_hal_idx)[source]

Update motifs in motif_l based on con_ref_idx and con_hal_idx.

This function updates the residue mappings of motifs in the provided list based on the reference and hallucinated residue indices from RFdiffusion outputs.

Parameters:
  • motif_l (list[ResidueSelection]) – A list of ResidueSelection objects representing the motifs to be updated.

  • con_ref_idx (list) – A list of reference residue indices from the RFdiffusion outputs.

  • con_hal_idx (list) – A list of hallucinated (diffused) residue indices from the RFdiffusion outputs.

Returns:

A list of updated ResidueSelection objects with new residue mappings.

Return type:

list

Raises:

TypeError – If any element in motif_l is not of type ResidueSelection.

Examples

Here is an example of how to use the update_motif_res_mapping function:

from protflow.residues import ResidueSelection
from protflow.tools.rfdiffusion import update_motif_res_mapping

motif_l = [ResidueSelection(["A:10", "A:20"]), ResidueSelection(["B:30", "B:40"])]
con_ref_idx = [("A", 10), ("A", 20)]
con_hal_idx = [("A", 11), ("A", 21)]
updated_motifs = update_motif_res_mapping(motif_l, con_ref_idx, con_hal_idx)
# updated_motifs will contain the updated ResidueSelection objects

Further Details:
  • Residue Mapping: The function sets up a mapping dictionary from reference to hallucinated residue indices.

  • Motif Update: Each motif in the input list is updated according to the new residue mappings and returned as a new ResidueSelection object.

This method is designed to update residue selections in motifs based on the outputs from RFdiffusion, facilitating accurate downstream analysis and interpretation.
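
The update step can be illustrated with plain (chain, residue_id) tuples standing in for ResidueSelection objects. The helper remap_residues is hypothetical, and raising KeyError for residues that are absent from the mapping is an assumption; the documentation does not say how ProtFlow treats such residues.

```python
def remap_residues(motif, con_ref_idx, con_hal_idx):
    """Translate (chain, residue_id) tuples from reference to diffused numbering."""
    mapping = dict(zip(con_ref_idx, con_hal_idx))
    # Assumption: a residue missing from the mapping is an error (KeyError).
    return [mapping[residue] for residue in motif]


motifs = [[("A", 10), ("A", 20)], [("A", 20)]]
con_ref_idx = [("A", 10), ("A", 20)]
con_hal_idx = [("B", 3), ("B", 4)]
updated = [remap_residues(motif, con_ref_idx, con_hal_idx) for motif in motifs]
print(updated)
```

Each motif is returned as a new list, mirroring the description that updated motifs come back as new objects rather than being modified in place.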

protflow.tools.rosetta module

Rosetta Module

This module provides the functionality to integrate Rosetta within the ProtFlow framework. It offers tools to run various Rosetta applications, handle their inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The Rosetta class encapsulates the functionality necessary to execute Rosetta runs. It manages the configuration of paths to essential scripts and executables, sets up the environment, and handles the execution of Rosetta processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of Rosetta into larger computational workflows. It supports the automatic setup of job parameters, execution of Rosetta commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the Rosetta class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the Rosetta process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the Rosetta class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rosetta import Rosetta

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the Rosetta class
rosetta = Rosetta()

# Run the Rosetta process
results = rosetta.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    rosetta_application="RosettaScripts",
    nstruct=10,
    options="",
    pose_options=["-parser:protocol my_protocol.xml"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as invalid paths for executables and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the Rosetta process.

  • Customizability: Users can customize the Rosetta process through multiple parameters, including the number of structures (nstruct), specific options for the Rosetta application, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate Rosetta into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.tools.rosetta.Rosetta(script_path=None, pre_cmd=None, jobstarter=None, fail_on_missing_output_poses=False)[source]

Bases: Runner

Rosetta Class

The Rosetta class is a specialized class designed to facilitate the execution of Rosetta applications within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with Rosetta processes.

Detailed Description

The Rosetta class manages all aspects of running Rosetta simulations. It handles the configuration of necessary scripts and executables, prepares the environment for Rosetta processes, and executes the Rosetta commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to Rosetta scripts and executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of Rosetta commands with support for multiple structures (nstruct).

  • Collecting and processing output data into a pandas DataFrame.

  • Cleaning and renaming PDB files based on Rosetta outputs.

  • Handling score files and converting them into a readable format.

Returns:

An instance of the Rosetta class, configured to run Rosetta processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • KeyError

Examples

Here is an example of how to initialize and use the Rosetta class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rosetta import Rosetta

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the Rosetta class
rosetta = Rosetta()

# Run the Rosetta process
results = rosetta.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    rosetta_application="RosettaScripts",
    nstruct=10,
    options="",
    pose_options=["-parser:protocol my_protocol.xml"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as invalid paths for executables, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the Rosetta process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The Rosetta class is intended for researchers and developers who need to perform Rosetta simulations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(script_path=None, pre_cmd=None, jobstarter=None, fail_on_missing_output_poses=False)[source]

Initialize the Rosetta class with the necessary configuration.

This method sets up the Rosetta class with the provided script path and job starter configuration. It initializes the necessary parameters and prepares the environment for executing Rosetta processes.

Parameters:
  • script_path (str, optional) – The path to the Rosetta executable scripts. Defaults to the path specified in config.ROSETTA_BIN_PATH.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • pre_cmd (str | None)

  • fail_on_missing_output_poses (bool)

Raises:

ValueError – If no valid script path is provided.

Return type:

None

Examples

Here is an example of how to initialize the Rosetta class:

from protflow.jobstarters import JobStarter
from protflow.tools.rosetta import Rosetta

# Initialize the Rosetta class with a specific script path
rosetta = Rosetta(script_path="/path/to/rosetta", jobstarter=JobStarter())
Further Details:
  • Configuration: The method checks if the provided script path is valid and sets it up for further use. If no script path is provided, it defaults to the path specified in the ProtFlow configuration.

  • Job Starter: The job starter parameter can be provided to manage the execution of Rosetta jobs. If not provided, it defaults to None, and the default job starter configuration from ProtFlow will be used.

This method ensures that the Rosetta class is correctly initialized with the necessary configurations to run Rosetta applications within the ProtFlow framework.

run(poses, prefix, jobstarter=None, rosetta_application=None, nstruct=1, options=None, pose_options=None, overwrite=False, fail_on_missing_output_poses=False)[source]

Execute the Rosetta process with given poses and jobstarter configuration.

This method sets up and runs the Rosetta process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the input protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • rosetta_application (str, optional) – The specific Rosetta application to be executed. Defaults to None.

  • nstruct (int, optional) – The number of structures to generate for each input pose. Defaults to 1.

  • options (str, optional) – Additional options for the Rosetta application. Defaults to None.

  • pose_options (list[str] | str, optional) – A list of pose-specific options for the Rosetta application. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • fail_on_missing_output_poses (bool)

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the Rosetta process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • KeyError – If forbidden options are provided to the method.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.tools.rosetta import Rosetta

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the Rosetta class
rosetta = Rosetta()

# Run the Rosetta process
results = rosetta.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    rosetta_application="RosettaScripts",
    nstruct=10,
    options="",
    pose_options=["-parser:protocol my_protocol.xml"],
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It verifies that the Rosetta application is executable and configures the execution environment accordingly.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis. It manages score files, renames PDB files, and compiles output data into a structured format.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the Rosetta process to their specific needs. This includes setting the number of structures to generate, providing additional options for the Rosetta application, and specifying pose-specific options.

This method is designed to streamline the execution of Rosetta processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze Rosetta simulations.

write_cmd(rosetta_application, pose_path, output_dir, i, overwrite=False, options=None, pose_options=None)[source]

Writes the command to run a Rosetta application.

This method constructs the command string needed to execute a Rosetta application with the specified options and parameters. It ensures that the command includes all necessary arguments and handles the setup for running the application.

Parameters:
  • rosetta_application (str) – The path to the Rosetta executable.

  • pose_path (str) – The path to the input pose file.

  • output_dir (str) – The directory where output files will be stored.

  • i (int) – The index of the current structure being processed.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • options (str, optional) – Additional options for the Rosetta application. Defaults to None.

  • pose_options (str, optional) – Pose-specific options for the Rosetta application. Defaults to None.

Returns:

The command string to execute the Rosetta application.

Return type:

str

Raises:

KeyError – If forbidden options are included in the provided options or pose_options.

Examples

Here is an example of how to use the write_cmd method:

from protflow.tools.rosetta import Rosetta

# Initialize the Rosetta class
rosetta = Rosetta(script_path="/path/to/rosetta")

# Write the command to run the Rosetta application
cmd = rosetta.write_cmd(
    rosetta_application="/path/to/rosetta/RosettaScripts",
    pose_path="input.pdb",
    output_dir="/path/to/output",
    i=1,
    overwrite=True,
    options="-parser:protocol my_protocol.xml",
    pose_options="-in:file:s input.pdb"
)

print(cmd)
Further Details:
  • Command Construction: The method constructs the command string by combining the executable path, input pose path, output directory, and additional options. It ensures that all necessary arguments are included in the command.

  • Option Parsing: The method parses the provided options and pose_options, ensuring they are correctly formatted and do not include any forbidden options. Forbidden options include arguments that could interfere with the correct execution of the Rosetta application, such as output path settings.

  • Overwrite Handling: If the overwrite parameter is set to True, the command includes the necessary argument to overwrite existing output files. This ensures that the process can be re-run without conflicts.

This method is designed to facilitate the construction of command strings for running Rosetta applications, making it easier for researchers and developers to execute and manage Rosetta simulations within the ProtFlow framework.
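
The interplay of command construction, forbidden options, and overwrite handling can be sketched as below. This is an illustration under stated assumptions: build_rosetta_cmd and FORBIDDEN_FLAGS are hypothetical names, the set of forbidden flags is a guess at which options a runner would reserve for itself, and the use of -out:suffix to encode the structure index is an assumption. The flags themselves (-in:file:s, -out:path:all, -out:suffix, -overwrite) are standard Rosetta options.

```python
# Hypothetical set of flags reserved for the runner; ProtFlow's actual list may differ.
FORBIDDEN_FLAGS = {"in:file:s", "out:path:all", "out:suffix", "overwrite"}


def build_rosetta_cmd(application, pose_path, output_dir, i,
                      overwrite=False, options=None, pose_options=None):
    """Assemble a Rosetta command line, rejecting user options the runner sets itself."""
    user_opts = " ".join(opt for opt in (options, pose_options) if opt)
    for token in user_opts.split():
        if token.startswith("-") and token.lstrip("-") in FORBIDDEN_FLAGS:
            raise KeyError(f"Option {token} is set by the runner and must not be passed")
    parts = [
        application,
        f"-in:file:s {pose_path}",
        f"-out:path:all {output_dir}",
        f"-out:suffix _{i:04d}",  # assumption: index encoded as a zero-padded suffix
    ]
    if overwrite:
        parts.append("-overwrite")
    if user_opts:
        parts.append(user_opts)
    return " ".join(parts)


cmd = build_rosetta_cmd(
    "/path/to/rosetta/RosettaScripts", "input.pdb", "/path/to/output", 1,
    overwrite=True, options="-parser:protocol my_protocol.xml",
)
print(cmd)
```

Rejecting reserved flags up front keeps the input and output paths under the runner's control, which is the stated reason for forbidding output path settings.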

Parameters:
  • script_path (str | None)

  • pre_cmd (str | None)

  • jobstarter (str)

  • fail_on_missing_output_poses (bool)

protflow.tools.rosetta.clean_rosetta_scorefile(path_to_file, out_path)[source]

Cleans a faulty Rosetta scorefile.

This function reads a Rosetta scorefile and removes any lines that do not match the expected format (i.e., lines with a different number of columns than the header). It writes the cleaned scores to a new file.

Parameters:
  • path_to_file (str) – The path to the original Rosetta scorefile.

  • out_path (str) – The path where the cleaned scorefile will be saved.

Returns:

The path to the cleaned scorefile.

Return type:

str

Examples

Here is an example of how to use the clean_rosetta_scorefile function:

from protflow.tools.rosetta import clean_rosetta_scorefile

# Clean the Rosetta scorefile
cleaned_file_path = clean_rosetta_scorefile(
    path_to_file="path/to/original_scorefile.sc",
    out_path="path/to/cleaned_scorefile.sc"
)

print(f"Cleaned scorefile saved at: {cleaned_file_path}")
Further Details:
  • File Reading: The function reads the scorefile line by line and splits each line into columns.

  • Format Verification: It verifies that each line has the same number of columns as the header. Lines with a different number of columns are removed.

  • File Writing: The cleaned scores are written to the specified output file. A warning is logged indicating the number of lines removed during the cleaning process.

This function is useful for ensuring that Rosetta scorefiles are properly formatted and free of inconsistencies, facilitating accurate data analysis.
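
The column-count filter described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's code: the name clean_scorefile_sketch is hypothetical, and the sketch treats the first non-empty line as the header. Real Rosetta scorefiles often begin with a SEQUENCE: line, which an actual implementation would have to skip.

```python
import logging


def clean_scorefile_sketch(path_to_file: str, out_path: str) -> str:
    """Keep only lines whose column count matches the header line."""
    with open(path_to_file, encoding="utf-8") as handle:
        lines = [line for line in handle.read().splitlines() if line.strip()]
    if not lines:
        raise ValueError(f"Scorefile {path_to_file} is empty.")

    # Assumption: the first non-empty line is the header.
    n_columns = len(lines[0].split())
    kept = [line for line in lines if len(line.split()) == n_columns]

    removed = len(lines) - len(kept)
    if removed:
        logging.warning("Removed %d malformed line(s) from %s", removed, path_to_file)

    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(kept) + "\n")
    return out_path
```

Truncated lines of this kind typically appear when several processes append to the same scorefile at once, which is why the filter compares every line against the header rather than trusting the file as a whole.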

protflow.tools.rosetta.collect_scores(work_dir)[source]

Collects scores from Rosetta output files and reindexes PDB files.

This function collects scores from Rosetta output files, reindexes PDB files based on the scores, and stores the scores in a pandas DataFrame. It also renames PDB files in the working directory to match the reindexed names.

Parameters:

work_dir (str) – The directory where Rosetta output files are stored.

Returns:

A DataFrame containing the collected scores with reindexed PDB file names.

Return type:

pandas.DataFrame

Examples

Here is an example of how to use the collect_scores function:

from protflow.tools.rosetta import collect_scores

# Collect scores from the Rosetta output directory
scores_df = collect_scores(work_dir="/path/to/output_directory")

print(scores_df)

Further Details:
  • Score Collection: The function collects score files from the specified directory and reads them into a pandas DataFrame.

  • Reindexing: The function reindexes the PDB files based on the scores and renames the files in the working directory accordingly.

  • File Renaming: The function ensures that all Rosetta output PDB files are renamed to match the reindexed names, and the paths to these files are stored in the DataFrame.

  • Consistency Check: The function waits for all Rosetta output files to appear in the output directory, ensuring that the renaming process is consistent and complete.

This function streamlines collecting and organizing Rosetta output so that the results of Rosetta simulations can be analyzed within the ProtFlow framework.
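
As a rough illustration of the score collection step only, the sketch below gathers whitespace-delimited scorefiles from a directory and attaches a file path to each row. Everything specific here is an assumption: the function name collect_rows_sketch, the "*.sc" file pattern, the "location" key, and the rule that a pose's PDB file is named after its description column. The reindexing and renaming performed by collect_scores are not reproduced, and the sketch returns a list of dictionaries rather than a DataFrame (passing the list to pandas.DataFrame would yield a table).

```python
import glob
import os


def collect_rows_sketch(work_dir: str, pattern: str = "*.sc") -> list:
    """Read every matching scorefile in work_dir into a list of row dicts."""
    rows = []
    for score_path in sorted(glob.glob(os.path.join(work_dir, pattern))):
        with open(score_path, encoding="utf-8") as handle:
            lines = [line.split() for line in handle if line.strip()]
        if not lines:
            continue
        # Assumption: the first non-empty line is the header.
        header, records = lines[0], lines[1:]
        for fields in records:
            if len(fields) != len(header):
                continue  # skip malformed lines
            # Drop the leading "SCORE:" tag column; values stay as strings.
            row = {key: value for key, value in zip(header, fields) if key != "SCORE:"}
            if "description" in row:
                # Assumed naming rule: <description>.pdb inside work_dir.
                row["location"] = os.path.join(work_dir, row["description"] + ".pdb")
            rows.append(row)
    return rows
```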

Module contents

runners submodule init