protflow.metrics package

Submodules

protflow.metrics.fpocket module

FPocket Module

This module provides the functionality to integrate FPocket within the ProtFlow framework. It offers tools to run FPocket, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The FPocket class encapsulates the functionality necessary to execute FPocket runs. It manages the path to the FPocket executable, sets up the environment, and handles the execution of FPocket processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of FPocket into larger computational workflows. It supports the automatic setup of job parameters, execution of FPocket commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the FPocket class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the FPocket process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the FPocket class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the FPocket class
fpocket = FPocket()

# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the FPocket process.

  • Customizability: Users can customize the FPocket process through multiple parameters, including specific options for the FPocket script and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate FPocket into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.metrics.fpocket.FPocket(fpocket_path=None, jobstarter=None)[source]

Bases: Runner

FPocket Class

The FPocket class is a specialized class designed to facilitate the execution of FPocket within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with FPocket processes.

Detailed Description

The FPocket class manages all aspects of running FPocket pocket detection. It handles the configuration of the FPocket executable, prepares the environment for pocket detection, and executes the FPocket commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to FPocket scripts and executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of FPocket commands with support for multiple options and pose-specific parameters.

  • Collecting and processing output data into a pandas DataFrame.

  • Ensuring robust error handling and logging for easier debugging and verification of the FPocket process.

Returns:

An instance of the FPocket class, configured to run FPocket processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the FPocket class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the FPocket class
fpocket = FPocket()

# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the FPocket process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The FPocket class is intended for researchers and developers who need to run FPocket pocket detection as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(fpocket_path=None, jobstarter=None)[source]

Initialize the FPocket class with the specified path and jobstarter configuration.

This constructor sets up the FPocket instance by configuring the path to the FPocket executable and initializing the jobstarter object. It ensures that the necessary components are in place for running FPocket processes.

Parameters:
  • fpocket_path (str, optional) – The path to the FPocket executable. Defaults to the path specified in the ProtFlow configuration (FPOCKET_PATH).

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

Returns:

An instance of the FPocket class, ready to run FPocket processes.

Raises:

ValueError – If the fpocket_path is not provided or is invalid.

Examples

Here is an example of how to initialize the FPocket class:

from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Initialize the FPocket class with default settings
fpocket = FPocket()

# Initialize the FPocket class with a specific jobstarter
jobstarter = JobStarter()
fpocket = FPocket(jobstarter=jobstarter)

Further Details:
  • Path Configuration: Ensures the FPocket executable path is set correctly, raising an error if the path is not provided or invalid.

  • Job Management: Initializes the jobstarter object to manage the execution of FPocket commands, allowing for integration with job scheduling systems.

index_layers = 0

prep_fpocket_options(poses, options, pose_options)[source]

Prepare options for the FPocket process based on given parameters.

This method processes and prepares the options and pose-specific options for the FPocket run. It filters out forbidden options, merges general options with pose-specific options, and formats them for inclusion in the FPocket commands.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • options (str or list[str], optional) – General options for the FPocket script. Defaults to None.

  • pose_options (str or list[str], optional) – A list of pose-specific options for the FPocket script. Defaults to None.

Returns:

A list of formatted option strings for each pose, ready to be used in the FPocket commands.

Return type:

list[str]

Raises:

TypeError – If options or pose_options are not of the expected type.

Examples

Here is an example of how to use the prep_fpocket_options method:

from protflow.poses import Poses
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
fpocket = FPocket()

# Prepare FPocket options
options = "--some-option value"
pose_options = ["--specific-option value"]
prepared_options = fpocket.prep_fpocket_options(poses, options, pose_options)

# Output the prepared options
print(prepared_options)

Further Details:
  • Option Processing: Merges general and pose-specific options, ensuring that forbidden options are removed and the final option strings are correctly formatted.

  • Customization: Allows for extensive customization of the FPocket process through both general and pose-specific options, providing flexibility in configuring FPocket runs.
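The merging step described for prep_fpocket_options can be pictured with a small standalone sketch. It does not reproduce ProtFlow's implementation: the helper names parse_opts and merge_options, the contents of FORBIDDEN, and the rule that pose-specific values override general ones are all assumptions made for illustration.

```python
import shlex

# Hypothetical set of flags the runner would set itself (assumption).
FORBIDDEN = {"--file", "-f"}


def parse_opts(opt_str):
    """Turn '--flag value --switch' into {'--flag': 'value', '--switch': None}.

    Simplification: a token starting with '-' is always treated as a flag,
    so negative numeric values are not supported by this sketch.
    """
    tokens = shlex.split(opt_str or "")
    parsed = {}
    i = 0
    while i < len(tokens):
        key = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        parsed[key] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return parsed


def merge_options(n_poses, options=None, pose_options=None):
    """Return one option string per pose; pose-specific values win."""
    if pose_options is None:
        pose_options = [None] * n_poses
    if len(pose_options) != n_poses:
        raise ValueError("pose_options must contain one entry per pose")
    merged = []
    for pose_opt in pose_options:
        opts = parse_opts(options)
        opts.update(parse_opts(pose_opt))
        # Drop every flag the runner reserves for itself.
        kept = {k: v for k, v in opts.items() if k not in FORBIDDEN}
        merged.append(" ".join(k if v is None else f"{k} {v}" for k, v in kept.items()))
    return merged


print(merge_options(2, "--some-option 1 -f in.pdb", ["--some-option 2", None]))
```

In this sketch the first pose receives its own value for --some-option, the second pose falls back to the general value, and the reserved -f flag is removed from both.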

run(poses, prefix, jobstarter=None, options=None, pose_options=None, return_full_scores=False, overwrite=False)[source]

Execute the FPocket process with given poses and jobstarter configuration.

This method sets up and runs the FPocket process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • options (str or list[str], optional) – Additional options for the FPocket script. Defaults to None.

  • pose_options (str or list[str], optional) – A list of pose-specific options for the FPocket script. Defaults to None.

  • return_full_scores (bool, optional) – If True, include detailed scores for each pocket in the output. Defaults to False.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An updated Poses object containing the processed poses and results of the FPocket process.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If options or pose_options are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the FPocket class
fpocket = FPocket()

# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It moves the poses to the working directory and compiles the FPocket commands for execution.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized into a structured DataFrame. It includes the location of each pocket and integrates the results back into the Poses object.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the FPocket process to their specific needs, including the ability to specify additional FPocket options and pose-specific parameters.

This method is designed to streamline the execution of FPocket processes within the ProtFlow framework, making it easier for researchers and developers to run and analyze pocket detection.

protflow.metrics.fpocket.collect_fpocket_output(output_file, return_full_scores=False)[source]

Collect output from a single FPocket output file.

This function processes the output of a single FPocket output file, extracting scores and other relevant information into a pandas DataFrame.

Parameters:
  • output_file (str) – The path to the FPocket output file.

  • return_full_scores (bool, optional) – If True, include detailed scores for each pocket in the output. Defaults to False.

Returns:

A DataFrame containing the processed output from the FPocket file.

Return type:

pd.DataFrame

Examples

Here is an example of how to use the collect_fpocket_output function:

from protflow.metrics.fpocket import collect_fpocket_output

# Specify the output file
output_file = "path/to/output_file"

# Collect output
output = collect_fpocket_output(output_file, return_full_scores=True)

# Display the output
print(output)

Further Details:
  • Output Processing: The function reads the FPocket output file, extracts relevant scores and information, and formats them into a DataFrame.

  • Detailed Scores: If the return_full_scores parameter is set to True, the function includes detailed scores for each pocket in the DataFrame.

protflow.metrics.fpocket.collect_fpocket_scores(output_dir, return_full_scores=False)[source]

Collect scores from an FPocket output directory.

This function collects and processes the scores from FPocket output files located in the specified directory. It aggregates the scores into a pandas DataFrame for further analysis.

Parameters:
  • output_dir (str) – The path to the directory containing FPocket output files.

  • return_full_scores (bool, optional) – If True, include detailed scores for each pocket in the output. Defaults to False.

Returns:

A DataFrame containing the collected scores from the FPocket output files.

Return type:

pd.DataFrame

Examples

Here is an example of how to use the collect_fpocket_scores function:

from protflow.metrics.fpocket import collect_fpocket_scores

# Specify the output directory
output_dir = "path/to/output_directory"

# Collect scores
scores = collect_fpocket_scores(output_dir, return_full_scores=True)

# Display the scores
print(scores)

Further Details:
  • Score Aggregation: The function looks for FPocket output directories, extracts scores from each output file, and combines them into a single DataFrame.

  • Detailed Scores: If the return_full_scores parameter is set to True, the function includes detailed scores for each pocket in the DataFrame.

protflow.metrics.fpocket.get_outfile_name(outdir)[source]

Get the name of the output file from the output directory.

This function generates the name of the FPocket output file based on the specified output directory.

Parameters:

outdir (str) – The path to the FPocket output directory.

Returns:

The name of the output file within the specified directory.

Return type:

str

Examples

Here is an example of how to use the get_outfile_name function:

from protflow.metrics.fpocket import get_outfile_name

# Specify the output directory
outdir = "path/to/output_directory"

# Get the output file name
output_file_name = get_outfile_name(outdir)

# Display the output file name
print(output_file_name)

Further Details:
  • File Naming: The function constructs the output file name by modifying the output directory name and appending the appropriate suffix.
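As a rough illustration of the naming rule, the sketch below assumes the usual fpocket layout in which a run on <name>.pdb writes a directory <name>_out containing <name>_info.txt. The function name outfile_name_sketch is hypothetical, and the real get_outfile_name may differ in its details.

```python
from pathlib import Path


def outfile_name_sketch(outdir):
    """Map '<dir>/<name>_out' to '<dir>/<name>_out/<name>_info.txt'.

    Assumes the fpocket convention of an '_out' directory suffix and an
    '_info.txt' summary file; both are assumptions of this sketch.
    """
    out = Path(outdir)
    if not out.name.endswith("_out"):
        raise ValueError(f"expected a directory ending in '_out', got {out.name!r}")
    stem = out.name[: -len("_out")]
    return str(out / f"{stem}_info.txt")


print(outfile_name_sketch("runs/design_0001_out"))
```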

protflow.metrics.fpocket.parse_fpocket_outfile(output_file)[source]

Parse the FPocket output file to extract scores.

This function reads and parses the FPocket output file, extracting scores and other relevant information into a pandas DataFrame.

Parameters:

output_file (str) – The path to the FPocket output file.

Returns:

A DataFrame containing the parsed scores from the FPocket output file.

Return type:

pd.DataFrame

Examples

Here is an example of how to use the parse_fpocket_outfile function:

from protflow.metrics.fpocket import parse_fpocket_outfile

# Specify the output file
output_file = "path/to/output_file"

# Parse the output file
scores = parse_fpocket_outfile(output_file)

# Display the scores
print(scores)

Further Details:
  • File Parsing: The function reads the FPocket output file, extracts relevant scores and information, and formats them into a DataFrame.

  • Score Extraction: The function processes the file line by line, extracting score data and organizing it into a structured format.
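A line-by-line parser of this kind might look like the following sketch. It assumes an info file made of "Pocket N :" headers followed by "Key : value" lines, which is how fpocket summaries are commonly laid out. The function name parse_info_text and the example values are invented for illustration and are not taken from ProtFlow.

```python
def parse_info_text(text):
    """Parse 'Pocket N :' blocks of 'Key : value' lines into a list of dicts."""
    pockets = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines without a colon
        key, value = key.strip(), value.strip()
        if key.startswith("Pocket") and not value:
            # A header such as 'Pocket 1 :' opens a new record.
            current = {"pocket": key}
            pockets.append(current)
        elif current is not None:
            try:
                current[key] = float(value)
            except ValueError:
                current[key] = value  # keep non-numeric values as text
    return pockets


# Made-up example content in the assumed layout.
example = """Pocket 1 :
\tScore : \t0.35
\tDruggability Score : \t0.10

Pocket 2 :
\tScore : \t0.12
\tDruggability Score : \t0.01
"""

rows = parse_info_text(example)
print(rows)
```

A list of dictionaries like rows can be passed directly to pandas.DataFrame(rows) to obtain one row per pocket.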

protflow.metrics.generic_metric_runner module

Generic metric runner for ProtFlow.

This module exposes GenericMetric, a lightweight protflow.runners.Runner that executes any importable Python function over the poses stored in a protflow.poses.Poses object. The target function must accept a single pose path as its first positional argument and return a JSON-serializable value. Additional keyword arguments can be forwarded through the runner’s options dictionary.

How it works

GenericMetric.run() resolves the working directory and jobstarter, splits poses.poses_list() into manageable chunks, and starts one worker command per chunk. Each worker re-enters this module as a small CLI program, imports the requested module and function dynamically, evaluates the function on every pose path in its chunk, and stores the results as JSON. The parent process then concatenates the worker outputs and merges them back into poses.df through RunnerOutput.
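The worker step can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the module's actual worker: the function name run_worker is hypothetical, the description is assumed to be the file name without its extension, and os.path.basename stands in for a real metric function.

```python
import importlib
import json
import os


def run_worker(module, function, pose_paths, options=None):
    """Import module.function dynamically and apply it to every pose path."""
    func = getattr(importlib.import_module(module), function)
    options = options or {}
    rows = []
    for path in pose_paths:
        rows.append({
            # Assumption: description is the file name without extension.
            "description": os.path.splitext(os.path.basename(path))[0],
            "location": path,
            "data": func(path, **options),
        })
    return rows


# os.path.basename is used here only as a stand-in metric function.
rows = run_worker("os.path", "basename", ["/data/designs/design_0001.pdb"])
print(json.dumps(rows, indent=2))
```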

Walkthrough

The example below calculates the radius of gyration for every pose by reusing protflow.utils.metrics.calc_rog_of_pdb:

from protflow.poses import Poses
from protflow.jobstarters import SbatchArrayJobstarter
from protflow.metrics.generic_metric_runner import GenericMetric

poses = Poses(
    poses=["/data/designs/design_0001.pdb", "/data/designs/design_0002.pdb"],
    work_dir="/data/protflow_runs"
)
cpu_jobstarter = SbatchArrayJobstarter(max_cores=10)

rog = GenericMetric(
    module="protflow.utils.metrics",
    function="calc_rog_of_pdb",
    options={"chain": "A"},
    jobstarter=cpu_jobstarter,
)

poses = rog.run(poses=poses, prefix="rog")

# GenericMetric stores the returned value in <prefix>_data.
print(poses.df[["poses_description", "rog_data"]])

In that run, GenericMetric will:

  1. Build /data/protflow_runs/rog as its working directory.

  2. Split the input pose paths into chunks based on max_cores and a hard limit of 100 poses per command.

  3. Launch worker commands that call calc_rog_of_pdb(pose_path, chain="A").

  4. Save intermediate JSON files such as out_0.json.

  5. Merge the combined results back into poses.df as rog_data, rog_description, and rog_location.
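Step 2 can be illustrated with a short chunking sketch. The exact splitting rule used by GenericMetric is not spelled out here, so the function below is only one plausible reading: it uses at most max_cores chunks unless that would exceed the per-command limit, in which case it adds chunks. The name chunk_poses is hypothetical.

```python
import math


def chunk_poses(pose_paths, max_cores, max_per_chunk=100):
    """Split pose_paths into chunks of at most max_per_chunk poses each."""
    if max_cores < 1 or max_per_chunk < 1:
        raise ValueError("max_cores and max_per_chunk must be positive")
    if not pose_paths:
        return []
    n_poses = len(pose_paths)
    # Use up to max_cores chunks, but more if the size limit demands it.
    n_chunks = max(min(max_cores, n_poses), math.ceil(n_poses / max_per_chunk))
    size = math.ceil(n_poses / n_chunks)
    return [pose_paths[i:i + size] for i in range(0, n_poses, size)]


paths = [f"pose_{i}.pdb" for i in range(250)]
print([len(chunk) for chunk in chunk_poses(paths, max_cores=2)])
```

With 250 poses and two cores, the size limit forces a third chunk in this sketch, so no worker command receives more than 100 poses.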

This module is intended for simple, embarrassingly parallel per-pose metrics. If your function needs multiple inputs, non-JSON output, or a richer output schema than a single data column, a dedicated runner is usually a better fit.

class protflow.metrics.generic_metric_runner.GenericMetric(python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]

Bases: Runner

Run a simple Python metric function over every pose in a Poses.

GenericMetric is the most lightweight metric runner in ProtFlow. You point it at an importable module and a function name, optionally provide a shared options dictionary, and the runner takes care of chunking the pose list, dispatching jobs through a JobStarter, collecting the JSON outputs, and merging the results back into poses.df.

The target function contract is intentionally small:

  • The first positional argument must be the pose path.

  • Optional keyword arguments can be supplied via options.

  • The return value must be serializable to JSON.

The resulting metric value is stored in <prefix>_data after the run is merged back into poses.df.
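A function that satisfies this contract could look like the sketch below. The metric count_ca_atoms and the helper pdb_atom_line are invented for illustration and are not part of ProtFlow. The function takes the pose path first, accepts an optional keyword argument, and returns a plain integer, which is JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def count_ca_atoms(pose_path, chain=None):
    """Count C-alpha ATOM records in a PDB file, optionally for one chain."""
    count = 0
    with open(pose_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("ATOM"):
                continue
            atom_name = line[12:16].strip()  # PDB columns 13-16
            chain_id = line[21:22]           # PDB column 22
            if atom_name == "CA" and (chain is None or chain_id == chain):
                count += 1
    return count


def pdb_atom_line(serial, name, chain):
    """Format a minimal fixed-width ATOM record (coordinates omitted)."""
    return "ATOM  {:>5} {:<4} ALA {}{:>4}\n".format(serial, name, chain, 1)


# Demonstration on a throwaway file with two chains.
with tempfile.TemporaryDirectory() as tmp:
    pdb = Path(tmp) / "toy.pdb"
    pdb.write_text(
        pdb_atom_line(1, " N  ", "A")
        + pdb_atom_line(2, " CA ", "A")
        + pdb_atom_line(3, " CA ", "B"),
        encoding="utf-8",
    )
    total = count_ca_atoms(pdb)
    chain_a = count_ca_atoms(pdb, chain="A")
    print(json.dumps({"all": total, "chain_A": chain_a}))
```

If such a function lived in an importable module, it could be passed to GenericMetric through the module and function arguments, with options={"chain": "A"} forwarding the keyword argument.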

__init__(python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]

Initialize a generic per-pose metric runner.

Parameters:
  • python_path (str | None, optional) – Python interpreter used to launch worker commands. If omitted, the interpreter from the configured PROTFLOW_ENV is used.

  • module (str | None, optional) – Importable module path that contains the target metric function.

  • function (str | None, optional) – Name of the function to call inside module.

  • options (dict | None, optional) – Keyword arguments forwarded to the target function for every pose.

  • jobstarter (JobStarter | None, optional) – Default jobstarter used when run() is called without one.

  • overwrite (bool, optional) – Whether existing runner scorefiles should be recomputed by default.

run(poses, prefix, python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]

Execute the configured metric function across all poses.

Parameters:
  • poses (Poses) – Input poses. GenericMetric reads the pose file paths from poses.df["poses"].

  • prefix (str) – Prefix used for the runner work directory, cached scorefile, and merged result columns.

  • python_path (str | None, optional) – Python interpreter used for worker commands. Defaults to the value configured on the runner instance.

  • module (str | None, optional) – Importable module path for the metric function. Defaults to the value configured on the runner instance.

  • function (str | None, optional) – Function name inside module. Defaults to the value configured on the runner instance.

  • options (dict | None, optional) – Shared keyword arguments forwarded to the metric function. Defaults to the value configured on the runner instance.

  • jobstarter (JobStarter | None, optional) – Jobstarter used for this invocation. Resolution priority is run(jobstarter) -> self.jobstarter -> poses.default_jobstarter.

  • overwrite (bool, optional) – If True, recompute the metric even when the cached scorefile already exists.

Returns:

The input Poses instance with additional columns such as <prefix>_data, <prefix>_description, and <prefix>_location merged into poses.df.

Return type:

Poses

Raises:
  • ValueError – If options is not a dictionary or if no usable jobstarter is available.

  • RuntimeError – If fewer output rows are collected than input poses, which usually indicates failed worker jobs.
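The jobstarter resolution order and the ValueError case can be summarized in a minimal sketch. The function name resolve_jobstarter is hypothetical, and plain strings stand in for JobStarter objects so that the example runs without ProtFlow.

```python
def resolve_jobstarter(run_jobstarter=None, runner_jobstarter=None, poses_jobstarter=None):
    """Return the first jobstarter that is set, in priority order.

    Priority: argument passed to run(), then the runner's default,
    then the default attached to the poses.
    """
    for candidate in (run_jobstarter, runner_jobstarter, poses_jobstarter):
        if candidate is not None:
            return candidate
    raise ValueError("No jobstarter available: pass one to run() or set a default.")


# Strings are placeholders for JobStarter instances.
print(resolve_jobstarter(None, "runner-default", "poses-default"))
```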

Examples

from protflow.metrics.generic_metric_runner import GenericMetric

rog = GenericMetric(
    module="protflow.utils.metrics",
    function="calc_rog_of_pdb",
    options={"chain": "A"},
)

poses = rog.run(poses=poses, prefix="rog", jobstarter=cpu_jobstarter)

Notes

Internally, run() launches this module as a worker script for each pose chunk. Each worker writes a JSON file with the columns data, description, and location. The parent process concatenates those files and lets RunnerOutput merge the final table back into poses.df.
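The collection step, including the check behind the RuntimeError, can be sketched as follows. The file layout (a JSON list of records per worker) and the function name collect_worker_outputs are assumptions for illustration. The actual runner delegates the final merge to RunnerOutput.

```python
import json
import tempfile
from pathlib import Path


def collect_worker_outputs(work_dir, n_expected):
    """Concatenate all out_*.json files and verify that no pose is missing."""
    rows = []
    for json_file in sorted(Path(work_dir).glob("out_*.json")):
        rows.extend(json.loads(json_file.read_text(encoding="utf-8")))
    if len(rows) < n_expected:
        raise RuntimeError(
            f"collected {len(rows)} rows for {n_expected} poses; "
            "some worker jobs may have failed"
        )
    return rows


# Demonstration with two made-up worker outputs.
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["design_0001", "design_0002"]):
        record = [{
            "description": name,
            "location": f"/data/designs/{name}.pdb",
            "data": 12.5 + i,
        }]
        (Path(tmp) / f"out_{i}.json").write_text(json.dumps(record), encoding="utf-8")
    merged = collect_worker_outputs(tmp, n_expected=2)
    print([row["description"] for row in merged])
```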

set_function(function)[source]

Set the function name to import from self.module.

Parameters:

function (str) – Attribute name of the target metric function.

Return type:

None

set_jobstarter(jobstarter)[source]

Set the default jobstarter for this runner instance.

Parameters:

jobstarter (JobStarter | None) – Jobstarter used when run() does not receive one explicitly.

Raises:

ValueError – If jobstarter is neither None nor a JobStarter.

Return type:

None

set_module(module)[source]

Set the importable module path that contains the metric function.

Parameters:

module (str) – Importable module path, for example "protflow.utils.metrics".

Return type:

None

set_options(options)[source]

Set shared keyword arguments for the metric function.

Parameters:

options (dict | None) – Keyword arguments forwarded as function(pose, **options).

Raises:

ValueError – If options is neither None nor a dictionary.

Return type:

None

set_python_path(python_path)[source]

Set the Python interpreter used for worker execution.

Parameters:

python_path (str)

Return type:

None

protflow.metrics.generic_metric_runner.main(args)[source]

Worker entrypoint used by GenericMetric.run().

The parent runner starts this module as a CLI script, passes a comma-separated list of pose paths plus the import target, and expects a JSON file containing data, description, and location columns.

protflow.metrics.protparam module

ProtParam Module

This module provides the functionality to integrate ProtParam calculations within the ProtFlow framework. It offers tools to compute various protein sequence features using the Biopython Bio.SeqUtils.ProtParam module, handling inputs and outputs efficiently, and processing the resulting data in a structured and automated manner.

Detailed Description

The ProtParam class encapsulates the functionality necessary to execute ProtParam calculations. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of parameter calculations. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of ProtParam into larger computational workflows. It supports the automatic setup of job parameters, execution of ProtParam commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the ProtParam class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the ProtParam process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the ProtParam class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.protparam import ProtParam

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProtParam class
protparam = ProtParam()

# Run the ProtParam calculation process
results = protparam.run(
    poses=poses,
    prefix="experiment_1",
    seq_col=None,
    pH=7,
    overwrite=True,
    jobstarter=jobstarter
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the ProtParam process.

  • Customizability: Users can customize the ProtParam process through multiple parameters, including the pH for determining protein total charge, specific options for the ProtParam script, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
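To make the role of the pH parameter concrete, the sketch below estimates a sequence's net charge with the Henderson-Hasselbalch relation. It is a standalone illustration, not the Biopython or ProtFlow implementation: the pKa values are illustrative assumptions, Biopython uses its own table, and the function name net_charge is hypothetical.

```python
# Illustrative pKa values (assumption); real tools use their own tables.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(sequence, pH=7.0):
    """Estimate the net charge of a peptide at the given pH."""
    # Termini: the N-terminus carries up to +1, the C-terminus up to -1.
    charge = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    for aa in sequence.upper():
        if aa in POSITIVE_PKA:
            # Fraction of the basic side chain that is protonated.
            charge += 1.0 / (1.0 + 10 ** (pH - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            # Fraction of the acidic side chain that is deprotonated.
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - pH))
    return charge


for ph in (3.0, 7.0, 11.0):
    print(ph, round(net_charge("ACDEKRHY", pH=ph), 2))
```

The estimated charge falls as the pH rises, which is why the pH passed to run affects the reported total charge.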

This module is intended for researchers and developers who need to incorporate ProtParam calculations into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.metrics.protparam.ProtParam(jobstarter=None, python=None)[source]

Bases: Runner

Runner class that calculates protein parameters from sequence using the Biopython Bio.SeqUtils.ProtParam module.

__init__(jobstarter=None, python=None)[source]

Initialize the ProtParam class.

This constructor sets up the necessary environment for running ProtParam calculations. It initializes the job starter and sets the path to the Python executable within the ProtFlow environment.

Parameters:
  • jobstarter (str, optional) – The job starter to be used for executing ProtParam commands. If not provided, it defaults to None.

  • python (str | None, optional) – The path to the Python executable within the ProtFlow environment. The default value is constructed using the PROTFLOW_ENV environment variable.

jobstarter

Stores the job starter to be used for executing ProtParam commands.

Type:

str

python

The path to the Python executable within the ProtFlow environment.

Type:

str

Raises:

FileNotFoundError – If the default Python executable is not found in the specified path.

Examples

Here is an example of how to initialize the ProtParam class:

from protflow.metrics.protparam import ProtParam

# Initialize the ProtParam class with default settings
protparam = ProtParam()

# Initialize the ProtParam class with a specific job starter
custom_jobstarter = "my_custom_jobstarter"
protparam = ProtParam(jobstarter=custom_jobstarter)

The __init__ method ensures that the ProtParam class is ready to perform protein sequence parameter calculations within the ProtFlow framework, setting up the environment and configurations necessary for successful execution.

run(poses, prefix, seq_col=None, pH=7, overwrite=False, jobstarter=None)[source]

ProtParam Class

The ProtParam class is a specialized class designed to facilitate the calculation of protein sequence parameters within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with ProtParam calculations.

Detailed Description

The ProtParam class manages all aspects of running ProtParam calculations. It handles the configuration of necessary scripts and executables, prepares the environment for sequence feature calculations, and executes the ProtParam commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to ProtParam scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of ProtParam commands with support for various input types.

  • Collecting and processing output data into a pandas DataFrame.

  • Customizing the sequence feature calculations based on user-defined parameters such as pH.

Returns:

An instance of the ProtParam class, configured to run ProtParam calculations and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the ProtParam class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.protparam import ProtParam

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProtParam class
protparam = ProtParam()

# Run the ProtParam calculation process
results = protparam.run(
    poses=poses,
    prefix="experiment_1",
    seq_col=None,
    pH=7,
    overwrite=True,
    jobstarter=jobstarter
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the ProtParam calculations to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The ProtParam class is intended for researchers and developers who need to perform ProtParam calculations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

Parameters:
Return type:

None

protflow.metrics.protparam.main(args)[source]

Runs protparams.

protflow.metrics.rmsd module

RMSD Module

This module provides the functionality to calculate Root Mean Square Deviation (RMSD) values for protein structures within the ProtFlow framework. It offers tools to run RMSD calculations, handle inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The BackboneRMSD and MotifRMSD classes encapsulate the functionality necessary to execute RMSD calculations. These classes manage the configuration of paths to essential scripts and Python executables, set up the environment, and handle the execution of RMSD calculations. They also include methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of RMSD calculations into larger computational workflows. It supports the automatic setup of job parameters, execution of RMSD commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.
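
For orientation, the quantity reported by these runners can be sketched in a few lines of plain Python. The helper below is illustrative only: `rmsd` is a hypothetical name, not a ProtFlow function, and it assumes the two coordinate lists are already superimposed and paired atom by atom.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Three paired atoms; the model is shifted by 1 along z (hypothetical coordinates).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(reference, model))
```

Every atom pair in this toy input is exactly one unit apart, so the value is the same for each pair and for the whole set.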

Usage

To use this module, create an instance of the BackboneRMSD or MotifRMSD class and invoke their run methods with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the RMSD calculation process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the BackboneRMSD class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import BackboneRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Run the RMSD calculation
results = backbone_rmsd.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    ref_col="reference",
    chains=["A", "B"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the RMSD calculation process.

  • Customizability: Users can customize the RMSD calculation process through multiple parameters, including the specific atoms and chains to be used in the calculation, as well as jobstarter configurations.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate RMSD calculations into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.metrics.rmsd.AtomRMSD(ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=False, jobstarter=None, overwrite=False)[source]

Bases: Runner

Runner for atom-level BioPython RMSD calculations.

Parameters:
__init__(ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=False, jobstarter=None, overwrite=False)[source]

Initialize an atom-level RMSD runner.

Parameters:
Return type:

None

run(poses, prefix, jobstarter=None, overwrite=False, ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=None)[source]

Run atom-level RMSD calculation and merge the resulting scores into poses.

Parameters:
Return type:

Poses
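
Merging scores into poses can be pictured as a keyed join in which every score column receives the run prefix. The sketch below uses plain dictionaries instead of a Poses object; the function name `merge_scores`, the `description` key, and the `<prefix>_<column>` naming are assumptions made for illustration, not the documented behaviour of AtomRMSD.

```python
def merge_scores(pose_rows, score_rows, prefix):
    """Attach prefixed score columns to the pose rows that share a description."""
    scores_by_name = {row["description"]: row for row in score_rows}
    merged = []
    for pose in pose_rows:
        scores = scores_by_name.get(pose["description"])
        if scores is None:
            raise KeyError(f"no scores found for pose {pose['description']!r}")
        new_row = dict(pose)
        for key, value in scores.items():
            if key != "description":
                # Prefixing keeps results of different runs apart (assumed convention).
                new_row[f"{prefix}_{key}"] = value
        merged.append(new_row)
    return merged

pose_rows = [{"description": "design_1"}, {"description": "design_2"}]
score_rows = [
    {"description": "design_2", "rmsd": 1.8},
    {"description": "design_1", "rmsd": 0.6},
]
merged = merge_scores(pose_rows, score_rows, prefix="atom_rmsd")
print(merged)
```

The join is keyed by name, so the order of the score rows does not matter.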

setup_input_dict(poses, ref_col, ref_path, ref_atoms, ref_superimpose_atoms, target_atoms, target_superimpose_atoms, return_superimposed)[source]

Set up the JSON input dictionary for calc_atom_rmsd.py.

Parameters:
Return type:

dict[str, dict[str, Any]]

class protflow.metrics.rmsd.BackboneRMSD(ref_col=None, atoms=['CA'], chains=None, overwrite=False, jobstarter=None)[source]

Bases: Runner

BackboneRMSD Class

The BackboneRMSD class is a specialized class designed to facilitate the calculation of backbone RMSD values within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with RMSD calculations.

Detailed Description

The BackboneRMSD class manages all aspects of calculating RMSD for protein backbones. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to RMSD calculation scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of RMSD commands with support for different atoms and chains.

  • Collecting and processing output data into a pandas DataFrame.

  • Managing overwrite options and handling existing score files.

Returns:

An instance of the BackboneRMSD class, configured to run RMSD calculations and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the BackboneRMSD class:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Run the RMSD calculation
results = backbone_rmsd.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    ref_col="reference_location",
    chains=["A", "B"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the RMSD calculation process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The BackboneRMSD class is intended for researchers and developers who need to perform backbone RMSD calculations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
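
The interplay between existing score files and the overwrite option can be sketched as follows. This is a minimal illustration, not the class's implementation: `load_or_compute`, the JSON score file, and its file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def load_or_compute(scorefile, compute, overwrite=False):
    """Reuse an existing score file unless overwrite is requested."""
    scorefile = Path(scorefile)
    if scorefile.is_file() and not overwrite:
        return json.loads(scorefile.read_text())
    scores = compute()
    scorefile.write_text(json.dumps(scores))
    return scores

calls = []

def expensive_rmsd():
    """Stand-in for the real calculation; records how often it runs."""
    calls.append(1)
    return {"design_1": 0.6}

with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "experiment_1_scores.json"
    load_or_compute(path, expensive_rmsd)                  # computes and writes the file
    load_or_compute(path, expensive_rmsd)                  # reuses the file
    load_or_compute(path, expensive_rmsd, overwrite=True)  # recomputes on request

print(len(calls))
```

Only the first and the third call trigger the calculation; the second one is served from the file on disk.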

__init__(ref_col=None, atoms=['CA'], chains=None, overwrite=False, jobstarter=None)[source]

Initialize the BackboneRMSD class.

This constructor sets up the BackboneRMSD instance with default or provided parameters. It configures the reference column, atoms, chains, jobstarter, and overwrite options for RMSD calculations.

Parameters:
  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to ["CA"].

  • chains (list[str], optional) – The list of chain names to calculate RMSD over. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.

Returns:

None

Examples

Here is an example of how to initialize the BackboneRMSD class:

from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD

# Initialize the BackboneRMSD class with default parameters
backbone_rmsd = BackboneRMSD()

# Initialize the BackboneRMSD class with custom parameters
backbone_rmsd = BackboneRMSD(
    ref_col="reference", atoms=["CA", "CB"], chains=["A", "B"],
    overwrite=True, jobstarter=LocalJobStarter(max_cores=4),  # a JobStarter instance, not a string
)

Further Details:
  • Default Values: If no parameters are provided, the class initializes with default values suitable for basic RMSD calculations.

  • Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.

  • Custom Configuration: Users can customize the RMSD calculation process by providing specific values for the reference column, atoms, chains, and jobstarter.

calc_all_atom_rmsd()[source]

Method to calculate all-atom RMSD between poses

Return type:

None

run(poses, prefix, ref_col=None, jobstarter=None, chains=None, overwrite=False)[source]

Calculate the backbone RMSD for given poses and jobstarter configuration.

This method sets up and runs the RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • chains (list[str], optional) – A list of chain names to calculate RMSD over. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the RMSD calculation.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If chains are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Run the RMSD calculation
results = backbone_rmsd.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    ref_col="reference",
    chains=["A", "B"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.

  • Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running RMSD calculations using BioPython.

  • Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the RMSD calculation process to their specific needs, including specifying atoms and chains for RMSD calculations.

This method is designed to streamline the execution of backbone RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze RMSD calculations.
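
The splitting and input-file steps described above can be sketched with the standard library alone. The helper names, the JSON layout (pose path mapped to reference path), and the file naming are assumptions for illustration; the files ProtFlow actually writes may look different.

```python
import json
import tempfile
from pathlib import Path

def split_evenly(items, n_chunks):
    """Split items into at most n_chunks contiguous sublists of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def write_input_files(pose_to_ref, n_chunks, work_dir):
    """Write one JSON input file per sublist and return the file paths."""
    work_dir = Path(work_dir)
    paths = []
    for i, chunk in enumerate(split_evenly(sorted(pose_to_ref), n_chunks)):
        path = work_dir / f"rmsd_input_{i:04d}.json"
        path.write_text(json.dumps({pose: pose_to_ref[pose] for pose in chunk}, indent=2))
        paths.append(path)
    return paths

# Five hypothetical poses that share one reference, distributed over two jobs.
pose_to_ref = {f"design_{i}.pdb": "reference.pdb" for i in range(5)}
with tempfile.TemporaryDirectory() as work_dir:
    files = write_input_files(pose_to_ref, n_chunks=2, work_dir=work_dir)
    sizes = [len(json.loads(f.read_text())) for f in files]
print(sizes)
```

Each input file can then be handed to one worker process, and the per-file score tables are concatenated once all workers have finished.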

set_atoms(atoms)[source]

Set the atoms for RMSD calculations.

This method sets the list of atom names to calculate RMSD over. If “all” is provided, all atoms will be considered.

Parameters:

atoms (list[str]) – The list of atom names to calculate RMSD over.

Returns:

None

Raises:

TypeError – If atoms is not a list of strings.

Return type:

None

Examples

Here is an example of how to use the set_atoms method:

from protflow.metrics.rmsd import BackboneRMSD

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Set the atoms for RMSD calculation
backbone_rmsd.set_atoms(["CA", "CB"])

Further Details:
  • Usage: The list of atoms specifies which atoms in the protein backbone will be considered during RMSD calculations.

  • Validation: The method includes validation to ensure that the atoms parameter is a list of strings, representing valid atom names.

  • Flexibility: Users can specify any set of atoms or choose to include all atoms by setting the parameter to “all”.
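
The validation rule described for set_atoms can be sketched as a standalone check. `validate_atoms` is a hypothetical helper written for this illustration; it mirrors the documented contract (a list of strings, or "all") and is not copied from the class.

```python
def validate_atoms(atoms):
    """Return atoms unchanged if it is "all" or a list of strings; raise TypeError otherwise."""
    if atoms == "all":
        return atoms
    if isinstance(atoms, list) and all(isinstance(atom, str) for atom in atoms):
        return atoms
    raise TypeError(f"atoms must be a list of str or 'all', got {atoms!r}")

print(validate_atoms(["CA", "CB"]))
print(validate_atoms("all"))

try:
    validate_atoms(["CA", 1])
except TypeError as err:
    print(f"rejected: {err}")
```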

set_chains(chains)[source]

Set the chains for RMSD calculations.

This method sets the list of chain names to calculate RMSD over. It ensures that the provided chains parameter is a list of strings or a single string representing chain names.

Parameters:

chains (list[str] or str) – The list of chain names or a single chain name to calculate RMSD over.

Returns:

None

Raises:

TypeError – If chains is not a list of strings or a single string.

Return type:

None

Examples

Here is an example of how to use the set_chains method:

from protflow.metrics.rmsd import BackboneRMSD

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Set the chains for RMSD calculation
backbone_rmsd.set_chains(["A", "B"])

# Alternatively, set a single chain
backbone_rmsd.set_chains("A")

Further Details:
  • Usage: The chains parameter specifies which chains in the protein structure will be considered during RMSD calculations.

  • Validation: The method includes validation to ensure that the chains parameter is either a list of strings or a single string, representing valid chain names.

  • Flexibility: Users can specify multiple chains as a list or a single chain as a string, providing flexibility in how the RMSD calculations are configured.
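
One way to accept both input forms is to normalize a single chain name into a one-element list before validating. The sketch below assumes that normalization step; whether set_chains stores a lone string as a list internally is not stated here, and `normalize_chains` is a hypothetical name.

```python
def normalize_chains(chains):
    """Accept one chain name or a list of chain names and return a list of names."""
    if isinstance(chains, str):
        return [chains]
    if isinstance(chains, list) and all(isinstance(chain, str) for chain in chains):
        return list(chains)
    raise TypeError(f"chains must be a str or a list of str, got {chains!r}")

print(normalize_chains("A"))
print(normalize_chains(["A", "B"]))
```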

set_jobstarter(jobstarter)[source]

Set the jobstarter configuration for the BackboneRMSD runner.

This method sets the jobstarter configuration to be used in the RMSD calculation process.

Parameters:

jobstarter (JobStarter) – The jobstarter configuration for running the RMSD calculations.

Returns:

None

Raises:

TypeError – If jobstarter is not of type JobStarter.

Return type:

None

Examples

Here is an example of how to use the set_jobstarter method:

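from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Set the jobstarter configuration (must be a JobStarter instance; a plain string raises TypeError)
backbone_rmsd.set_jobstarter(LocalJobStarter(max_cores=4))

Further Details: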
  • Usage: The jobstarter configuration specifies how the RMSD calculations will be managed and executed, particularly in HPC environments.

  • Validation: The method includes validation to ensure that the jobstarter parameter is of the correct type.

  • Integration: The jobstarter configuration set by this method is used by other methods in the class to manage the execution of RMSD calculations.

set_ref_col(ref_col)[source]

Set the reference column for RMSD calculations.

This method sets the default reference column to be used in the RMSD calculation process.

Parameters:

ref_col (str) – The reference column for RMSD calculations.

Returns:

None

Raises:

TypeError – If ref_col is not of type string.

Return type:

None

Examples

Here is an example of how to use the set_ref_col method:

from protflow.metrics.rmsd import BackboneRMSD

# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()

# Set the reference column
backbone_rmsd.set_ref_col("reference")

Further Details:
  • Usage: The reference column is used to identify which column in the input data contains the reference structures for RMSD calculation.

  • Validation: The method includes validation to ensure that the reference column is of the correct type.

  • Integration: The reference column set by this method is used by other methods in the class to perform RMSD calculations.

class protflow.metrics.rmsd.MotifRMSD(ref_col=None, target_motif=None, ref_motif=None, atoms=None, return_superimposed_poses=False, jobstarter=None, overwrite=False)[source]

Bases: Runner

MotifRMSD Class

The MotifRMSD class is a specialized class designed to facilitate the calculation of RMSD values for specific motifs within protein structures in the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with motif-specific RMSD calculations.

Detailed Description

The MotifRMSD class manages all aspects of calculating RMSD for specified motifs within protein structures. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to motif RMSD calculation scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of RMSD commands with support for various motifs and chains.

  • Collecting and processing output data into a pandas DataFrame.

  • Managing overwrite options and handling existing score files.

Returns:

An instance of the MotifRMSD class, configured to run motif RMSD calculations and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the MotifRMSD class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()

# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B",
    atoms=["CA", "CB"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the motif RMSD calculation process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The MotifRMSD class is intended for researchers and developers who need to perform RMSD calculations for specific motifs as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(ref_col=None, target_motif=None, ref_motif=None, atoms=None, return_superimposed_poses=False, jobstarter=None, overwrite=False)[source]

Initialize the MotifRMSD class.

This constructor sets up the MotifRMSD instance with default or provided parameters. It configures the reference column, target motif, reference motif, atoms, return of superimposed poses, jobstarter, and overwrite options for RMSD calculations.

Parameters:
  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.

  • ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.

  • atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to None.

  • return_superimposed_poses (bool, optional) – If True, return superimposed poses as new poses. Defaults to False.

  • jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

None

Examples

Here is an example of how to initialize the MotifRMSD class:

from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD

# Initialize the MotifRMSD class with default parameters
motif_rmsd = MotifRMSD()

# Initialize the MotifRMSD class with custom parameters
motif_rmsd = MotifRMSD(
    ref_col="reference",
    target_motif="motif_A",
    ref_motif="motif_B",
    atoms=["CA", "CB"],
    return_superimposed_poses=False,
    jobstarter=JobStarter(),
    overwrite=True
)

Further Details:
  • Default Values: If no parameters are provided, the class initializes with default values suitable for basic motif-specific RMSD calculations.

  • Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.

  • Custom Configuration: Users can customize the motif RMSD calculation process by providing specific values for the reference column, target motif, reference motif, atoms, jobstarter, and overwrite option.

run(poses, prefix, jobstarter=None, ref_col=None, ref_motif=None, target_motif=None, atoms=None, return_superimposed_poses=False, overwrite=False)[source]

Calculate the motif-specific RMSD for given poses and jobstarter configuration.

This method sets up and runs the motif-specific RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.

  • target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.

  • atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to None.

  • return_superimposed_poses (bool, optional) – If True, return superimposed poses as new poses.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the RMSD calculation.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If motifs or atoms are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()

# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B",
    atoms=["CA", "CB"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.

  • Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running motif-specific RMSD calculations.

  • Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the motif RMSD calculation process to their specific needs, including specifying reference and target motifs, as well as atoms for RMSD calculations.

This method is designed to streamline the execution of motif-specific RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze motif-specific RMSD calculations.

set_atoms(atoms=None)[source]

Set the atoms used for superposition and RMSD calculations.

Parameters:

atoms (list[str]) – The atoms used for superposition.

Return type:

None

set_jobstarter(jobstarter)[source]

Set the jobstarter configuration for the MotifRMSD runner.

Parameters:

jobstarter (JobStarter) – The jobstarter configuration.

Raises:

ValueError – If jobstarter is not of type JobStarter.

Return type:

None

set_ref_col(col)[source]

Set the reference column for RMSD calculations.

Parameters:

col (str) – The reference column name.

Return type:

None

set_ref_motif(motif)[source]

Set the reference motif. motif must be a string and should name a column in the poses.df of the Poses object that is passed to run().

Parameters:

motif (str)

Return type:

None

set_return_superimposed_poses(return_superimposed_poses)[source]

Set whether superimposed poses should be returned. return_superimposed_poses must be a bool.

Parameters:

return_superimposed_poses (bool)

Return type:

None

set_target_motif(motif)[source]

Set the target motif. motif must be a string and should name a column in the poses.df of the Poses object that is passed to run().

Parameters:

motif (str)

Return type:

None

setup_input_dict(poses, ref_col, ref_motif=None, target_motif=None)[source]

Set up the input dictionary for motif RMSD calculations.

This method prepares a dictionary that can be written to a JSON file and used as input for the motif RMSD calculation script. The dictionary contains mappings of poses to reference PDB files, target motifs, and reference motifs.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • ref_col (str) – The reference column for RMSD calculations.

  • ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.

  • target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.

Returns:

A dictionary structured for input to the motif RMSD calculation script.

Return type:

dict

Raises:

TypeError – If ref_motif or target_motif is not of the expected type.

Examples

Here is an example of how to use the setup_input_dict method:

from protflow.metrics.rmsd import MotifRMSD
from protflow.poses import Poses

# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()

# Create a Poses object
poses = Poses()

# Set up the input dictionary for RMSD calculations
input_dict = motif_rmsd.setup_input_dict(
    poses=poses,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B"
)

# Print the input dictionary
print(input_dict)

Further Details:
  • Dictionary Structure: The input dictionary maps each pose to its reference PDB file, target motif, and reference motif.

  • Parameter Handling: The method handles different types of inputs for motifs, ensuring that they are correctly formatted for the RMSD calculation script.

  • Integration: The input dictionary prepared by this method is used by the run method to execute motif RMSD calculations.
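
The dictionary structure described above can be illustrated with plain Python. Everything named here is an assumption made for the sketch: the keys `ref_pdb`, `target_motif`, and `reference_motif`, the residue-list encoding of a motif, and the idea that the motif arguments are column names resolved per pose row. The real script input may use a different layout.

```python
import json

def build_input_dict(rows, ref_col, ref_motif, target_motif):
    """Map each pose to its reference file and to the motifs stored in its row."""
    input_dict = {}
    for row in rows:
        input_dict[row["poses"]] = {
            "ref_pdb": row[ref_col],
            "target_motif": row[target_motif],
            "reference_motif": row[ref_motif],
        }
    return input_dict

# One hypothetical pose row; motifs are encoded as lists of residue identifiers.
rows = [
    {
        "poses": "design_1.pdb",
        "reference": "template.pdb",
        "motif_A": ["A10", "A11"],
        "motif_B": ["B5", "B6"],
    },
]
input_dict = build_input_dict(
    rows, ref_col="reference", ref_motif="motif_A", target_motif="motif_B"
)
print(json.dumps(input_dict, indent=2))
```

Because the result contains only strings, lists, and dictionaries, it can be written to a JSON file without further conversion.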

class protflow.metrics.rmsd.MotifSeparateSuperpositionRMSD(ref_col=None, super_target_motif=None, super_ref_motif=None, super_atoms=None, rmsd_target_motif=None, rmsd_ref_motif=None, rmsd_atoms=None, super_include_het_atoms=False, rmsd_include_het_atoms=False, jobstarter=None, overwrite=False)[source]

Bases: Runner

MotifSeparateSuperpositionRMSD Class

The MotifSeparateSuperpositionRMSD class is a specialized class designed to facilitate the separate superposition and calculation of RMSD values for specific motifs within protein structures in the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with motif-specific superposition and RMSD calculations.

Detailed Description

The MotifSeparateSuperpositionRMSD class manages all aspects of superpositioning on one motif and calculating RMSD for another within protein structures. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to motif RMSD calculation scripts and Python executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of RMSD commands with support for various motifs and chains.

  • Collecting and processing output data into a pandas DataFrame.

  • Managing overwrite options and handling existing score files.

Returns:

An instance of the MotifSeparateSuperpositionRMSD class, configured to run motif RMSD calculations and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • TypeError

Examples

Here is an example of how to initialize and use the MotifSeparateSuperpositionRMSD class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifSeparateSuperpositionRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()

# Superimpose on one motif pair and calculate the RMSD over another
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=["CA"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the motif RMSD calculation process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The MotifSeparateSuperpositionRMSD class is intended for researchers and developers who need to perform RMSD calculations for specific motifs as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
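
The idea of fitting on one motif and scoring another can be shown with a deliberately simplified two-dimensional toy. Real structures are three-dimensional and are typically superimposed with an SVD-based method such as the one in BioPython; the closed-form 2D rotation below is swapped in only to keep the sketch dependency-free. All names and coordinates are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def fit_transform(mobile, reference):
    """Return a function applying the rigid 2D transform that best maps mobile onto reference."""
    (mx, my), (rx, ry) = centroid(mobile), centroid(reference)
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(mobile, reference):
        px, py, qx, qy = px - mx, py - my, qx - rx, qy - ry
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)  # least-squares optimal rotation angle in 2D
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def apply(points):
        return [
            (cos_t * (x - mx) - sin_t * (y - my) + rx,
             sin_t * (x - mx) + cos_t * (y - my) + ry)
            for x, y in points
        ]

    return apply

def rmsd(a, b):
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))

# Reference: one motif used for fitting, another one used for scoring.
ref_super = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ref_score = [(2.0, 2.0), (3.0, 2.0)]

# Target: the reference rotated by 90 degrees and shifted by (5, -3);
# afterwards the second scored atom was displaced by 1 along x.
target_super = [(5.0, -3.0), (5.0, -2.0), (4.0, -3.0)]
target_score = [(3.0, -1.0), (4.0, 0.0)]

superimpose = fit_transform(target_super, ref_super)
print(rmsd(superimpose(target_super), ref_super))  # fit quality on the superposition motif
print(rmsd(superimpose(target_score), ref_score))  # deviation of the scored motif
```

The superposition motif fits essentially perfectly, while the scored motif retains a deviation of about 0.71 (one of its two atoms is off by one unit). This is the kind of displacement relative to a fixed frame that a separate superposition is meant to expose.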

__init__(ref_col=None, super_target_motif=None, super_ref_motif=None, super_atoms=None, rmsd_target_motif=None, rmsd_ref_motif=None, rmsd_atoms=None, super_include_het_atoms=False, rmsd_include_het_atoms=False, jobstarter=None, overwrite=False)[source]

Initialize the MotifSeparateSuperpositionRMSD class.

This constructor sets up the MotifSeparateSuperpositionRMSD instance with default or provided parameters. It configures the reference column, superposition target motif, superposition reference motif, superposition atoms, RMSD target motif, RMSD reference motif, RMSD atoms, inclusion of heteroatoms, jobstarter, and overwrite options for RMSD calculations.

Parameters:
  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • super_target_motif (str, optional) – The target motif for superpositioning. Defaults to None.

  • super_ref_motif (str, optional) – The reference motif for superpositioning. Defaults to None.

  • super_atoms (list, optional) – The atom names for superpositioning. Defaults to None.

  • super_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) in superpositioning. Defaults to False.

  • rmsd_target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.

  • rmsd_ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.

  • rmsd_atoms (list, optional) – The atom names for RMSD calculations. Defaults to None.

  • rmsd_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) for RMSD calculations. Defaults to False.

  • jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

None

Examples

Here is an example of how to initialize the MotifSeparateSuperpositionRMSD class:

from rmsd import MotifSeparateSuperpositionRMSD

# Initialize the MotifSeparateSuperpositionRMSD class with default parameters
motif_rmsd = MotifSeparateSuperpositionRMSD()

# Initialize the MotifSeparateSuperpositionRMSD class with custom parameters
motif_rmsd = MotifSeparateSuperpositionRMSD(
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=None,
    rmsd_include_het_atoms=True,
    overwrite=True
)
Further Details:
  • Default Values: If no parameters are provided, the class initializes with default values suitable for basic motif-specific RMSD calculations.

  • Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.

  • Custom Configuration: Users can customize the motif RMSD calculation process by providing specific values for the reference column, the superposition and RMSD motifs (target and reference), the atom lists, heteroatom inclusion, the jobstarter, and the overwrite option.

run(poses, prefix, jobstarter=None, ref_col=None, super_ref_motif=None, super_target_motif=None, super_atoms=None, rmsd_ref_motif=None, rmsd_target_motif=None, rmsd_atoms=None, rmsd_include_het_atoms=False, super_include_het_atoms=False, overwrite=False)[source]

Superimpose on one motif and calculate the RMSD on another for the given poses and jobstarter configuration.

This method sets up and runs the motif-specific superposition and RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

  • ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.

  • super_target_motif (str, optional) – The target motif for superpositioning. Defaults to None.

  • super_ref_motif (str, optional) – The reference motif for superpositioning. Defaults to None.

  • super_atoms (list, optional) – The atom names for superpositioning. Defaults to None.

  • super_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) in superpositioning. Defaults to False.

  • rmsd_target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.

  • rmsd_ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.

  • rmsd_atoms (list, optional) – The atom names for RMSD calculations. Defaults to None.

  • rmsd_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) for RMSD calculations. Defaults to False.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

Returns:

An instance of the Poses class, containing the processed poses and results of the RMSD calculation.

Return type:

Poses

Raises:
  • FileNotFoundError – If required files or directories are not found during the execution process.

  • ValueError – If invalid arguments are provided to the method.

  • TypeError – If motifs or atoms are not of the expected type.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from rmsd import MotifSeparateSuperpositionRMSD

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()

# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=["CA"],
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.

  • Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running motif-specific RMSD calculations.

  • Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the motif RMSD calculation process to their specific needs, including specifying reference and target motifs, as well as atoms for RMSD calculations.

This method is designed to streamline the execution of motif-specific RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze motif-specific RMSD calculations.
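
The splitting of poses into sublists and the later merging of per-job results can be sketched as follows. This is an illustration only: the helper split_list, the row keys "location" and "rmsd", and the pose file names are hypothetical and not taken from ProtFlow.

```python
def split_list(items, n_chunks):
    """Split items into at most n_chunks contiguous sublists of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


pose_paths = [f"pose_{i:04d}.pdb" for i in range(7)]
sublists = split_list(pose_paths, n_chunks=3)

# Stand-in for one job per sublist; each "job" yields one score row per pose.
per_job_rows = [[{"location": p, "rmsd": None} for p in sub] for sub in sublists]

# Merge the per-job results back into a single table-like list.
all_rows = [row for rows in per_job_rows for row in rows]
print([len(sub) for sub in sublists], len(all_rows))
```

Contiguous chunks keep the merged rows in the original pose order, which makes it straightforward to join the scores back onto the poses.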

set_jobstarter(jobstarter)[source]

Set the jobstarter configuration for the MotifSeparateSuperpositionRMSD runner.

Parameters:

jobstarter (JobStarter) – The jobstarter configuration.

Raises:

ValueError – If jobstarter is not of type JobStarter.

Return type:

None

set_ref_col(col)[source]

Set the reference column for RMSD calculations.

Parameters:

col (str) – The reference column name.

Return type:

None

set_rmsd_atoms(atoms=None)[source]

Set the atoms used for RMSD calculations.

Parameters:

atoms (list[str]) – The atoms used for RMSD calculations.

Return type:

None

set_rmsd_include_het_atoms(include_het_atoms)[source]

Set whether heteroatoms (e.g. from ligands) are included in RMSD calculations.

Parameters:

include_het_atoms (bool)

Return type:

None

set_rmsd_ref_motif(motif)[source]

Set the reference motif for RMSD calculations. motif must be a string naming a column in poses.df of the Poses object that will be passed to run().

Parameters:

motif (str)

Return type:

None

set_rmsd_target_motif(motif)[source]

Set the target motif for RMSD calculations. motif must be a string naming a column in poses.df of the Poses object that will be passed to run().

Parameters:

motif (str)

Return type:

None

set_super_atoms(atoms=None)[source]

Set the atoms used for superposition.

Parameters:

atoms (list[str]) – The atoms used for superposition.

Return type:

None

set_super_include_het_atoms(include_het_atoms)[source]

Set whether heteroatoms (e.g. from ligands) are included in superpositioning.

Parameters:

include_het_atoms (bool)

Return type:

None

set_super_ref_motif(motif)[source]

Set the reference motif for superpositioning. motif must be a string naming a column in poses.df of the Poses object that will be passed to run().

Parameters:

motif (str)

Return type:

None

set_super_target_motif(motif)[source]

Set the target motif for superpositioning. motif must be a string naming a column in poses.df of the Poses object that will be passed to run().

Parameters:

motif (str)

Return type:

None

setup_input_dict(poses, ref_col, ref_motif=None, target_motif=None, rmsd_ref_motif=None, rmsd_target_motif=None)[source]

Set up the input dictionary for motif RMSD calculations.

This method prepares a dictionary that can be written to a JSON file and used as input for the motif RMSD calculation script. The dictionary contains mappings of poses to reference PDB files, target motifs, and reference motifs.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • ref_col (str) – The reference column for RMSD calculations.

  • ref_motif (Any, optional) – The reference motif for superposition. Defaults to None.

  • target_motif (Any, optional) – The target motif for superposition. Defaults to None.

  • rmsd_ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.

  • rmsd_target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.

Returns:

A dictionary structured for input to the motif RMSD calculation script.

Return type:

dict

Raises:

TypeError – If ref_motif or target_motif is not of the expected type.

Examples

Here is an example of how to use the setup_input_dict method:

from rmsd import MotifSeparateSuperpositionRMSD
from protflow.poses import Poses

# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()

# Create a Poses object
poses = Poses()

# Set up the input dictionary for RMSD calculations
input_dict = motif_rmsd.setup_input_dict(
    poses=poses,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B"
)

# Print the input dictionary
print(input_dict)
Further Details:
  • Dictionary Structure: The input dictionary maps each pose to its reference PDB file, target motif, and reference motif.

  • Parameter Handling: The method handles different types of inputs for motifs, ensuring that they are correctly formatted for the RMSD calculation script.

  • Integration: The input dictionary prepared by this method is used by the run method to execute motif RMSD calculations.
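
To make the described mapping more concrete, the following sketch builds such a dictionary from plain Python data and serializes it to JSON. The key names ("ref_pdb", "super_reference_motif", and so on), the "poses" column, and the motif strings are assumptions chosen for illustration; the actual layout produced by setup_input_dict may differ.

```python
import json


def build_input_dict(rows, ref_col, ref_motif, target_motif,
                     rmsd_ref_motif, rmsd_target_motif):
    """Map each pose path to its reference file and motif definitions.

    rows is a list of dicts standing in for the rows of poses.df.
    """
    return {
        row["poses"]: {
            "ref_pdb": row[ref_col],
            "super_reference_motif": row[ref_motif],
            "super_target_motif": row[target_motif],
            "rmsd_reference_motif": row[rmsd_ref_motif],
            "rmsd_target_motif": row[rmsd_target_motif],
        }
        for row in rows
    }


# Hypothetical pose table with one row.
rows = [{
    "poses": "pose_0001.pdb",
    "reference": "ref_0001.pdb",
    "motif_A": "A10,A11,A12",
    "motif_B": "A5,A6,A7",
    "motif_C": "B1",
    "motif_D": "B1",
}]

input_dict = build_input_dict(
    rows, "reference", "motif_A", "motif_B", "motif_C", "motif_D"
)
payload = json.dumps(input_dict, indent=2)
print(payload)
```

Keeping every value JSON-serializable means the dictionary can be written to disk unchanged and read by a separate calculation script.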

Parameters:
  • ref_col (str)

  • super_target_motif (str)

  • super_ref_motif (str)

  • super_atoms (list[str])

  • rmsd_target_motif (str)

  • rmsd_ref_motif (str)

  • rmsd_atoms (list[str])

  • super_include_het_atoms (bool)

  • rmsd_include_het_atoms (bool)

  • jobstarter (JobStarter)

  • overwrite (bool)

protflow.metrics.tmscore module

TMscore Module

This module provides the functionality to integrate TMscore calculations within the ProtFlow framework. It offers tools to run TMscore and TMalign, handle their inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description

The TMalign and TMscore classes encapsulate the functionality necessary to execute TM-align and TM-score runs, respectively. These classes manage the configuration of paths to essential scripts and Python executables, set up the environment, and handle the execution of scoring processes. They include methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of TM-align and TM-score into larger computational workflows. It supports the automatic setup of job parameters, execution of TM-align/TM-score commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage

To use this module, create an instance of the TMalign or TMscore class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the scoring process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples

Here is an example of how to initialize and use the TMalign class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMalign class
tmalign = TMalign()

# Run the alignment process
results = tmalign.run(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_pdb",
    sc_tm_score=True,
    options="-a",
    pose_options=["-b"],
    overwrite=True
)

# Access and process the results
print(results)

Here is an example of how to initialize and use the TMscore class within a ProtFlow pipeline:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMscore

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMscore class
tmscore = TMscore()

# Run the scoring process
results = tmscore.run(
    poses=poses,
    prefix="experiment_2",
    ref_col="reference_pdb",
    options="-c",
    pose_options=["-d"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the scoring process.

  • Customizability: Users can customize the scoring process through multiple parameters, including specific options for the TM-align or TM-score scripts, and options for handling pose-specific parameters.

  • Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
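
The overwrite behaviour mentioned under Edge Cases can be pictured with a small sketch. It is an assumption about the general pattern, not ProtFlow's code: the helper load_or_compute, the JSON score file, its file name, and the score keys are all hypothetical. Existing scores are reused unless overwrite is set.

```python
import json
import os
import tempfile


def load_or_compute(scorefile, compute, overwrite=False):
    """Return (scores, recomputed); reuse an existing score file unless overwrite is True."""
    if os.path.isfile(scorefile) and not overwrite:
        with open(scorefile, encoding="utf-8") as handle:
            return json.load(handle), False
    scores = compute()
    with open(scorefile, "w", encoding="utf-8") as handle:
        json.dump(scores, handle)
    return scores, True


def fake_scoring():
    # Stand-in for an expensive TM-align or TM-score run.
    return [{"description": "pose_0001", "TM_score_ref": 0.87}]


with tempfile.TemporaryDirectory() as workdir:
    scorefile = os.path.join(workdir, "experiment_1_scores.json")
    first = load_or_compute(scorefile, fake_scoring)
    second = load_or_compute(scorefile, fake_scoring)
    forced = load_or_compute(scorefile, fake_scoring, overwrite=True)

print(first[1], second[1], forced[1])
```

The second call finds the score file and skips the computation; only the call with overwrite=True runs the scoring again.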

This module is intended for researchers and developers who need to incorporate TM-align or TM-score into their protein structure comparison and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.metrics.tmscore.TMalign(jobstarter=None, application=None)[source]

Bases: Runner

TMalign Class

The TMalign class is a specialized class designed to facilitate the execution of TMalign within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with TMalign processes.

Detailed Description

The TMalign class manages all aspects of running TMalign. It handles the configuration of necessary scripts and executables, prepares the environment for alignment processes, and executes the alignment commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to TMalign executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of TMalign commands with support for various alignment options.

  • Collecting and processing output data into a pandas DataFrame.

  • Normalizing TM scores based on the reference structure and calculating self-consistency scores.

Returns:

An instance of the TMalign class, configured to run TMalign processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • RuntimeError

Examples

Here is an example of how to initialize and use the TMalign class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMalign class
tmalign = TMalign()

# Run the alignment process
results = tmalign.run(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_pdb",
    sc_tm_score=True,
    options="-a",
    pose_options=["-b"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the alignment process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

Difference Between TMscore and TMalign

  • TMscore: This class runs the TM-score program, which superimposes two structures using a predefined residue correspondence (typically matching residue numbering) and reports the resulting TM-score. It is suitable for comparing structures of the same protein in a length-normalized manner, when no new structural alignment is needed.

  • TMalign: This class runs TM-align, which first derives a sequence-independent structural alignment and then reports the TM-score of the optimal superposition. It is used when the residue correspondence between the two structures is not known in advance, for example when comparing proteins with different sequences.
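
The length normalization behind the TM-score can be written out in a few lines. The sketch below evaluates the standard TM-score formula for a fixed superposition, given the distances between aligned residue pairs; the function names are assumptions, and the maximization over superpositions that the TM-score and TM-align programs perform is not reproduced here. Very short chains, for which the distance scale is not meaningful, are rejected in this sketch.

```python
def d0(length):
    """Length-dependent distance scale of the TM-score, in Angstrom."""
    if length <= 21:
        raise ValueError("this sketch only handles chains longer than 21 residues")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-score of one fixed superposition.

    distances: distances (Angstrom) between aligned residue pairs.
    norm_length: length of the structure used for normalization.
    """
    scale = d0(norm_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_length


# Hypothetical alignment: 80 residues match exactly, 20 deviate by 4 Angstrom.
distances = [0.0] * 80 + [4.0] * 20
score_by_100 = tm_score(distances, norm_length=100)
score_by_200 = tm_score(distances, norm_length=200)
print(round(score_by_100, 3), round(score_by_200, 3))
```

The same set of aligned distances yields a lower score when normalized by the longer structure, which is why the choice of normalization length matters when scores are compared.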

The TMalign class is intended for researchers and developers who need to perform TMalign alignments as part of their protein structure comparison and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(jobstarter=None, application=None)[source]

Initialize the TMalign class with optional jobstarter and application path.

This method sets up the TMalign class by configuring the jobstarter and the path to the TMalign executable. It ensures that the necessary components are ready for executing TMalign processes.

Parameters:
  • jobstarter (JobStarter, optional) – An optional jobstarter configuration. Defaults to None.

  • application (str, optional) – Path to the TMalign executable. If not provided, it defaults to the TMalign executable in the ProtFlow environment.

Raises:

ValueError – If the TMalign executable is not found in the specified environment.

Examples

Here is an example of how to initialize the TMalign class:

from protflow.jobstarters import LocalJobStarter
from tmscore import TMalign

# Initialize the TMalign class
tmalign = TMalign(
    jobstarter=LocalJobStarter(max_cores=4),
    application="/path/to/TMalign"
)

# Check the instance
print(tmalign)
Further Details:
  • Jobstarter Configuration: This parameter allows setting up the jobstarter for managing job execution.

  • Application Path: This parameter sets the path to the TMalign executable, ensuring the correct executable is used for alignment processes.

This method is designed to prepare the TMalign class for executing TMalign processes, ensuring that all necessary configurations are in place.

collect_scores(output_dir)[source]

Collect scores from TMalign output files.

This method collects and processes the scores from the output files generated by TMalign. It reads the scores, extracts relevant information, and organizes the data into a structured pandas DataFrame.

Parameters:

output_dir (str) – The directory where TMalign output files are located.

Returns:

A DataFrame containing the collected scores.

Return type:

pd.DataFrame

Raises:

RuntimeError – If no TM scores are found in the output files.

Examples

Here is an example of how to use the collect_scores method:

from tmscore import TMalign

# Initialize the TMalign class
tmalign = TMalign()

# Collect scores
scores_df = tmalign.collect_scores(
    output_dir="output/"
)

# Print the scores DataFrame
print(scores_df)
Further Details:
  • Score Extraction: The method reads the output files, extracts relevant scores, and organizes them into a pandas DataFrame.

  • Validation: Ensures that the scores are correctly extracted and that no errors occurred during the process.

This method is designed to streamline the collection and processing of scores from TMalign output files, ensuring that all relevant data is accurately captured and organized.
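
Score extraction of this kind usually comes down to matching the score lines in the program's text output. The sketch below is an assumption about how that might look, not the code of collect_scores: the sample text is a hand-written imitation of TM-align output, and the real format may vary between program versions.

```python
import re

# Matches lines such as "TM-score= 0.78906 (if normalized by ...)".
TM_LINE = re.compile(r"^TM-score=\s*([0-9.]+)", re.MULTILINE)

sample_output = """\
Aligned length=  140, RMSD=   2.30, Seq_ID=n_identical/n_aligned= 0.150
TM-score= 0.78906 (if normalized by length of Chain_1, i.e., LN=154, d0=4.60)
TM-score= 0.81234 (if normalized by length of Chain_2, i.e., LN=146, d0=4.48)
"""


def extract_tm_scores(text):
    """Return all TM-scores found in the text, in order of appearance."""
    scores = [float(value) for value in TM_LINE.findall(text)]
    if not scores:
        # Mirrors the documented behaviour of raising when nothing is found.
        raise RuntimeError("no TM scores found in output")
    return scores


scores = extract_tm_scores(sample_output)
print(scores)
```

Raising an error on empty matches, as the method is documented to do, surfaces failed runs early instead of silently producing an empty table.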

prep_ref(ref, poses)[source]

Prepare the reference structures for TMalign.

This method prepares the reference structures for alignment based on the provided reference column or specific PDB file. It ensures that the references are correctly formatted for the TMalign process.

Parameters:
  • ref (str) – The reference structure, either as a path to a PDB file or as a column name in the Poses DataFrame.

  • poses (Poses) – The Poses object containing the protein structures.

Returns:

A list of reference paths for each pose.

Return type:

list[str]

Raises:

ValueError – If the ref parameter is not a string or if the reference column is missing from the Poses DataFrame.

Examples

Here is an example of how to use the prep_ref method:

from protflow.poses import Poses
from tmscore import TMalign

# Create instances of necessary classes
poses = Poses()
tmalign = TMalign()

# Prepare reference structures
ref_list = tmalign.prep_ref(
    ref="reference_pdb",
    poses=poses
)

# Print the reference list
print(ref_list)
Further Details:
  • Reference Handling: The method can handle both a single PDB file and a column name referring to multiple PDB files within the Poses DataFrame.

  • Validation: Ensures that the provided reference is valid and exists in the Poses DataFrame if specified as a column name.

This method is designed to streamline the preparation of reference structures for TMalign processes, ensuring that all references are correctly formatted and validated.
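
The two accepted forms of ref can be illustrated with plain Python data. This is a simplified sketch under stated assumptions: resolve_references is a hypothetical helper, a list of dicts stands in for poses.df, and the check for a single file is reduced to a ".pdb" suffix test, which the real method need not use.

```python
def resolve_references(ref, rows):
    """Return one reference path per pose row.

    rows is a list of dicts standing in for the rows of poses.df.
    """
    if not isinstance(ref, str):
        raise ValueError("ref must be a string")
    if rows and all(ref in row for row in rows):
        # ref names a column: take the per-pose reference from each row.
        return [row[ref] for row in rows]
    if ref.endswith(".pdb"):
        # ref is a single file: every pose shares the same reference.
        return [ref] * len(rows)
    raise ValueError(f"{ref!r} is neither a column name nor a .pdb path")


rows = [
    {"poses": "pose_0001.pdb", "reference_pdb": "ref_0001.pdb"},
    {"poses": "pose_0002.pdb", "reference_pdb": "ref_0002.pdb"},
]
per_pose_refs = resolve_references("reference_pdb", rows)
shared_refs = resolve_references("native.pdb", rows)
print(per_pose_refs, shared_refs)
```

Either way the result has one entry per pose, so downstream code can pair poses and references by position.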

run(poses, prefix, ref_col, sc_tm_score=True, options=None, pose_options=None, overwrite=False, jobstarter=None)[source]

Execute the TMalign process with given poses and jobstarter configuration.

This method sets up and runs the TMalign process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • ref_col (str) – Column containing paths to PDB files used as reference for TM score calculation. Can also be a path to a single reference .pdb file.

  • sc_tm_score (bool, optional) – If True, calculates the self-consistency TM score for each backbone in ref_col and adds it into the column {prefix}_sc_tm. Defaults to True.

  • options (str, optional) – Additional command-line options for the TMalign script. Defaults to None.

  • pose_options (str, optional) – Name of poses.df column containing options for TMalign. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the TMalign process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If the TMalign executable is not found in the specified environment.

  • ValueError – If invalid arguments are provided to the method or if required reference columns are missing.

  • RuntimeError – If no TM scores are found in the output files.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMalign class
tmalign = TMalign()

# Run the alignment process
results = tmalign.run(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_pdb",
    sc_tm_score=True,
    options="-a",
    pose_options=["-b"],
    overwrite=True
)

# Access and process the results
print(results)
Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Reference Preparation: The method prepares the reference structures for alignment based on the provided reference column or specific PDB file.

  • Output Management: The method handles the collection and processing of output data, including merging and normalizing TM scores, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the alignment process to their specific needs.

This method is designed to streamline the execution of TMalign processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein structure alignments.

write_cmd(pose_path, ref_path, output_dir, options=None, pose_options=None)[source]

Write the command to run TMalign.

This method constructs the command to execute TMalign based on the provided parameters. It formats the options and flags correctly and sets up the command to be run in the environment.

Parameters:
  • pose_path (str) – The path to the pose file.

  • ref_path (str) – The path to the reference file.

  • output_dir (str) – The directory where output files will be saved.

  • options (str, optional) – Additional command-line options for TMalign. Defaults to None.

  • pose_options (str, optional) – Pose-specific options for TMalign. Defaults to None.

Returns:

The constructed command string to run TMalign.

Return type:

str

Examples

Here is an example of how to use the write_cmd method:

from tmscore import TMalign

# Initialize the TMalign class
tmalign = TMalign()

# Write the command
cmd = tmalign.write_cmd(
    pose_path="pose.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
    pose_options="-b"
)

# Print the command
print(cmd)
Further Details:
  • Command Construction: The method constructs the command string by parsing and formatting the provided options and pose-specific options.

  • Output Management: Ensures that the output files are correctly named and saved in the specified directory.

This method is designed to streamline the construction of commands for TMalign processes, ensuring that all necessary options are correctly formatted and included.
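
A command of this kind can be assembled as in the sketch below. It is an illustration under assumptions, not the body of write_cmd: the helper build_tmalign_cmd, the argument order, the ".tmout" file extension, and the redirection of standard output to a per-pose file are all hypothetical choices.

```python
import shlex
from pathlib import PurePosixPath


def build_tmalign_cmd(application, pose_path, ref_path, output_dir,
                      options=None, pose_options=None):
    """Assemble a shell command that writes the program output to a per-pose file."""
    parts = [application, pose_path, ref_path]
    for opts in (options, pose_options):
        if opts:
            # Split option strings the way a shell would.
            parts.extend(shlex.split(opts))
    stem = PurePosixPath(pose_path).stem
    scorefile = str(PurePosixPath(output_dir) / f"{stem}.tmout")
    return f"{shlex.join(parts)} > {shlex.quote(scorefile)}"


cmd = build_tmalign_cmd(
    application="/path/to/TMalign",
    pose_path="pose.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
    pose_options="-b",
)
print(cmd)
```

Quoting each argument guards against paths with spaces, and naming the output after the pose makes it possible to match score files back to poses during collection.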

class protflow.metrics.tmscore.TMscore(jobstarter=None, application=None)[source]

Bases: Runner

TMscore Class

The TMscore class is a specialized class designed to facilitate the execution of TMscore within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with TMscore processes.

Detailed Description

The TMscore class manages all aspects of running TMscore. It handles the configuration of necessary scripts and executables, prepares the environment for scoring processes, and executes the scoring commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.

Key functionalities include:
  • Setting up paths to TMscore executables.

  • Configuring job starter options, either automatically or manually.

  • Handling the execution of TMscore commands with support for various scoring options.

  • Collecting and processing output data into a pandas DataFrame.

Difference Between TMscore and TMalign

  • TMscore: This class runs the TM-score program, which superimposes two structures using a predefined residue correspondence (typically matching residue numbering) and reports the resulting TM-score. It is suitable for comparing structures of the same protein in a length-normalized manner, when no new structural alignment is needed.

  • TMalign: This class runs TM-align, which first derives a sequence-independent structural alignment and then reports the TM-score of the optimal superposition. It is used when the residue correspondence between the two structures is not known in advance, for example when comparing proteins with different sequences.

Returns:

An instance of the TMscore class, configured to run TMscore processes and handle outputs efficiently.

Raises:
  • FileNotFoundError

  • ValueError

  • RuntimeError

Examples

Here is an example of how to initialize and use the TMscore class:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMscore

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMscore class
tmscore = TMscore()

# Run the scoring process
results = tmscore.run(
    poses=poses,
    prefix="experiment_2",
    ref_col="reference_pdb",
    options="-c",
    pose_options=["-d"],
    overwrite=True
)

# Access and process the results
print(results)

Further Details

  • Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.

  • Customization: The class provides extensive customization options through its parameters, allowing users to tailor the scoring process to their specific needs.

  • Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.

The TMscore class is intended for researchers and developers who need to perform TMscore calculations as part of their protein structure comparison and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.

__init__(jobstarter=None, application=None)[source]

Initialize the TMscore class with optional jobstarter and application path.

This method sets up the TMscore class by configuring the jobstarter and the path to the TMscore executable. It ensures that the necessary components are ready for executing TMscore processes.

Parameters:
  • jobstarter (JobStarter, optional) – An optional jobstarter configuration. Defaults to None.

  • application (str, optional) – Path to the TMscore executable. If not provided, it defaults to the TMscore executable in the ProtFlow environment.

Examples

Here is an example of how to initialize the TMscore class:

from protflow.jobstarters import LocalJobStarter
from tmscore import TMscore

# Initialize the TMscore class
tmscore = TMscore(
    jobstarter=LocalJobStarter(max_cores=4),
    application="/path/to/TMscore"
)

# Check the instance
print(tmscore)
Further Details:
  • Jobstarter Configuration: This parameter allows setting up the jobstarter for managing job execution.

  • Application Path: This parameter sets the path to the TMscore executable, ensuring the correct executable is used for scoring processes.

This method is designed to prepare the TMscore class for executing TMscore processes, ensuring that all necessary configurations are in place.

collect_scores(output_dir)[source]

Collect scores from TMscore output files.

This method collects and processes the scores from the output files generated by TMscore. It reads the scores, extracts relevant information, and organizes the data into a structured pandas DataFrame.

Parameters:

output_dir (str) – The directory where TMscore output files are located.

Returns:

A DataFrame containing the collected scores.

Return type:

pd.DataFrame

Raises:

RuntimeError – If no TM scores are found in the output files.

Examples

Here is an example of how to use the collect_scores method:

from protflow.metrics.tmscore import TMscore

# Initialize the TMscore class
tmscore = TMscore()

# Collect scores
scores_df = tmscore.collect_scores(
    output_dir="output/"
)

# Print the scores DataFrame
print(scores_df)

Further Details:
  • Score Extraction: The method reads the output files, extracts relevant scores, and organizes them into a pandas DataFrame.

  • Validation: Raises a RuntimeError if no TM scores are found in the output files.

This method is designed to streamline the collection and processing of scores from TMscore output files, ensuring that all relevant data is accurately captured and organized.
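The score extraction step can be sketched as below. This is a simplified stand-in, not the library's code. It assumes that each output file contains a line of the form "TM-score    = 0.4613  (d0= 4.05)", that output files carry a hypothetical ".tmout" suffix, and that the record keys "description" and "TM_score" are acceptable column names. It returns plain records that could be passed to pandas.DataFrame(records).

```python
import re
from pathlib import Path

# Assumed line format in the text output, e.g. "TM-score    = 0.4613  (d0= 4.05)".
TM_SCORE_RE = re.compile(r"^TM-score\s*=\s*([0-9]*\.?[0-9]+)", re.MULTILINE)


def collect_tm_scores(output_dir, pattern="*.tmout"):
    """Return one record per output file that contains a TM-score line."""
    records = []
    for path in sorted(Path(output_dir).glob(pattern)):
        match = TM_SCORE_RE.search(path.read_text())
        if match:
            # Key names are illustrative, not ProtFlow's actual column names.
            records.append(
                {"description": path.stem, "TM_score": float(match.group(1))}
            )
    if not records:
        # Mirrors the documented behaviour: no scores at all is an error.
        raise RuntimeError(f"No TM scores found in {output_dir}")
    return records
```

Files without a matching line are skipped rather than treated as errors; only a completely empty result raises.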

run(poses, prefix, ref_col, options=None, pose_options=None, overwrite=False, jobstarter=None)[source]

Execute the TMscore process with given poses and jobstarter configuration.

This method sets up and runs the TMscore process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

Parameters:
  • poses (Poses) – The Poses object containing the protein structures.

  • prefix (str) – A prefix used to name and organize the output files.

  • ref_col (str) – Column containing paths to PDB files used as reference for TM score calculation.

  • options (str, optional) – Additional command-line options for the TMscore script. Defaults to None.

  • pose_options (str, optional) – Name of poses.df column containing options for TMscore. Defaults to None.

  • overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.

Returns:

An instance of the RunnerOutput class, containing the processed poses and results of the TMscore process.

Return type:

RunnerOutput

Raises:
  • FileNotFoundError – If the TMscore executable is not found in the specified environment.

  • ValueError – If invalid arguments are provided to the method or if required reference columns are missing.

  • RuntimeError – If no TM scores are found in the output files.

Examples

Here is an example of how to use the run method:

from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.tmscore import TMscore

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the TMscore class
tmscore = TMscore()

# Run the scoring process
results = tmscore.run(
    poses=poses,
    prefix="experiment_2",
    ref_col="reference_pdb",
    options="-c",
    pose_options="tmscore_options",  # hypothetical poses.df column holding per-pose options
    overwrite=True,
    jobstarter=jobstarter
)

# Access and process the results
print(results)

Further Details:
  • Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.

  • Reference Handling: The method validates the reference column and prepares the reference structures for scoring.

  • Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.

  • Customization: Extensive customization options are provided through parameters, allowing users to tailor the scoring process to their specific needs.

This method is designed to streamline the execution of TMscore processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein structure comparisons.
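The reference handling described above (validating ref_col before any job is started) might look roughly like the sketch below. It is not taken from ProtFlow: the function name pair_poses_with_references is hypothetical, the "poses" key is assumed to hold the pose paths, and rows is a list of dictionaries standing in for the rows of poses.df (for example, the result of df.to_dict("records")).

```python
def pair_poses_with_references(rows, ref_col, pose_col="poses"):
    """Return (pose_path, ref_path) pairs or raise ValueError if ref_col is unusable."""
    pairs = []
    for i, row in enumerate(rows):
        # A missing column is a caller error, reported before any job runs.
        if ref_col not in row:
            raise ValueError(f"Reference column '{ref_col}' missing in row {i}")
        # An empty or None reference path cannot be scored against.
        if not row[ref_col]:
            raise ValueError(f"Empty reference path in row {i}")
        pairs.append((row[pose_col], row[ref_col]))
    return pairs
```

Each resulting pair corresponds to one TMscore invocation, so the length of the returned list equals the number of commands that would be handed to the jobstarter.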

write_cmd(pose_path, ref_path, output_dir, options=None, pose_options=None)[source]

Write the command to run TMscore.

This method constructs the command to execute TMscore based on the provided parameters. It formats the options and flags correctly and sets up the command to be run in the environment.

Parameters:
  • pose_path (str) – The path to the pose file.

  • ref_path (str) – The path to the reference file.

  • output_dir (str) – The directory where output files will be saved.

  • options (str, optional) – Additional command-line options for TMscore. Defaults to None.

  • pose_options (str, optional) – Pose-specific options for TMscore. Defaults to None.

Returns:

The constructed command string to run TMscore.

Return type:

str

Examples

Here is an example of how to use the write_cmd method:

from protflow.metrics.tmscore import TMscore

# Initialize the TMscore class
tmscore = TMscore()

# Write the command
cmd = tmscore.write_cmd(
    pose_path="pose.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
    pose_options="-b"
)

# Print the command
print(cmd)

Further Details:
  • Command Construction: The method constructs the command string by parsing and formatting the provided options and pose-specific options.

  • Output Management: Ensures that the output files are correctly named and saved in the specified directory.

This method is designed to streamline the construction of commands for TMscore processes, ensuring that all necessary options are correctly formatted and included.
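To make the command construction concrete, here is a minimal sketch of how such a command string could be assembled. It is an illustration under stated assumptions, not the body of write_cmd: the function name build_tmscore_cmd, the ".tmout" output suffix, the redirection of standard output into output_dir, and the simple concatenation of options and pose_options are all hypothetical. The only part taken from the TMscore program itself is the positional order of model and reference structure.

```python
import os
import shlex


def build_tmscore_cmd(application, pose_path, ref_path, output_dir,
                      options=None, pose_options=None):
    """Assemble a shell command string that scores pose_path against ref_path."""
    # Name the output file after the pose (the ".tmout" suffix is an assumption).
    stem = os.path.splitext(os.path.basename(pose_path))[0]
    out_file = os.path.join(output_dir, f"{stem}.tmout")
    # Split option strings into tokens; pose-specific options come last.
    extra = shlex.split(options or "") + shlex.split(pose_options or "")
    parts = [application, pose_path, ref_path, *extra]
    # Quote every token so paths with spaces survive the shell.
    return " ".join(shlex.quote(p) for p in parts) + " > " + shlex.quote(out_file)
```

With the arguments from the example above and application set to "TMscore", this sketch yields a string of the form "TMscore pose.pdb reference.pdb -a -b > output/pose.tmout". The real method may merge conflicting options rather than concatenate them.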

Module contents

protflow.metrics subpackage init