protflow.metrics package
Submodules
protflow.metrics.fpocket module
FPocket Module
This module provides the functionality to integrate FPocket within the ProtFlow framework. It offers tools to run FPocket, handle its inputs and outputs, and process the resulting data in a structured and automated manner.
Detailed Description
The FPocket class encapsulates the functionality necessary to execute FPocket runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of FPocket processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem. The module is designed to streamline the integration of FPocket into larger computational workflows. It supports the automatic setup of job parameters, execution of FPocket commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.
Usage
To use this module, create an instance of the FPocket class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the FPocket process is provided through various parameters, allowing for customized runs tailored to specific research needs.
Examples
Here is an example of how to initialize and use the FPocket class within a ProtFlow pipeline:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the FPocket class
fpocket = FPocket()
# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the FPocket process.
Customizability: Users can customize the FPocket process through multiple parameters, including specific options for the FPocket script and options for handling pose-specific parameters.
Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to incorporate FPocket into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Version
0.1.0
- class protflow.metrics.fpocket.FPocket(fpocket_path=None, jobstarter=None)[source]
Bases:
Runner
FPocket Class
The FPocket class is a specialized class designed to facilitate the execution of FPocket within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with FPocket processes.
Detailed Description
The FPocket class manages all aspects of running FPocket pocket-detection jobs. It handles the configuration of necessary scripts and executables, prepares the environment for pocket detection, and executes the FPocket commands. It then collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to FPocket scripts and executables.
Configuring job starter options, either automatically or manually.
Handling the execution of FPocket commands with support for multiple options and pose-specific parameters.
Collecting and processing output data into a pandas DataFrame.
Ensuring robust error handling and logging for easier debugging and verification of the FPocket process.
- Returns:
An instance of the FPocket class, configured to run FPocket processes and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
TypeError
Examples
Here is an example of how to initialize and use the FPocket class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the FPocket class
fpocket = FPocket()

# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)

# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the FPocket process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The FPocket class is intended for researchers and developers who need to run FPocket pocket detection as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
- __init__(fpocket_path=None, jobstarter=None)[source]
Initialize the FPocket class with the specified path and jobstarter configuration.
This constructor sets up the FPocket instance by configuring the path to the FPocket executable and initializing the jobstarter object. It ensures that the necessary components are in place for running FPocket processes.
- Parameters:
fpocket_path (str, optional) – The path to the FPocket executable. Defaults to the path specified in the ProtFlow configuration (FPOCKET_PATH).
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
- Returns:
An instance of the FPocket class, ready to run FPocket processes.
- Raises:
ValueError – If the fpocket_path is not provided or is invalid.
Examples
Here is an example of how to initialize the FPocket class:
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Initialize the FPocket class with default settings
fpocket = FPocket()

# Initialize the FPocket class with a specific jobstarter
jobstarter = JobStarter()
fpocket = FPocket(jobstarter=jobstarter)
- Further Details:
Path Configuration: Ensures the FPocket executable path is set correctly, raising an error if the path is not provided or invalid.
Job Management: Initializes the jobstarter object to manage the execution of FPocket commands, allowing for integration with job scheduling systems.
- index_layers = 0
- prep_fpocket_options(poses, options, pose_options)[source]
Prepare options for the FPocket process based on given parameters.
This method processes and prepares the options and pose-specific options for the FPocket run. It filters out forbidden options, merges general options with pose-specific options, and formats them for inclusion in the FPocket commands.
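- Parameters:
poses (Poses) – The Poses object containing the protein structures.
options – General options for the FPocket script, applied to every pose.
pose_options – Pose-specific options for the FPocket script.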
- Returns:
A list of formatted option strings for each pose, ready to be used in the FPocket commands.
- Return type:
list[str]
- Raises:
TypeError – If options or pose_options are not of the expected type.
Examples
Here is an example of how to use the prep_fpocket_options method:
from protflow.poses import Poses
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
fpocket = FPocket()

# Prepare FPocket options
options = "--some-option value"
pose_options = ["--specific-option value"]
prepared_options = fpocket.prep_fpocket_options(poses, options, pose_options)

# Output the prepared options
print(prepared_options)
- Further Details:
Option Processing: Merges general and pose-specific options, ensuring that forbidden options are removed and the final option strings are correctly formatted.
Customization: Allows for extensive customization of the FPocket process through both general and pose-specific options, providing flexibility in configuring FPocket runs.
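The merging step described above can be illustrated with a small standalone sketch. It is not the ProtFlow implementation: parse_options, merge_options, and the choice of "-f" as the forbidden option are assumptions made for illustration, and the flags are placeholder strings.

```python
import shlex


def parse_options(option_string):
    """Split an option string such as "-m 3.0 --flag" into a {name: value} dict."""
    parsed = {}
    tokens = shlex.split(option_string or "")
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if not name.startswith("-"):
            raise ValueError(f"expected an option name, got {name!r}")
        # A following token that does not start with "-" is taken as the value.
        # (Limitation of this sketch: negative numbers would be read as flags.)
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            parsed[name] = tokens[i + 1]
            i += 2
        else:
            parsed[name] = None  # bare flag without a value
            i += 1
    return parsed


def merge_options(options, pose_options, forbidden=("-f",)):
    """Merge general and pose-specific options; pose-specific values win."""
    merged = parse_options(options)
    merged.update(parse_options(pose_options))
    # Drop options the runner is assumed to set itself (hypothetical list).
    for name in forbidden:
        merged.pop(name, None)
    parts = []
    for name, value in merged.items():
        parts.append(name if value is None else f"{name} {shlex.quote(value)}")
    return " ".join(parts)


general = "-m 3.0 -i 30 -f ignored.pdb"
per_pose = ["-m 3.5", None]  # one entry per pose; None means no override
prepared = [merge_options(general, pose_opt) for pose_opt in per_pose]
print(prepared)
```

In this sketch the first pose overrides -m, the second keeps the general value, and the forbidden -f entry is removed from both.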
- run(poses, prefix, jobstarter=None, options=None, pose_options=None, return_full_scores=False, overwrite=False)[source]
Execute the FPocket process with given poses and jobstarter configuration.
This method sets up and runs the FPocket process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
prefix (str) – A prefix used to name and organize the output files.
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
options (str or list[str], optional) – Additional options for the FPocket script. Defaults to None.
pose_options (str or list[str], optional) – A list of pose-specific options for the FPocket script. Defaults to None.
return_full_scores (bool, optional) – If True, include detailed scores for each pocket in the output. Defaults to False.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
An updated Poses object containing the processed poses and results of the FPocket process.
- Return type:
Poses
- Raises:
FileNotFoundError – If required files or directories are not found during the execution process.
ValueError – If invalid arguments are provided to the method.
TypeError – If options or pose_options are not of the expected type.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.fpocket import FPocket

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the FPocket class
fpocket = FPocket()

# Run the FPocket process
results = fpocket.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    options="--some-option value",
    pose_options=["--specific-option value"],
    overwrite=True
)

# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It moves the poses to the working directory and compiles the FPocket commands for execution.
Output Management: The method handles the collection and processing of output data, ensuring that results are organized into a structured DataFrame. It includes the location of each pocket and integrates the results back into the Poses object.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the FPocket process to their specific needs, including the ability to specify additional FPocket options and pose-specific parameters.
This method is designed to streamline the execution of FPocket processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze pocket detection simulations.
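As a rough illustration of the setup step (poses moved into the working directory, then one command compiled per pose), the sketch below builds command strings only and does not run anything. build_fpocket_commands and all paths are hypothetical; the only fpocket detail assumed is that the input structure is passed with -f.

```python
import shlex
from pathlib import PurePosixPath


def build_fpocket_commands(fpocket_path, pose_paths, work_dir, option_strings):
    """Return one shell command per pose (hypothetical helper, not ProtFlow's)."""
    if len(pose_paths) != len(option_strings):
        raise ValueError("need exactly one option string per pose")
    commands = []
    for pose, opts in zip(pose_paths, option_strings):
        # The pose is assumed to have been copied into work_dir beforehand,
        # so the command points at the copy rather than the original file.
        local_pose = PurePosixPath(work_dir) / PurePosixPath(pose).name
        tokens = [fpocket_path, "-f", str(local_pose)] + shlex.split(opts)
        commands.append(shlex.join(tokens))
    return commands


commands = build_fpocket_commands(
    fpocket_path="/opt/fpocket/bin/fpocket",  # placeholder path
    pose_paths=["/data/a.pdb", "/data/b.pdb"],
    work_dir="/work/experiment_1",
    option_strings=["-m 3.5", ""],
)
for command in commands:
    print(command)
```

The resulting strings are what a jobstarter would be handed for execution in a setup like this.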
- Parameters:
fpocket_path (str | None)
jobstarter (JobStarter)
- protflow.metrics.fpocket.collect_fpocket_output(output_file, return_full_scores=False)[source]
Collect output from a single FPocket output file.
This function processes the output of a single FPocket output file, extracting scores and other relevant information into a pandas DataFrame.
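- Parameters:
output_file (str) – The path to the FPocket output file.
return_full_scores (bool, optional) – If True, include detailed scores for each pocket. Defaults to False.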
- Returns:
A DataFrame containing the processed output from the FPocket file.
- Return type:
pd.DataFrame
Examples
Here is an example of how to use the collect_fpocket_output function:
from protflow.metrics.fpocket import collect_fpocket_output

# Specify the output file
output_file = "path/to/output_file"

# Collect output
output = collect_fpocket_output(output_file, return_full_scores=True)

# Display the output
print(output)
- Further Details:
Output Processing: The function reads the FPocket output file, extracts relevant scores and information, and formats them into a DataFrame.
Detailed Scores: If the return_full_scores parameter is set to True, the function includes detailed scores for each pocket in the DataFrame.
- protflow.metrics.fpocket.collect_fpocket_scores(output_dir, return_full_scores=False)[source]
Collect scores from an FPocket output directory.
This function collects and processes the scores from FPocket output files located in the specified directory. It aggregates the scores into a pandas DataFrame for further analysis.
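- Parameters:
output_dir (str) – The directory that contains the FPocket output.
return_full_scores (bool, optional) – If True, include detailed scores for each pocket. Defaults to False.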
- Returns:
A DataFrame containing the collected scores from the FPocket output files.
- Return type:
pd.DataFrame
Examples
Here is an example of how to use the collect_fpocket_scores function:
from protflow.metrics.fpocket import collect_fpocket_scores

# Specify the output directory
output_dir = "path/to/output_directory"

# Collect scores
scores = collect_fpocket_scores(output_dir, return_full_scores=True)

# Display the scores
print(scores)
- Further Details:
Score Aggregation: The function looks for FPocket output directories, extracts scores from each output file, and combines them into a single DataFrame.
Detailed Scores: If the return_full_scores parameter is set to True, the function includes detailed scores for each pocket in the DataFrame.
- protflow.metrics.fpocket.get_outfile_name(outdir)[source]
Get the name of the output file from the output directory.
This function generates the name of the FPocket output file based on the specified output directory.
- Parameters:
outdir (str) – The path to the FPocket output directory.
- Returns:
The name of the output file within the specified directory.
- Return type:
str
Examples
Here is an example of how to use the get_outfile_name function:
from protflow.metrics.fpocket import get_outfile_name

# Specify the output directory
outdir = "path/to/output_directory"

# Get the output file name
output_file_name = get_outfile_name(outdir)

# Display the output file name
print(output_file_name)
- Further Details:
File Naming: The function constructs the output file name by modifying the output directory name and appending the appropriate suffix.
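A minimal sketch of this kind of name derivation follows. It assumes the usual fpocket layout, where a structure named design_0001.pdb yields a directory design_0001_out containing design_0001_info.txt. outfile_name_from_outdir is a hypothetical stand-in and may differ from what get_outfile_name returns.

```python
from pathlib import PurePosixPath


def outfile_name_from_outdir(outdir):
    """Derive the info-file path from an fpocket-style "<name>_out" directory."""
    outdir = PurePosixPath(outdir)
    suffix = "_out"
    if not outdir.name.endswith(suffix):
        raise ValueError(f"{outdir} does not look like an fpocket output directory")
    # Strip the directory suffix and append the info-file suffix instead.
    stem = outdir.name[: -len(suffix)]
    return str(outdir / f"{stem}_info.txt")


print(outfile_name_from_outdir("/work/fpocket/design_0001_out"))
```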
- protflow.metrics.fpocket.parse_fpocket_outfile(output_file)[source]
Parse the FPocket output file to extract scores.
This function reads and parses the FPocket output file, extracting scores and other relevant information into a pandas DataFrame.
- Parameters:
output_file (str) – The path to the FPocket output file.
- Returns:
A DataFrame containing the parsed scores from the FPocket output file.
- Return type:
pd.DataFrame
Examples
Here is an example of how to use the parse_fpocket_outfile function:
from protflow.metrics.fpocket import parse_fpocket_outfile

# Specify the output file
output_file = "path/to/output_file"

# Parse the output file
scores = parse_fpocket_outfile(output_file)

# Display the scores
print(scores)
- Further Details:
File Parsing: The function reads the FPocket output file, extracts relevant scores and information, and formats them into a DataFrame.
Score Extraction: The function processes the file line by line, extracting score data and organizing it into a structured format.
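The line-by-line extraction can be sketched as below. The sample text assumes an info file made of "Pocket N :" headers followed by "Name : value" lines; parse_info_text is a hypothetical helper that returns plain dictionaries (one per pocket) rather than a DataFrame, and it is not the ProtFlow parser.

```python
import re

POCKET_HEADER = re.compile(r"^Pocket\s+(\d+)\s*:")

SAMPLE_INFO = """
Pocket 1 :
    Score :  0.413
    Druggability Score :  0.702
    Number of Alpha Spheres :  62

Pocket 2 :
    Score :  0.210
    Druggability Score :  0.050
    Number of Alpha Spheres :  35
"""


def parse_info_text(text):
    """Turn the info text into a list of {column: value} rows, one per pocket."""
    pockets = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = POCKET_HEADER.match(line)
        if header:
            # A header line opens a new pocket record.
            current = {"pocket": int(header.group(1))}
            pockets.append(current)
            continue
        if current is None or ":" not in line:
            continue  # ignore anything outside a pocket block
        key, _, value = line.rpartition(":")
        column = key.strip().lower().replace(" ", "_")
        try:
            current[column] = float(value)
        except ValueError:
            current[column] = value.strip()  # keep non-numeric values as text
    return pockets


rows = parse_info_text(SAMPLE_INFO)
for row in rows:
    print(row)
```

A list of such rows can be passed directly to pandas.DataFrame to obtain a table with one row per pocket.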
protflow.metrics.generic_metric_runner module
Generic metric runner for ProtFlow.
This module exposes GenericMetric, a lightweight protflow.runners.Runner
that executes any importable Python function over the poses stored in a
protflow.poses.Poses object. The target function must accept a single
pose path as its first positional argument and return a JSON-serializable value.
Additional keyword arguments can be forwarded through the runner’s options
dictionary.
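A function that satisfies this contract might look like the sketch below. count_ca_atoms and atom_line are hypothetical and not part of ProtFlow: the pose path is the first positional argument, chain is an optional keyword argument, and the returned integer is JSON-serializable.

```python
import json
import os
import tempfile


def count_ca_atoms(pose_path, chain=None):
    """Hypothetical metric: count C-alpha atoms in a PDB file."""
    count = 0
    with open(pose_path, encoding="utf-8") as handle:
        for line in handle:
            # PDB fixed-width fields: atom name in columns 13-16, chain ID in column 22.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                if chain is None or line[21] == chain:
                    count += 1
    return count


def atom_line(serial, name, chain, resseq):
    """Build a minimal fixed-width ATOM record for the demo below."""
    return f"ATOM  {serial:>5} {name:<4} GLY {chain}{resseq:>4}\n"


with tempfile.TemporaryDirectory() as tmp:
    pose = os.path.join(tmp, "demo.pdb")
    with open(pose, "w", encoding="utf-8") as handle:
        handle.write(atom_line(1, "N", "A", 1))
        handle.write(atom_line(2, "CA", "A", 1))
        handle.write(atom_line(3, "CA", "B", 1))
    all_chains = count_ca_atoms(pose)
    chain_a_only = count_ca_atoms(pose, chain="A")

# Both results survive a JSON round trip unchanged.
encoded = json.dumps({"all": all_chains, "A": chain_a_only})
print(encoded)
```

If such a function lived in an importable module, it could presumably be referenced by module path and function name, with chain supplied through the options dictionary.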
How it works
GenericMetric.run() resolves the working directory and jobstarter, splits
poses.poses_list() into manageable chunks, and starts one worker command
per chunk. Each worker re-enters this module as a small CLI program, imports
the requested module and function dynamically, evaluates the function on every
pose path in its chunk, and stores the results as JSON. The parent process then
concatenates the worker outputs and merges them back into poses.df through
RunnerOutput.
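The worker side of this flow (dynamic import, evaluation per pose, JSON output) can be sketched with the standard library alone. run_worker is a hypothetical stand-in for the module's CLI entrypoint, and os.path.basename is used as the "metric" only so that the example runs without ProtFlow installed.

```python
import importlib
import json
import os
import tempfile


def run_worker(pose_paths, module_name, function_name, options, out_json):
    """Import module_name.function_name and apply it to every pose path."""
    module = importlib.import_module(module_name)
    function = getattr(module, function_name)
    rows = []
    for pose in pose_paths:
        rows.append({
            "data": function(pose, **options),  # the metric value
            "description": os.path.splitext(os.path.basename(pose))[0],
            "location": pose,
        })
    with open(out_json, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)
    return rows


chunk = ["/data/designs/design_0001.pdb", "/data/designs/design_0002.pdb"]
with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "out_0.json")
    run_worker(chunk, "os.path", "basename", {}, out_file)
    with open(out_file, encoding="utf-8") as handle:
        worker_rows = json.load(handle)

print(worker_rows)
```

Resolving the function by name inside the worker means only strings have to cross the process boundary, which is what makes dispatch through a jobstarter straightforward.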
Walkthrough
The example below calculates the radius of gyration for every pose by reusing
protflow.utils.metrics.calc_rog_of_pdb:
from protflow.poses import Poses
from protflow.jobstarters import SbatchArrayJobstarter
from protflow.metrics.generic_metric_runner import GenericMetric
poses = Poses(
    poses=["/data/designs/design_0001.pdb", "/data/designs/design_0002.pdb"],
    work_dir="/data/protflow_runs"
)
cpu_jobstarter = SbatchArrayJobstarter(max_cores=10)
rog = GenericMetric(
    module="protflow.utils.metrics",
    function="calc_rog_of_pdb",
    options={"chain": "A"},
    jobstarter=cpu_jobstarter,
)
poses = rog.run(poses=poses, prefix="rog")
# GenericMetric stores the returned value in <prefix>_data.
print(poses.df[["poses_description", "rog_data"]])
In that run, GenericMetric will:
Build /data/protflow_runs/rog as its working directory.
Split the input pose paths into chunks based on max_cores and a hard limit of 100 poses per command.
Launch worker commands that call calc_rog_of_pdb(pose_path, chain="A").
Save intermediate JSON files such as out_0.json.
Merge the combined results back into poses.df as rog_data, rog_description, and rog_location.
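One way to combine a core count with a per-command cap is sketched below. The exact rule used by GenericMetric is not stated here, so split_into_chunks and its sizing formula are assumptions: poses are spread across max_cores, but no chunk exceeds 100 poses.

```python
import math


def split_into_chunks(items, max_cores, max_per_chunk=100):
    """Split items into chunks sized for max_cores, capped at max_per_chunk."""
    if max_cores < 1:
        raise ValueError("max_cores must be at least 1")
    if not items:
        return []
    # Aim for one chunk per core, then enforce the hard per-command limit.
    size = min(max_per_chunk, math.ceil(len(items) / max_cores))
    return [items[i:i + size] for i in range(0, len(items), size)]


small = split_into_chunks([f"pose_{i}.pdb" for i in range(25)], max_cores=10)
large = split_into_chunks([f"pose_{i}.pdb" for i in range(5000)], max_cores=10)
print(len(small), len(large))
```

With 25 poses and 10 cores this sketch yields chunks of 3, while 5000 poses hit the cap and are split into 50 commands of 100 poses each.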
This module is intended for simple, embarrassingly parallel per-pose metrics.
If your function needs multiple inputs, non-JSON output, or a richer output
schema than a single data column, a dedicated runner is usually a better
fit.
- class protflow.metrics.generic_metric_runner.GenericMetric(python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]
Bases:
Runner
Run a simple Python metric function over every pose in a Poses.
GenericMetric is the most lightweight metric runner in ProtFlow. You point it at an importable module and a function name, optionally provide a shared options dictionary, and the runner takes care of chunking the pose list, dispatching jobs through a JobStarter, collecting the JSON outputs, and merging the results back into poses.df.
The target function contract is intentionally small:
The first positional argument must be the pose path.
Optional keyword arguments can be supplied via options.
The return value must be serializable to JSON.
The resulting metric value is stored in <prefix>_data after the run is merged back into poses.df.
- Parameters:
- __init__(python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]
Initialize a generic per-pose metric runner.
- Parameters:
python_path (str | None, optional) – Python interpreter used to launch worker commands. If omitted, the interpreter from the configured PROTFLOW_ENV is used.
module (str | None, optional) – Importable module path that contains the target metric function.
function (str | None, optional) – Name of the function to call inside module.
options (dict | None, optional) – Keyword arguments forwarded to the target function for every pose.
jobstarter (JobStarter | None, optional) – Default jobstarter used when run() is called without one.
overwrite (bool, optional) – Whether existing runner scorefiles should be recomputed by default.
- run(poses, prefix, python_path=None, module=None, function=None, options=None, jobstarter=None, overwrite=False)[source]
Execute the configured metric function across all poses.
- Parameters:
poses (Poses) – Input poses. GenericMetric reads the pose file paths from poses.df["poses"].
prefix (str) – Prefix used for the runner work directory, cached scorefile, and merged result columns.
python_path (str | None, optional) – Python interpreter used for worker commands. Defaults to the value configured on the runner instance.
module (str | None, optional) – Importable module path for the metric function. Defaults to the value configured on the runner instance.
function (str | None, optional) – Function name inside module. Defaults to the value configured on the runner instance.
options (dict | None, optional) – Shared keyword arguments forwarded to the metric function. Defaults to the value configured on the runner instance.
jobstarter (JobStarter | None, optional) – Jobstarter used for this invocation. Resolution priority is run(jobstarter) -> self.jobstarter -> poses.default_jobstarter.
overwrite (bool, optional) – If True, recompute the metric even when the cached scorefile already exists.
- Returns:
The input Poses instance with additional columns such as <prefix>_data, <prefix>_description, and <prefix>_location merged into poses.df.
- Return type:
Poses
- Raises:
ValueError – If options is not a dictionary or if no usable jobstarter is available.
RuntimeError – If fewer output rows are collected than input poses, which usually indicates failed worker jobs.
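The jobstarter resolution order and the corresponding ValueError can be pictured with a few lines of plain Python. resolve_jobstarter is a hypothetical helper, and strings stand in for JobStarter objects so the sketch runs on its own.

```python
def resolve_jobstarter(run_jobstarter, runner_jobstarter, poses_default_jobstarter):
    """Return the first jobstarter that is set, in priority order."""
    candidates = (run_jobstarter, runner_jobstarter, poses_default_jobstarter)
    for candidate in candidates:
        if candidate is not None:
            return candidate
    raise ValueError("no usable jobstarter: pass one to run(), the runner, or the poses")


# The argument given to run() wins over the runner and poses defaults.
print(resolve_jobstarter("run_js", "runner_js", "poses_js"))
# Without a run() argument, the runner's own default is used next.
print(resolve_jobstarter(None, "runner_js", "poses_js"))
```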
Examples
from protflow.metrics.generic_metric_runner import GenericMetric

rog = GenericMetric(
    module="protflow.utils.metrics",
    function="calc_rog_of_pdb",
    options={"chain": "A"},
)
poses = rog.run(poses=poses, prefix="rog", jobstarter=cpu_jobstarter)
Notes
Internally, run() launches this module as a worker script for each pose chunk. Each worker writes a JSON file with the columns data, description, and location. The parent process concatenates those files and lets RunnerOutput merge the final table back into poses.df.
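The parent-side collection step might be sketched as follows. collect_worker_outputs is a hypothetical helper that concatenates per-chunk JSON files and raises RuntimeError when rows are missing; it uses plain lists instead of a DataFrame and does not involve RunnerOutput.

```python
import json
import os
import tempfile


def collect_worker_outputs(json_files, n_input_poses):
    """Concatenate worker JSON files and verify that no pose was lost."""
    rows = []
    for path in json_files:
        with open(path, encoding="utf-8") as handle:
            rows.extend(json.load(handle))
    if len(rows) < n_input_poses:
        raise RuntimeError(
            f"collected {len(rows)} rows for {n_input_poses} poses; "
            "some worker jobs probably failed"
        )
    return rows


with tempfile.TemporaryDirectory() as tmp:
    chunks = [
        [{"data": 11.2, "description": "design_0001", "location": "/d/design_0001.pdb"},
         {"data": 12.9, "description": "design_0002", "location": "/d/design_0002.pdb"}],
        [{"data": 10.4, "description": "design_0003", "location": "/d/design_0003.pdb"}],
    ]
    files = []
    for index, chunk in enumerate(chunks):
        path = os.path.join(tmp, f"out_{index}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(chunk, handle)
        files.append(path)
    merged_rows = collect_worker_outputs(files, n_input_poses=3)

print([row["description"] for row in merged_rows])
```

The data values above are made-up numbers that only give the rows some content.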
- set_function(function)[source]
Set the function name to import from self.module.
- Parameters:
function (str) – Attribute name of the target metric function.
- Return type:
None
- set_jobstarter(jobstarter)[source]
Set the default jobstarter for this runner instance.
- Parameters:
jobstarter (JobStarter | None) – Jobstarter used when run() does not receive one explicitly.
- Raises:
ValueError – If jobstarter is neither None nor a JobStarter.
- Return type:
None
- set_module(module)[source]
Set the importable module path that contains the metric function.
- Parameters:
module (str) – Importable module path, for example "protflow.utils.metrics".
- Return type:
None
- set_options(options)[source]
Set shared keyword arguments for the metric function.
- Parameters:
options (dict | None) – Keyword arguments forwarded as function(pose, **options).
- Raises:
ValueError – If options is neither None nor a dictionary.
- Return type:
None
- protflow.metrics.generic_metric_runner.main(args)[source]
Worker entrypoint used by GenericMetric.run().
The parent runner starts this module as a CLI script, passes a comma-separated list of pose paths plus the import target, and expects a JSON file containing data, description, and location columns.
protflow.metrics.protparam module
ProtParam Module
This module provides the functionality to integrate ProtParam calculations within the ProtFlow framework. It offers tools to compute various protein sequence features using the BioPython Bio.SeqUtils.ProtParam module, handling inputs and outputs efficiently, and processing the resulting data in a structured and automated manner.
Detailed Description
The ProtParam class encapsulates the functionality necessary to execute ProtParam calculations. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of parameter calculations. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem. The module is designed to streamline the integration of ProtParam into larger computational workflows. It supports the automatic setup of job parameters, execution of ProtParam commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.
Usage
To use this module, create an instance of the ProtParam class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the ProtParam process is provided through various parameters, allowing for customized runs tailored to specific research needs.
Examples
Here is an example of how to initialize and use the ProtParam class within a ProtFlow pipeline:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.protparam import ProtParam
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the ProtParam class
protparam = ProtParam()
# Run the ProtParam calculation process
results = protparam.run(
    poses=poses,
    prefix="experiment_1",
    seq_col=None,
    pH=7,
    overwrite=True,
    jobstarter=jobstarter
)
# Access and process the results
print(results)
Further Details
Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the ProtParam process.
Customizability: Users can customize the ProtParam process through multiple parameters, including the pH for determining protein total charge, specific options for the ProtParam script, and options for handling pose-specific parameters.
Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to incorporate ProtParam calculations into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
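To illustrate why pH matters for the total charge, the sketch below estimates net charge with the Henderson-Hasselbalch relation. It is not the Bio.SeqUtils.ProtParam implementation: net_charge is a hypothetical helper and the pKa values are rounded textbook-style assumptions, so absolute numbers will differ from BioPython's.

```python
# Illustrative pKa values (assumption; BioPython uses its own tables).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def net_charge(sequence, pH=7.0):
    """Estimate the net charge of a protein sequence at the given pH."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Basic groups carry +1 when protonated; acidic groups carry -1 when deprotonated.
    charge = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (pH - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - pH))
    return charge


for ph in (3.0, 7.0, 11.0):
    print(ph, round(net_charge("MKKDEHR", pH=ph), 2))
```

The same sequence moves from positive to negative net charge as pH rises, which is the dependence the pH parameter of the runner exposes.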
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Version
0.1.0
- class protflow.metrics.protparam.ProtParam(jobstarter=None, python=None)[source]
Bases:
Runner
Class for calculating ProtParam sequence features using the BioPython Bio.SeqUtils.ProtParam module.
- Parameters:
jobstarter (JobStarter)
python (str | None)
- __init__(jobstarter=None, python=None)[source]
Initialize the ProtParam class.
This constructor sets up the necessary environment for running ProtParam calculations. It initializes the job starter and sets the path to the Python executable within the ProtFlow environment.
- Parameters:
jobstarter (JobStarter, optional) – The job starter to be used for executing ProtParam commands. Defaults to None.
python (str, optional) – The path to the Python executable within the ProtFlow environment. If omitted, the default value is constructed using the PROTFLOW_ENV environment variable.
- Raises:
FileNotFoundError – If the default Python executable is not found in the specified path.
- Parameters:
jobstarter (JobStarter)
python (str | None)
Examples
Here is an example of how to initialize the ProtParam class:
from protflow.jobstarters import JobStarter
from protflow.metrics.protparam import ProtParam

# Initialize the ProtParam class with default settings
protparam = ProtParam()

# Initialize the ProtParam class with a specific job starter
custom_jobstarter = JobStarter()
protparam = ProtParam(jobstarter=custom_jobstarter)
The __init__ method ensures that the ProtParam class is ready to perform protein sequence parameter calculations within the ProtFlow framework, setting up the environment and configurations necessary for successful execution.
- run(poses, prefix, seq_col=None, pH=7, overwrite=False, jobstarter=None)[source]
ProtParam Class
The ProtParam class is a specialized class designed to facilitate the calculation of protein sequence parameters within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with ProtParam calculations.
Detailed Description
The ProtParam class manages all aspects of running ProtParam calculations. It handles the configuration of necessary scripts and executables, prepares the environment for sequence feature calculations, and executes the ProtParam commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to ProtParam scripts and Python executables.
Configuring job starter options, either automatically or manually.
Handling the execution of ProtParam commands with support for various input types.
Collecting and processing output data into a pandas DataFrame.
Customizing the sequence feature calculations based on user-defined parameters such as pH.
- Returns:
An instance of the ProtParam class, configured to run ProtParam calculations and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
TypeError
Examples
Here is an example of how to initialize and use the ProtParam class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.protparam import ProtParam

# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()

# Initialize the ProtParam class
protparam = ProtParam()

# Run the ProtParam calculation process
results = protparam.run(
    poses=poses,
    prefix="experiment_1",
    seq_col=None,
    pH=7,
    overwrite=True,
    jobstarter=jobstarter
)

# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the ProtParam calculations to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The ProtParam class is intended for researchers and developers who need to perform ProtParam calculations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
- Parameters:
poses (Poses)
prefix (str)
seq_col (str)
pH (float)
jobstarter (JobStarter)
- Return type:
None
protflow.metrics.rmsd module
RMSD Module
This module provides the functionality to calculate Root Mean Square Deviation (RMSD) values for protein structures within the ProtFlow framework. It offers tools to run RMSD calculations, handle inputs and outputs, and process the resulting data in a structured and automated manner.
Detailed Description
The BackboneRMSD and MotifRMSD classes encapsulate the functionality necessary to execute RMSD calculations. These classes manage the configuration of paths to essential scripts and Python executables, set up the environment, and handle the execution of RMSD calculations. They also include methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.
The module is designed to streamline the integration of RMSD calculations into larger computational workflows. It supports the automatic setup of job parameters, execution of RMSD commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.
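For reference, the quantity itself is the square root of the mean squared distance between paired atoms. The sketch below computes it in plain Python for coordinates that are already paired and superimposed; rmsd is a hypothetical helper and performs no alignment or atom selection, unlike a full runner.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    squared = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Every atom is shifted by the vector (3, 4, 0), i.e. by a distance of 5.
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]

print(rmsd(reference, reference))
print(rmsd(reference, shifted))
```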
Usage
To use this module, create an instance of the BackboneRMSD or MotifRMSD class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the RMSD calculation process is provided through various parameters, allowing for customized runs tailored to specific research needs.
Examples
Here is an example of how to initialize and use the BackboneRMSD class within a ProtFlow pipeline:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import BackboneRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Run the RMSD calculation
results = backbone_rmsd.run(
poses=poses,
prefix="experiment_1",
jobstarter=jobstarter,
ref_col="reference",
chains=["A", "B"],
overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the RMSD calculation process.
Customizability: Users can customize the RMSD calculation process through multiple parameters, including the specific atoms and chains to be used in the calculation, as well as jobstarter configurations.
Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to incorporate RMSD calculations into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Author
Markus Braun, Adrian Tripp
Version
0.1.0
- class protflow.metrics.rmsd.AtomRMSD(ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=False, jobstarter=None, overwrite=False)[source]
Bases:
Runner
Runner for atom-level BioPython RMSD calculations.
- Parameters:
ref_col (str | None)
ref_path (str | None)
ref_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
ref_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
return_superimposed (bool)
jobstarter (JobStarter | None)
overwrite (bool)
- __init__(ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=False, jobstarter=None, overwrite=False)[source]
Initialize an atom-level RMSD runner.
- Parameters:
ref_col (str | None)
ref_path (str | None)
ref_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
ref_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
return_superimposed (bool)
jobstarter (JobStarter | None)
overwrite (bool)
- Return type:
None
- run(poses, prefix, jobstarter=None, overwrite=False, ref_col=None, ref_path=None, ref_atoms=None, ref_superimpose_atoms=None, target_atoms=None, target_superimpose_atoms=None, return_superimposed=None)[source]
Run atom-level RMSD calculation and merge the resulting scores into poses.
- Parameters:
poses (Poses)
prefix (str)
jobstarter (JobStarter | None)
overwrite (bool)
ref_col (str | None)
ref_path (str | None)
ref_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
ref_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
return_superimposed (bool | None)
- Return type:
- setup_input_dict(poses, ref_col, ref_path, ref_atoms, ref_superimpose_atoms, target_atoms, target_superimpose_atoms, return_superimposed)[source]
Set up the JSON input dictionary for calc_atom_rmsd.py.
- Parameters:
poses (Poses)
ref_col (str | None)
ref_path (str | None)
ref_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
ref_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
target_superimpose_atoms (str | tuple[Any, ...] | list[Any] | dict[str, Any] | AtomSelection | None)
return_superimposed (bool)
- Return type:
- class protflow.metrics.rmsd.BackboneRMSD(ref_col=None, atoms=['CA'], chains=None, overwrite=False, jobstarter=None)[source]
Bases:
Runner
BackboneRMSD Class
The BackboneRMSD class is a specialized class designed to facilitate the calculation of backbone RMSD values within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with RMSD calculations.
Detailed Description
The BackboneRMSD class manages all aspects of calculating RMSD for protein backbones. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to RMSD calculation scripts and Python executables.
Configuring job starter options, either automatically or manually.
Handling the execution of RMSD commands with support for different atoms and chains.
Collecting and processing output data into a pandas DataFrame.
Managing overwrite options and handling existing score files.
- Returns:
An instance of the BackboneRMSD class, configured to run RMSD calculations and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
TypeError
Examples
Here is an example of how to initialize and use the BackboneRMSD class:
from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Run the RMSD calculation
results = backbone_rmsd.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    ref_col="reference_location",
    chains=["A", "B"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the RMSD calculation process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The BackboneRMSD class is intended for researchers and developers who need to perform backbone RMSD calculations as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
- __init__(ref_col=None, atoms=['CA'], chains=None, overwrite=False, jobstarter=None)[source]
Initialize the BackboneRMSD class.
This constructor sets up the BackboneRMSD instance with default or provided parameters. It configures the reference column, atoms, chains, jobstarter, and overwrite options for RMSD calculations.
- Parameters:
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to ["CA"].
chains (list[str], optional) – The list of chain names to calculate RMSD over. Defaults to None.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.
- Returns:
None
Examples
Here is an example of how to initialize the BackboneRMSD class:
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD
# Initialize the BackboneRMSD class with default parameters
backbone_rmsd = BackboneRMSD()
# Initialize the BackboneRMSD class with custom parameters
backbone_rmsd = BackboneRMSD(
    ref_col="reference",
    atoms=["CA", "CB"],
    chains=["A", "B"],
    overwrite=True,
    jobstarter=LocalJobStarter(max_cores=4)
)
- Further Details:
Default Values: If no parameters are provided, the class initializes with default values suitable for basic RMSD calculations.
Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.
Custom Configuration: Users can customize the RMSD calculation process by providing specific values for the reference column, atoms, chains, and jobstarter.
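The interplay between constructor defaults and per-call arguments can be sketched as follows. This is a minimal illustration, assuming that an argument passed at call time takes precedence over the value stored at initialization; the text implies this but does not state it. `RunnerSketch` and `resolve` are hypothetical names, not part of ProtFlow.

```python
class RunnerSketch:
    """Hypothetical stand-in for a runner; not the ProtFlow class."""

    def __init__(self, ref_col=None, atoms=None, chains=None):
        # Store constructor arguments as instance-level defaults.
        self.ref_col = ref_col
        self.atoms = ["CA"] if atoms is None else list(atoms)
        self.chains = chains

    def resolve(self, ref_col=None, chains=None):
        # A value passed at call time wins; otherwise use the stored default.
        return {
            "ref_col": self.ref_col if ref_col is None else ref_col,
            "atoms": self.atoms,
            "chains": self.chains if chains is None else chains,
        }

runner = RunnerSketch(ref_col="reference", chains=["A"])
defaults = runner.resolve()
overridden = runner.resolve(chains=["A", "B"])
print(defaults)
print(overridden)
```

Creating the default list inside the constructor, rather than using a mutable default argument, keeps instances from sharing one list object.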
- run(poses, prefix, ref_col=None, jobstarter=None, chains=None, overwrite=False)[source]
Calculate the backbone RMSD for given poses and jobstarter configuration.
This method sets up and runs the RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
prefix (str) – A prefix used to name and organize the output files.
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
chains (list[str], optional) – A list of chain names to calculate RMSD over. Defaults to None.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
An instance of the RunnerOutput class, containing the processed poses and results of the RMSD calculation.
- Return type:
- Raises:
FileNotFoundError – If required files or directories are not found during the execution process.
ValueError – If invalid arguments are provided to the method.
TypeError – If chains are not of the expected type.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = LocalJobStarter(max_cores=4)
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Run the RMSD calculation
results = backbone_rmsd.run(
    poses=poses,
    prefix="experiment_1",
    jobstarter=jobstarter,
    ref_col="reference",
    chains=["A", "B"],
    overwrite=True
)
# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.
Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running RMSD calculations using BioPython.
Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the RMSD calculation process to their specific needs, including specifying atoms and chains for RMSD calculations.
This method is designed to streamline the execution of backbone RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze RMSD calculations.
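The splitting and input-preparation steps described above can be sketched roughly as follows. This is a simplified illustration, assuming each JSON file maps a pose path to its reference path. The helper names, the file naming scheme, and the JSON layout are hypothetical and may differ from what ProtFlow actually writes.

```python
import json
import os
import tempfile

def split_into_sublists(items, n_sublists):
    """Distribute items over at most n_sublists chunks of near-equal size."""
    n = max(1, min(n_sublists, len(items)))
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def write_inputs(pose_to_ref, n_sublists, work_dir):
    """Write one JSON file per sublist, mapping pose path -> reference path."""
    poses = sorted(pose_to_ref)
    paths = []
    for idx, chunk in enumerate(split_into_sublists(poses, n_sublists)):
        path = os.path.join(work_dir, f"rmsd_input_{idx:04d}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({pose: pose_to_ref[pose] for pose in chunk}, handle)
        paths.append(path)
    return paths

# Hypothetical poses that all share one reference structure.
pose_to_ref = {f"pose_{i}.pdb": "reference.pdb" for i in range(5)}
with tempfile.TemporaryDirectory() as work_dir:
    json_files = write_inputs(pose_to_ref, n_sublists=2, work_dir=work_dir)
    sizes = []
    for path in json_files:
        with open(path, encoding="utf-8") as handle:
            sizes.append(len(json.load(handle)))
print(sizes)
```

Each JSON file would then be handed to one job, and the per-job score files collected and concatenated afterwards.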
- set_atoms(atoms)[source]
Set the atoms for RMSD calculations.
This method sets the list of atom names to calculate RMSD over. If “all” is provided, all atoms will be considered.
- Parameters:
atoms (list[str]) – The list of atom names to calculate RMSD over.
- Returns:
None
- Raises:
TypeError – If atoms is not a list of strings.
- Return type:
None
Examples
Here is an example of how to use the set_atoms method:
from protflow.metrics.rmsd import BackboneRMSD
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Set the atoms for RMSD calculation
backbone_rmsd.set_atoms(["CA", "CB"])
- Further Details:
Usage: The list of atoms specifies which atoms in the protein backbone will be considered during RMSD calculations.
Validation: The method includes validation to ensure that the atoms parameter is a list of strings, representing valid atom names.
Flexibility: Users can specify any set of atoms or choose to include all atoms by setting the parameter to “all”.
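The validation described above might look roughly like the following. This is a minimal sketch, assuming the only accepted inputs are the string "all" or a list of strings. `validate_atoms` is a hypothetical helper, not the method's actual code.

```python
def validate_atoms(atoms):
    """Return atoms unchanged if it is "all" or a list of strings, else raise TypeError."""
    if atoms == "all":
        return atoms
    if isinstance(atoms, list) and all(isinstance(a, str) for a in atoms):
        return atoms
    raise TypeError(f"atoms must be a list of str or 'all', got {atoms!r}")

print(validate_atoms(["CA", "CB"]))
print(validate_atoms("all"))
```

Under this assumption a bare atom name such as "CA" is rejected, so a single atom has to be wrapped in a list.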
- set_chains(chains)[source]
Set the chains for RMSD calculations.
This method sets the list of chain names to calculate RMSD over. It ensures that the provided chains parameter is a list of strings or a single string representing chain names.
- Parameters:
chains (list[str] or str) – The list of chain names or a single chain name to calculate RMSD over.
- Returns:
None
- Raises:
TypeError – If chains is not a list of strings or a single string.
- Return type:
None
Examples
Here is an example of how to use the set_chains method:
from protflow.metrics.rmsd import BackboneRMSD
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Set the chains for RMSD calculation
backbone_rmsd.set_chains(["A", "B"])
# Alternatively, set a single chain
backbone_rmsd.set_chains("A")
- Further Details:
Usage: The chains parameter specifies which chains in the protein structure will be considered during RMSD calculations.
Validation: The method includes validation to ensure that the chains parameter is either a list of strings or a single string, representing valid chain names.
Flexibility: Users can specify multiple chains as a list or a single chain as a string, providing flexibility in how the RMSD calculations are configured.
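One way to accept both input forms is to normalize a single chain name into a one-element list. The sketch below assumes such normalization; whether ProtFlow stores a lone string as-is or converts it is not stated here. `normalize_chains` is a hypothetical helper.

```python
def normalize_chains(chains):
    """Accept a chain name or a list of chain names and always return a list."""
    if isinstance(chains, str):
        return [chains]
    if isinstance(chains, list) and all(isinstance(c, str) for c in chains):
        return list(chains)
    raise TypeError(f"chains must be a str or a list of str, got {chains!r}")

print(normalize_chains("A"))
print(normalize_chains(["A", "B"]))
```

Normalizing early means downstream code only ever has to handle the list form.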
- set_jobstarter(jobstarter)[source]
Set the jobstarter configuration for the BackboneRMSD runner.
This method sets the jobstarter configuration to be used in the RMSD calculation process.
- Parameters:
jobstarter (JobStarter) – The jobstarter configuration for running the RMSD calculations.
- Returns:
None
- Raises:
TypeError – If jobstarter is not of type JobStarter.
- Return type:
None
Examples
Here is an example of how to use the set_jobstarter method:
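from protflow.jobstarters import LocalJobStarter
from protflow.metrics.rmsd import BackboneRMSD
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Set the jobstarter configuration (must be a JobStarter instance, not a string)
backbone_rmsd.set_jobstarter(LocalJobStarter(max_cores=4))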
- Further Details:
Usage: The jobstarter configuration specifies how the RMSD calculations will be managed and executed, particularly in HPC environments.
Validation: The method includes validation to ensure that the jobstarter parameter is of the correct type.
Integration: The jobstarter configuration set by this method is used by other methods in the class to manage the execution of RMSD calculations.
- set_ref_col(ref_col)[source]
Set the reference column for RMSD calculations.
This method sets the default reference column to be used in the RMSD calculation process.
- Parameters:
ref_col (str) – The reference column for RMSD calculations.
- Returns:
None
- Raises:
TypeError – If ref_col is not of type string.
- Return type:
None
Examples
Here is an example of how to use the set_ref_col method:
from protflow.metrics.rmsd import BackboneRMSD
# Initialize the BackboneRMSD class
backbone_rmsd = BackboneRMSD()
# Set the reference column
backbone_rmsd.set_ref_col("reference")
- Further Details:
Usage: The reference column is used to identify which column in the input data contains the reference structures for RMSD calculation.
Validation: The method includes validation to ensure that the reference column is of the correct type.
Integration: The reference column set by this method is used by other methods in the class to perform RMSD calculations.
- class protflow.metrics.rmsd.MotifRMSD(ref_col=None, target_motif=None, ref_motif=None, atoms=None, return_superimposed_poses=False, jobstarter=None, overwrite=False)[source]
Bases:
Runner
MotifRMSD Class
The MotifRMSD class is a specialized class designed to facilitate the calculation of RMSD values for specific motifs within protein structures in the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with motif-specific RMSD calculations.
Detailed Description
The MotifRMSD class manages all aspects of calculating RMSD for specified motifs within protein structures. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to motif RMSD calculation scripts and Python executables.
Configuring job starter options, either automatically or manually.
Handling the execution of RMSD commands with support for various motifs and chains.
Collecting and processing output data into a pandas DataFrame.
Managing overwrite options and handling existing score files.
- Returns:
An instance of the MotifRMSD class, configured to run motif RMSD calculations and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
TypeError
Examples
Here is an example of how to initialize and use the MotifRMSD class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()
# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B",
    atoms=["CA", "CB"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the motif RMSD calculation process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The MotifRMSD class is intended for researchers and developers who need to perform RMSD calculations for specific motifs as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
- __init__(ref_col=None, target_motif=None, ref_motif=None, atoms=None, return_superimposed_poses=False, jobstarter=None, overwrite=False)[source]
Initialize the MotifRMSD class.
This constructor sets up the MotifRMSD instance with default or provided parameters. It configures the reference column, target motif, reference motif, atoms, the return of superimposed poses, jobstarter, and overwrite options for RMSD calculations.
- Parameters:
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.
ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.
atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to None.
return_superimposed_poses (bool, optional) – If True, return superimposed poses as new poses. Defaults to False.
jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
None
Examples
Here is an example of how to initialize the MotifRMSD class:
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD
# Initialize the MotifRMSD class with default parameters
motif_rmsd = MotifRMSD()
# Initialize the MotifRMSD class with custom parameters
motif_rmsd = MotifRMSD(
    ref_col="reference",
    target_motif="motif_A",
    ref_motif="motif_B",
    atoms=["CA", "CB"],
    jobstarter=JobStarter(),
    overwrite=True
)
- Further Details:
Default Values: If no parameters are provided, the class initializes with default values suitable for basic motif-specific RMSD calculations.
Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.
Custom Configuration: Users can customize the motif RMSD calculation process by providing specific values for the reference column, target motif, reference motif, atoms, return of superimposed poses, jobstarter, and overwrite option.
- run(poses, prefix, jobstarter=None, ref_col=None, ref_motif=None, target_motif=None, atoms=None, return_superimposed_poses=False, overwrite=False)[source]
Calculate the motif-specific RMSD for given poses and jobstarter configuration.
This method sets up and runs the motif-specific RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
prefix (str) – A prefix used to name and organize the output files.
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.
target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.
atoms (list[str], optional) – The list of atom names to calculate RMSD over. Defaults to None.
return_superimposed_poses (bool, optional) – If True, return superimposed poses as new poses.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
An instance of the RunnerOutput class, containing the processed poses and results of the RMSD calculation.
- Return type:
- Raises:
FileNotFoundError – If required files or directories are not found during the execution process.
ValueError – If invalid arguments are provided to the method.
TypeError – If motifs or atoms are not of the expected type.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()
# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B",
    atoms=["CA", "CB"],
    overwrite=True
)
# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.
Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running motif-specific RMSD calculations.
Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the motif RMSD calculation process to their specific needs, including specifying reference and target motifs, as well as atoms for RMSD calculations.
This method is designed to streamline the execution of motif-specific RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze motif-specific RMSD calculations.
- set_atoms(atoms=None)[source]
Set the atoms used for superposition and RMSD calculations.
- Parameters:
atoms (list[str]) – The atoms used for superposition.
- Return type:
None
- set_jobstarter(jobstarter)[source]
Set the jobstarter configuration for the MotifRMSD runner.
- Parameters:
jobstarter (JobStarter) – The jobstarter configuration.
- Raises:
ValueError – If jobstarter is not of type JobStarter.
- Return type:
None
- set_ref_col(col)[source]
Set the reference column for RMSD calculations.
- Parameters:
col (str) – The reference column name.
- Return type:
None
- set_ref_motif(motif)[source]
Set the reference motif. motif must be a string and should be the name of a column in the poses.df of the Poses object passed to run().
- Parameters:
motif (str)
- Return type:
None
- set_return_superimposed_poses(return_superimposed_poses)[source]
Set whether superimposed poses should be returned. return_superimposed_poses must be a bool.
- Parameters:
return_superimposed_poses (bool)
- Return type:
None
- set_target_motif(motif)[source]
Set the target motif. motif must be a string and should be the name of a column in the poses.df of the Poses object passed to run().
- Parameters:
motif (str)
- Return type:
None
- setup_input_dict(poses, ref_col, ref_motif=None, target_motif=None)[source]
Set up the input dictionary for motif RMSD calculations.
This method prepares a dictionary that can be written to a JSON file and used as input for the motif RMSD calculation script. The dictionary contains mappings of poses to reference PDB files, target motifs, and reference motifs.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
ref_col (str) – The reference column for RMSD calculations.
ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.
target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.
- Returns:
A dictionary structured for input to the motif RMSD calculation script.
- Return type:
- Raises:
TypeError – If ref_motif or target_motif is not of the expected type.
Examples
Here is an example of how to use the setup_input_dict method:
from protflow.metrics.rmsd import MotifRMSD
from protflow.poses import Poses
# Initialize the MotifRMSD class
motif_rmsd = MotifRMSD()
# Create a Poses object
poses = Poses()
# Set up the input dictionary for RMSD calculations
input_dict = motif_rmsd.setup_input_dict(
    poses=poses,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B"
)
# Print the input dictionary
print(input_dict)
- Further Details:
Dictionary Structure: The input dictionary maps each pose to its reference PDB file, target motif, and reference motif.
Parameter Handling: The method handles different types of inputs for motifs, ensuring that they are correctly formatted for the RMSD calculation script.
Integration: The input dictionary prepared by this method is used by the run method to execute motif RMSD calculations.
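To make the dictionary structure concrete, the sketch below builds a mapping of that general shape from plain rows standing in for poses.df. The key names ("ref_pdb", "reference_motif", "target_motif"), the "poses" column, and the residue identifiers are all hypothetical; the keys the real script expects may differ.

```python
import json

def build_input_dict(rows, ref_col, ref_motif_col, target_motif_col):
    """Map each pose path to its reference file and motif definitions."""
    return {
        row["poses"]: {
            "ref_pdb": row[ref_col],
            "reference_motif": row[ref_motif_col],
            "target_motif": row[target_motif_col],
        }
        for row in rows
    }

# Hypothetical rows; each dict plays the role of one row of poses.df.
rows = [
    {"poses": "design_0001.pdb", "reference": "template.pdb",
     "motif_A": ["A10", "A11"], "motif_B": ["B5", "B6"]},
    {"poses": "design_0002.pdb", "reference": "template.pdb",
     "motif_A": ["A10", "A11"], "motif_B": ["B7", "B8"]},
]
input_dict = build_input_dict(rows, "reference", "motif_A", "motif_B")
print(json.dumps(input_dict, indent=2))
```

Keeping the values to strings and lists ensures the dictionary survives a round trip through JSON unchanged.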
- class protflow.metrics.rmsd.MotifSeparateSuperpositionRMSD(ref_col=None, super_target_motif=None, super_ref_motif=None, super_atoms=None, rmsd_target_motif=None, rmsd_ref_motif=None, rmsd_atoms=None, super_include_het_atoms=False, rmsd_include_het_atoms=False, jobstarter=None, overwrite=False)[source]
Bases:
Runner
MotifSeparateSuperpositionRMSD Class
The MotifSeparateSuperpositionRMSD class is a specialized class designed to facilitate the separate superposition and calculation of RMSD values for specific motifs within protein structures in the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with motif-specific superposition and RMSD calculations.
Detailed Description
The MotifSeparateSuperpositionRMSD class manages all aspects of superpositioning on one motif and calculating RMSD for another within protein structures. It handles the configuration of necessary scripts and executables, prepares the environment for RMSD calculations, and executes the commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to motif RMSD calculation scripts and Python executables.
Configuring job starter options, either automatically or manually.
Handling the execution of RMSD commands with support for various motifs and chains.
Collecting and processing output data into a pandas DataFrame.
Managing overwrite options and handling existing score files.
- Returns:
An instance of the MotifSeparateSuperpositionRMSD class, configured to run motif RMSD calculations and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
TypeError
Examples
Here is an example of how to initialize and use the MotifSeparateSuperpositionRMSD class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.metrics.rmsd import MotifSeparateSuperpositionRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()
# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=["CA"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the motif RMSD calculation process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The MotifSeparateSuperpositionRMSD class is intended for researchers and developers who need to perform RMSD calculations for specific motifs as part of their protein design and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
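The idea of fitting on one motif and measuring on another can be illustrated with a toy example. The sketch below works in 2D, where the least-squares rotation has a closed form, purely to keep the code free of dependencies. The real calculation is three-dimensional, and all function names and coordinates here are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_transform(mobile, reference):
    """Least-squares rotation angle and centroids that map mobile onto reference (2D)."""
    cm, cr = centroid(mobile), centroid(reference)
    dot = cross = 0.0
    for (mx, my), (rx, ry) in zip(mobile, reference):
        px, py = mx - cm[0], my - cm[1]
        qx, qy = rx - cr[0], ry - cr[1]
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    return math.atan2(cross, dot), cm, cr

def apply_transform(points, angle, cm, cr):
    """Rotate points about cm by angle, then move the rotation center to cr."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * (x - cm[0]) - s * (y - cm[1]) + cr[0],
         s * (x - cm[0]) + c * (y - cm[1]) + cr[1])
        for x, y in points
    ]

def rmsd(a, b):
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)) / len(a))

# Reference: three atoms used for superposition, one atom used for the RMSD.
ref_super = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ref_rmsd = [(2.0, 2.0)]

# Mobile copy: rotated by 30 degrees and shifted; its RMSD atom is also displaced by 1.
pose = (math.radians(30), (0.0, 0.0), (5.0, -3.0))
mobile_super = apply_transform(ref_super, *pose)
mobile_rmsd = apply_transform([(2.0, 3.0)], *pose)

# Fit on the superposition atoms only, then reuse that transform for the RMSD atoms.
angle, cm, cr = fit_transform(mobile_super, ref_super)
fit_rmsd = rmsd(apply_transform(mobile_super, angle, cm, cr), ref_super)
motif_rmsd = rmsd(apply_transform(mobile_rmsd, angle, cm, cr), ref_rmsd)
print(round(fit_rmsd, 6), round(motif_rmsd, 6))
```

The superposition atoms align essentially exactly, while the separately measured atom keeps its displacement of 1, which is the kind of information a combined fit over all atoms would smear out.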
- __init__(ref_col=None, super_target_motif=None, super_ref_motif=None, super_atoms=None, rmsd_target_motif=None, rmsd_ref_motif=None, rmsd_atoms=None, super_include_het_atoms=False, rmsd_include_het_atoms=False, jobstarter=None, overwrite=False)[source]
Initialize the MotifSeparateSuperpositionRMSD class.
This constructor sets up the MotifSeparateSuperpositionRMSD instance with default or provided parameters. It configures the reference column, superposition target motif, superposition reference motif, RMSD target motif, RMSD reference motif, atom selections, inclusion of heteroatoms, jobstarter, and overwrite options for RMSD calculations.
- Parameters:
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
super_target_motif (str, optional) – The target motif for superpositioning. Defaults to None.
super_ref_motif (str, optional) – The reference motif for superpositioning. Defaults to None.
super_atoms (list, optional) – The atom names for superpositioning. Defaults to None.
super_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) in superpositioning. Defaults to False.
rmsd_target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.
rmsd_ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.
rmsd_atoms (list, optional) – The atom names for RMSD calculations. Defaults to None.
rmsd_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) for RMSD calculations. Defaults to False.
jobstarter (JobStarter, optional) – The jobstarter configuration for running the RMSD calculations. Defaults to None.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
None
Examples
Here is an example of how to initialize the MotifSeparateSuperpositionRMSD class:
from rmsd import MotifSeparateSuperpositionRMSD
# Initialize the MotifSeparateSuperpositionRMSD class with default parameters
motif_rmsd = MotifSeparateSuperpositionRMSD()
# Initialize the MotifSeparateSuperpositionRMSD class with custom parameters
motif_rmsd = MotifSeparateSuperpositionRMSD(
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=None,
    rmsd_include_het_atoms=True,
    overwrite=True
)
- Further Details:
Default Values: If no parameters are provided, the class initializes with default values suitable for basic motif-specific RMSD calculations.
Parameter Storage: The parameters provided during initialization are stored as instance variables, which are used in subsequent method calls.
Custom Configuration: Users can customize the calculation by providing specific values for the reference column, the superposition and RMSD motifs (target and reference), the atom selections, heteroatom inclusion, the jobstarter, and the overwrite option.
- run(poses, prefix, jobstarter=None, ref_col=None, super_ref_motif=None, super_target_motif=None, super_atoms=None, rmsd_ref_motif=None, rmsd_target_motif=None, rmsd_atoms=None, rmsd_include_het_atoms=False, super_include_het_atoms=False, overwrite=False)[source]
Superposition on one motif and calculate the RMSD on another for given poses and jobstarter configuration.
This method sets up and runs the motif-specific superposition and RMSD calculation process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
prefix (str) – A prefix used to name and organize the output files.
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
ref_col (str, optional) – The reference column for RMSD calculations. Defaults to None.
super_target_motif (str, optional) – The target motif for superpositioning. Defaults to None.
super_ref_motif (str, optional) – The reference motif for superpositioning. Defaults to None.
super_atoms (list, optional) – The atom names for superpositioning. Defaults to None.
super_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) in superpositioning. Defaults to False.
rmsd_target_motif (str, optional) – The target motif for RMSD calculations. Defaults to None.
rmsd_ref_motif (str, optional) – The reference motif for RMSD calculations. Defaults to None.
rmsd_atoms (list, optional) – The atom names for RMSD calculations. Defaults to None.
rmsd_include_het_atoms (bool, optional) – Inclusion of heteroatoms (e.g. from ligands) for RMSD calculations. Defaults to False.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
- Returns:
An instance of the Poses class, containing the processed poses and results of the RMSD calculation.
- Return type:
Poses
- Raises:
FileNotFoundError – If required files or directories are not found during the execution process.
ValueError – If invalid arguments are provided to the method.
TypeError – If motifs or atoms are not of the expected type.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from rmsd import MotifSeparateSuperpositionRMSD
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()
# Run the motif RMSD calculation
results = motif_rmsd.run(
    poses=poses,
    prefix="experiment_2",
    jobstarter=jobstarter,
    ref_col="reference",
    super_ref_motif="motif_A",
    super_target_motif="motif_B",
    super_atoms=["CA", "CB"],
    rmsd_ref_motif="motif_C",
    rmsd_target_motif="motif_D",
    rmsd_atoms=["CA"],
    overwrite=True
)
# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed. It supports splitting poses into sublists for parallel processing.
Input Handling: The method prepares input JSON files for each sublist of poses and constructs commands for running motif-specific RMSD calculations.
Output Management: The method handles the collection and processing of output data from multiple score files, concatenating them into a single DataFrame and saving the results.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the motif RMSD calculation process to their specific needs, including specifying reference and target motifs, as well as atoms for RMSD calculations.
This method is designed to streamline the execution of motif-specific RMSD calculations within the ProtFlow framework, making it easier for researchers and developers to perform and analyze motif-specific RMSD calculations.
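The splitting of poses into sublists for parallel processing can be pictured with a short sketch. The helper name `split_evenly` and the placeholder file names are assumptions made for illustration; how ProtFlow actually partitions poses is not shown in this documentation.

```python
def split_evenly(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous sublists of near-equal size."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

pose_paths = [f"pose_{i:04d}.pdb" for i in range(1, 8)]  # placeholder file names
sublists = split_evenly(pose_paths, n_chunks=3)
for index, sublist in enumerate(sublists):
    print(index, sublist)
```

Each sublist would then get its own input JSON file and its own command, so that a jobstarter can run the sublists side by side.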
- set_jobstarter(jobstarter)[source]
Set the jobstarter configuration for the MotifSeparateSuperpositionRMSD runner.
- Parameters:
jobstarter (JobStarter) – The jobstarter configuration.
- Raises:
ValueError – If jobstarter is not of type JobStarter.
- Return type:
None
- set_ref_col(col)[source]
Set the reference column for RMSD calculations.
- Parameters:
col (str) – The reference column name.
- Return type:
None
- set_rmsd_atoms(atoms=None)[source]
Set the atoms used for RMSD calculations.
- Parameters:
atoms (list[str], optional) – The atom names used for RMSD calculations. Defaults to None.
- Return type:
None
- set_rmsd_include_het_atoms(include_het_atoms)[source]
Set whether heteroatoms (e.g. from ligands) are included in RMSD calculations.
- Parameters:
include_het_atoms (bool)
- Return type:
None
- set_rmsd_ref_motif(motif)[source]
Set the reference motif for RMSD calculations. motif must be a string and should name a column in the poses.df of the Poses object passed to .run().
- Parameters:
motif (str)
- Return type:
None
- set_rmsd_target_motif(motif)[source]
Set the target motif for RMSD calculations. motif must be a string and should name a column in the poses.df of the Poses object passed to .run().
- Parameters:
motif (str)
- Return type:
None
- set_super_atoms(atoms=None)[source]
Set the atoms used for superposition.
- Parameters:
atoms (list[str], optional) – The atom names used for superposition. Defaults to None.
- Return type:
None
- set_super_include_het_atoms(include_het_atoms)[source]
Set whether heteroatoms (e.g. from ligands) are included in superpositioning.
- Parameters:
include_het_atoms (bool)
- Return type:
None
- set_super_ref_motif(motif)[source]
Set the reference motif for superpositioning. motif must be a string and should name a column in the poses.df of the Poses object passed to .run().
- Parameters:
motif (str)
- Return type:
None
- set_super_target_motif(motif)[source]
Set the target motif for superpositioning. motif must be a string and should name a column in the poses.df of the Poses object passed to .run().
- Parameters:
motif (str)
- Return type:
None
- setup_input_dict(poses, ref_col, ref_motif=None, target_motif=None, rmsd_ref_motif=None, rmsd_target_motif=None)[source]
Set up the input dictionary for motif RMSD calculations.
This method prepares a dictionary that can be written to a JSON file and used as input for the motif RMSD calculation script. The dictionary contains mappings of poses to reference PDB files, target motifs, and reference motifs.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
ref_col (str) – The reference column for RMSD calculations.
ref_motif (Any, optional) – The reference motif for superposition. Defaults to None.
target_motif (Any, optional) – The target motif for superposition. Defaults to None.
rmsd_ref_motif (Any, optional) – The reference motif for RMSD calculations. Defaults to None.
rmsd_target_motif (Any, optional) – The target motif for RMSD calculations. Defaults to None.
- Returns:
A dictionary structured for input to the motif RMSD calculation script.
- Return type:
dict
- Raises:
TypeError – If ref_motif or target_motif is not of the expected type.
Examples
Here is an example of how to use the setup_input_dict method:
from rmsd import MotifSeparateSuperpositionRMSD
from protflow.poses import Poses
# Initialize the MotifSeparateSuperpositionRMSD class
motif_rmsd = MotifSeparateSuperpositionRMSD()
# Create a Poses object
poses = Poses()
# Set up the input dictionary for RMSD calculations
input_dict = motif_rmsd.setup_input_dict(
    poses=poses,
    ref_col="reference",
    ref_motif="motif_A",
    target_motif="motif_B"
)
# Print the input dictionary
print(input_dict)
- Further Details:
Dictionary Structure: The input dictionary maps each pose to its reference PDB file, target motif, and reference motif.
Parameter Handling: The method handles different types of inputs for motifs, ensuring that they are correctly formatted for the RMSD calculation script.
Integration: The input dictionary prepared by this method is used by the run method to execute motif RMSD calculations.
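A rough picture of such an input dictionary is sketched below. Everything about its layout is an assumption for illustration: the key names (`ref_pdb`, `super_ref_motif`, and so on), the residue labels, and the use of a plain list of row dictionaries as a stand-in for poses.df are hypothetical, and the actual structure produced by setup_input_dict may differ.

```python
import json

def build_input_dict(rows, ref_col, ref_motif, target_motif,
                     rmsd_ref_motif, rmsd_target_motif):
    """Map each pose path to the per-pose inputs an RMSD script would need.

    Every motif argument is treated as a column name whose value is looked up per row.
    """
    return {
        row["poses"]: {
            "ref_pdb": row[ref_col],
            "super_ref_motif": row[ref_motif],
            "super_target_motif": row[target_motif],
            "rmsd_ref_motif": row[rmsd_ref_motif],
            "rmsd_target_motif": row[rmsd_target_motif],
        }
        for row in rows
    }

# Stand-in for two rows of poses.df (all values are placeholders).
rows = [
    {"poses": "pose_0001.pdb", "reference": "ref_0001.pdb",
     "motif_A": ["A10", "A11"], "motif_B": ["A20", "A21"],
     "motif_C": ["B5"], "motif_D": ["B7"]},
    {"poses": "pose_0002.pdb", "reference": "ref_0002.pdb",
     "motif_A": ["A10", "A11"], "motif_B": ["A22", "A23"],
     "motif_C": ["B5"], "motif_D": ["B9"]},
]

input_dict = build_input_dict(rows, "reference", "motif_A", "motif_B", "motif_C", "motif_D")
print(json.dumps(input_dict, indent=2))
```

Keeping the structure JSON-serializable is what allows the dictionary to be written to a file and handed to a separate calculation script.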
protflow.metrics.tmscore module
TMscore Module
This module provides the functionality to integrate TMscore calculations within the ProtFlow framework. It offers tools to run TMscore and TMalign, handle their inputs and outputs, and process the resulting data in a structured and automated manner.
Detailed Description
The TMalign and TMscore classes encapsulate the functionality necessary to execute TM-align and TM-score runs, respectively. These classes manage the configuration of paths to essential scripts and Python executables, set up the environment, and handle the execution of scoring processes. They include methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.
The module is designed to streamline the integration of TM-align and TM-score into larger computational workflows. It supports the automatic setup of job parameters, execution of TM-align/TM-score commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.
Usage
To use this module, create an instance of the TMalign or TMscore class and invoke its run method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the scoring process is provided through various parameters, allowing for customized runs tailored to specific research needs.
Examples
Here is an example of how to initialize and use the TMalign class within a ProtFlow pipeline:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMalign class
tmalign = TMalign()
# Run the alignment process
results = tmalign.run(
poses=poses,
prefix="experiment_1",
ref_col="reference_pdb",
sc_tm_score=True,
options="-a",
pose_options=["-b"],
overwrite=True
)
# Access and process the results
print(results)
Here is an example of how to initialize and use the TMscore class within a ProtFlow pipeline:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMscore
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMscore class
tmscore = TMscore()
# Run the scoring process
results = tmscore.run(
poses=poses,
prefix="experiment_2",
ref_col="reference_pdb",
options="-c",
pose_options=["-d"],
overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the scoring process.
Customizability: Users can customize the scoring process through multiple parameters, including specific options for the TM-align or TM-score scripts, and options for handling pose-specific parameters.
Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to incorporate TM-align or TM-score into their protein structure comparison and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Author
Markus Braun, Adrian Tripp
Version
0.1.0
- class protflow.metrics.tmscore.TMalign(jobstarter=None, application=None)[source]
Bases: Runner
TMalign Class
The TMalign class is a specialized class designed to facilitate the execution of TMalign within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with TMalign processes.
Detailed Description
The TMalign class manages all aspects of running TMalign. It handles the configuration of necessary scripts and executables, prepares the environment for alignment processes, and executes the alignment commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to TMalign executables.
Configuring job starter options, either automatically or manually.
Handling the execution of TMalign commands with support for various alignment options.
Collecting and processing output data into a pandas DataFrame.
Normalizing TM scores based on the reference structure and calculating self-consistency scores.
- Returns:
An instance of the TMalign class, configured to run TMalign processes and handle outputs efficiently.
- Raises:
FileNotFoundError
ValueError
RuntimeError
Examples
Here is an example of how to initialize and use the TMalign class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMalign class
tmalign = TMalign()
# Run the alignment process
results = tmalign.run(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_pdb",
    sc_tm_score=True,
    options="-a",
    pose_options=["-b"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the alignment process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
Difference Between TMscore and TMalign
TMscore: This class wraps the TM-score program, which scores two structures using the residue correspondence they already share (matching residue numbering) rather than searching for a new alignment. It is suitable for comparing the overall similarity of protein structures in a sequence-length independent manner. Use TMscore when pose and reference contain the same residues in the same order, for example a design and its predicted structure.
TMalign: This class wraps TM-align, which first derives a sequence-independent structural alignment between the two structures and then reports the TM-score for that alignment. Use TMalign when the residue correspondence is unknown, for example when the structures differ in sequence or length.
The TMalign class is intended for researchers and developers who need to perform TMalign alignments as part of their protein structure comparison and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
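For readers unfamiliar with the quantity both classes report, the standard TM-score definition can be written out in a few lines. This is a sketch of the published formula only, not of ProtFlow or of the TM-align/TM-score executables: the function names are hypothetical, the per-residue distances are assumed to come from an already superposed and aligned pair, and the lower bound of 0.5 on d0 for short chains is an assumption made here to keep the formula well defined.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 used by the TM-score."""
    if length <= 15:
        return 0.5  # assumed floor; the formula below is undefined here
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, norm_length):
    """TM-score from per-residue distances (in Angstrom) of aligned residue pairs.

    `norm_length` is the length of the structure used for normalization,
    e.g. the reference structure.
    """
    d0 = tm_d0(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length

# 80 perfectly matching residue pairs, normalized by two different lengths.
distances = [0.0] * 80
print(tm_score(distances, norm_length=100))  # 0.8
print(tm_score(distances, norm_length=80))   # 1.0
```

The two printed values show why the choice of normalizing structure matters: the same set of aligned residues yields a lower score when normalized by a longer reference.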
- __init__(jobstarter=None, application=None)[source]
Initialize the TMalign class with optional jobstarter and application path.
This method sets up the TMalign class by configuring the jobstarter and the path to the TMalign executable. It ensures that the necessary components are ready for executing TMalign processes.
- Parameters:
jobstarter (JobStarter, optional) – An optional jobstarter configuration. Defaults to None.
application (str, optional) – Path to the TMalign executable. If not provided, it defaults to the TMalign executable in the ProtFlow environment.
- Raises:
ValueError – If the TMalign executable is not found in the specified environment.
Examples
Here is an example of how to initialize the TMalign class:
from protflow.jobstarters import LocalJobStarter
from tmscore import TMalign
# Initialize the TMalign class
tmalign = TMalign(
    jobstarter=LocalJobStarter(max_cores=4),
    application="/path/to/TMalign"
)
# Check the instance
print(tmalign)
- Further Details:
Jobstarter Configuration: This parameter allows setting up the jobstarter for managing job execution.
Application Path: This parameter sets the path to the TMalign executable, ensuring the correct executable is used for alignment processes.
This method is designed to prepare the TMalign class for executing TMalign processes, ensuring that all necessary configurations are in place.
- collect_scores(output_dir)[source]
Collect scores from TMalign output files.
This method collects and processes the scores from the output files generated by TMalign. It reads the scores, extracts relevant information, and organizes the data into a structured pandas DataFrame.
- Parameters:
output_dir (str) – The directory where TMalign output files are located.
- Returns:
A DataFrame containing the collected scores.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If no TM scores are found in the output files.
Examples
Here is an example of how to use the collect_scores method:
from tmscore import TMalign
# Initialize the TMalign class
tmalign = TMalign()
# Collect scores
scores_df = tmalign.collect_scores(
    output_dir="output/"
)
# Print the scores DataFrame
print(scores_df)
- Further Details:
Score Extraction: The method reads the output files, extracts relevant scores, and organizes them into a pandas DataFrame.
Validation: Ensures that the scores are correctly extracted and that no errors occurred during the process.
This method is designed to streamline the collection and processing of scores from TMalign output files, ensuring that all relevant data is accurately captured and organized.
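The score extraction step can be illustrated with a small parser. This is a sketch under assumptions, not the parsing code used by collect_scores: the sample text only imitates the style of TM-align output, the regular expression and the row keys (`description`, `tm_scores`) are hypothetical, and the rows are plain dictionaries that could later be turned into a DataFrame.

```python
import re

# Matches lines such as "TM-score= 0.91234 (...)" and captures the number.
TM_PATTERN = re.compile(r"TM-score\s*=\s*([0-9]*\.?[0-9]+)")

def parse_tm_scores(text):
    """Return every TM-score value found in one output text, in order of appearance."""
    return [float(value) for value in TM_PATTERN.findall(text)]

def collect_rows(outputs):
    """Build one row per output; `outputs` maps a pose name to its output text."""
    rows = []
    for name, text in outputs.items():
        scores = parse_tm_scores(text)
        if scores:
            rows.append({"description": name, "tm_scores": scores})
    if not rows:
        raise RuntimeError("No TM scores found in the provided outputs.")
    return rows

# Illustrative sample text, not verbatim program output.
sample = (
    "Aligned length= 120, RMSD= 1.25\n"
    "TM-score= 0.91234 (if normalized by length of Chain_1)\n"
    "TM-score= 0.88001 (if normalized by length of Chain_2)\n"
)
rows = collect_rows({"pose_0001": sample})
print(rows)
```

Raising RuntimeError when nothing is found mirrors the behavior documented for collect_scores, so that an empty or failed run is reported instead of producing an empty table.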
- prep_ref(ref, poses)[source]
Prepare the reference structures for TMalign.
This method prepares the reference structures for alignment based on the provided reference column or specific PDB file. It ensures that the references are correctly formatted for the TMalign process.
- Parameters:
ref (str) – The reference structure, either as a path to a PDB file or as a column name in the Poses DataFrame.
poses (Poses) – The Poses object containing the protein structures.
- Returns:
A list of reference paths for each pose.
- Return type:
list
- Raises:
ValueError – If the ref parameter is not a string or if the reference column is missing from the Poses DataFrame.
Examples
Here is an example of how to use the prep_ref method:
from protflow.poses import Poses
from tmscore import TMalign
# Create instances of necessary classes
poses = Poses()
tmalign = TMalign()
# Prepare reference structures
ref_list = tmalign.prep_ref(
    ref="reference_pdb",
    poses=poses
)
# Print the reference list
print(ref_list)
- Further Details:
Reference Handling: The method can handle both a single PDB file and a column name referring to multiple PDB files within the Poses DataFrame.
Validation: Ensures that the provided reference is valid and exists in the Poses DataFrame if specified as a column name.
This method is designed to streamline the preparation of reference structures for TMalign processes, ensuring that all references are correctly formatted and validated.
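The two accepted forms of the reference, a column name or a single PDB path, suggest logic along the following lines. This is a hedged sketch rather than the actual prep_ref code: `resolve_references` is a hypothetical name, a dictionary of column lists stands in for poses.df, and the rule for recognizing a file path (a `.pdb` suffix) is an assumption; the real method may check that the file exists instead.

```python
def resolve_references(ref, table):
    """Return one reference path per pose.

    `table` maps column names to lists of values and must contain a "poses" column.
    """
    if not isinstance(ref, str):
        raise ValueError("ref must be a string (column name or path to a .pdb file).")
    if ref in table:
        # Column name: every pose has its own reference.
        return list(table[ref])
    if ref.endswith(".pdb"):
        # Single file: the same reference is used for all poses.
        return [ref] * len(table["poses"])
    raise ValueError(f"'{ref}' is neither a known column nor a .pdb path.")

table = {
    "poses": ["pose_0001.pdb", "pose_0002.pdb"],
    "reference_pdb": ["ref_0001.pdb", "ref_0002.pdb"],
}
print(resolve_references("reference_pdb", table))
print(resolve_references("native.pdb", table))
```

Checking for a column first means that a column name always takes precedence over a file path of the same name, which is a design choice of this sketch.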
- run(poses, prefix, ref_col, sc_tm_score=True, options=None, pose_options=None, overwrite=False, jobstarter=None)[source]
Execute the TMalign process with given poses and jobstarter configuration.
This method sets up and runs the TMalign process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (Poses) – The Poses object containing the protein structures.
prefix (str) – A prefix used to name and organize the output files.
ref_col (str) – Column containing paths to PDB files used as reference for TM score calculation. Can also be a path to a single reference .pdb file.
sc_tm_score (bool, optional) – If True, calculates the self-consistency TM score for each backbone in ref_col and adds it into the column {prefix}_sc_tm. Defaults to True.
options (str, optional) – Additional command-line options for the TMalign script. Defaults to None.
pose_options (str, optional) – Name of poses.df column containing options for TMalign. Defaults to None.
overwrite (bool, optional) – If True, overwrite existing output files. Defaults to False.
jobstarter (JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
- Returns:
An instance of the RunnerOutput class, containing the processed poses and results of the TMalign process.
- Return type:
- Raises:
FileNotFoundError – If the TMalign executable is not found in the specified environment.
ValueError – If invalid arguments are provided to the method or if required reference columns are missing.
RuntimeError – If no TM scores are found in the output files.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMalign
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMalign class
tmalign = TMalign()
# Run the alignment process
results = tmalign.run(
    poses=poses,
    prefix="experiment_1",
    ref_col="reference_pdb",
    sc_tm_score=True,
    options="-a",
    pose_options=["-b"],
    overwrite=True
)
# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.
Reference Preparation: The method prepares the reference structures for alignment based on the provided reference column or specific PDB file.
Output Management: The method handles the collection and processing of output data, including merging and normalizing TM scores, ensuring that results are organized and accessible for further analysis.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the alignment process to their specific needs.
This method is designed to streamline the execution of TMalign processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein structure alignments.
- write_cmd(pose_path, ref_path, output_dir, options=None, pose_options=None)[source]
Write the command to run TMalign.
This method constructs the command to execute TMalign based on the provided parameters. It formats the options and flags correctly and sets up the command to be run in the environment.
- Parameters:
pose_path (
str) – The path to the pose file.ref_path (
str) – The path to the reference file.output_dir (
str) – The directory where output files will be saved.options (
str, optional) – Additional command-line options for TMalign. Defaults to None.pose_options (
str, optional) – Pose-specific options for TMalign. Defaults to None.
- Returns:
The constructed command string to run TMalign.
- Return type:
str
Examples
Here is an example of how to use the write_cmd method:
from tmscore import TMalign
# Initialize the TMalign class
tmalign = TMalign()
# Write the command
cmd = tmalign.write_cmd(
    pose_path="pose.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
    pose_options="-b"
)
# Print the command
print(cmd)
- Further Details:
Command Construction: The method constructs the command string by parsing and formatting the provided options and pose-specific options.
Output Management: Ensures that the output files are correctly named and saved in the specified directory.
This method is designed to streamline the construction of commands for TMalign processes, ensuring that all necessary options are correctly formatted and included.
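A command string of the kind described here might be assembled as follows. The sketch is illustrative only: `build_tmalign_cmd` is a hypothetical helper, the `.tmout` file extension and the redirection of standard output into a per-pose file are assumptions, and `-a` is used as a placeholder option exactly as in the examples of this documentation.

```python
import os
import shlex

def build_tmalign_cmd(application, pose_path, ref_path, output_dir,
                      options=None, pose_options=None):
    """Assemble a shell command that runs the executable on a pose/reference pair
    and redirects its output into a file named after the pose."""
    name = os.path.splitext(os.path.basename(pose_path))[0]
    out_file = f"{output_dir.rstrip('/')}/{name}.tmout"
    parts = [shlex.quote(application), shlex.quote(pose_path), shlex.quote(ref_path)]
    # General options come first, pose-specific options after them.
    parts.extend(opt for opt in (options, pose_options) if opt)
    return f"{' '.join(parts)} > {shlex.quote(out_file)}"

cmd = build_tmalign_cmd(
    application="TMalign",
    pose_path="inputs/pose_0001.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
)
print(cmd)
```

Quoting the paths with shlex.quote keeps the command valid when file names contain spaces or shell metacharacters, while the option strings are passed through unchanged because they are meant to expand into several arguments.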
- Parameters:
jobstarter (JobStarter)
application (str)
- class protflow.metrics.tmscore.TMscore(jobstarter=None, application=None)[source]
Bases: Runner
TMscore Class
The TMscore class is a specialized class designed to facilitate the execution of TMscore within the ProtFlow framework. It extends the Runner class and incorporates specific methods to handle the setup, execution, and data collection associated with TMscore processes.
Detailed Description
The TMscore class manages all aspects of running TMscore. It handles the configuration of necessary scripts and executables, prepares the environment for scoring processes, and executes the scoring commands. Additionally, it collects and processes the output data, organizing it into a structured format for further analysis.
- Key functionalities include:
Setting up paths to TMscore executables.
Configuring job starter options, either automatically or manually.
Handling the execution of TMscore commands with support for various scoring options.
Collecting and processing output data into a pandas DataFrame.
Difference Between TMscore and TMalign
TMscore: This class wraps the TM-score program, which scores two structures using the residue correspondence they already share (matching residue numbering) rather than searching for a new alignment. It is suitable for comparing the overall similarity of protein structures in a sequence-length independent manner. Use TMscore when pose and reference contain the same residues in the same order, for example a design and its predicted structure.
TMalign: This class wraps TM-align, which first derives a sequence-independent structural alignment between the two structures and then reports the TM-score for that alignment. Use TMalign when the residue correspondence is unknown, for example when the structures differ in sequence or length.
- rtype:
An instance of the TMscore class, configured to run TMscore processes and handle outputs efficiently.
- raises FileNotFoundError:
- raises ValueError:
- raises RuntimeError:
Examples
Here is an example of how to initialize and use the TMscore class:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMscore
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMscore class
tmscore = TMscore()
# Run the scoring process
results = tmscore.run(
    poses=poses,
    prefix="experiment_2",
    ref_col="reference_pdb",
    options="-c",
    pose_options=["-d"],
    overwrite=True
)
# Access and process the results
print(results)
Further Details
Edge Cases: The class includes handling for various edge cases, such as empty pose lists, the need to overwrite previous results, and the presence of existing score files.
Customization: The class provides extensive customization options through its parameters, allowing users to tailor the scoring process to their specific needs.
Integration: Seamlessly integrates with other ProtFlow components, leveraging shared configurations and data structures for a unified workflow.
The TMscore class is intended for researchers and developers who need to perform TMscore calculations as part of their protein structure comparison and analysis workflows. It simplifies the process, allowing users to focus on analyzing results and advancing their research.
- __init__(jobstarter=None, application=None)[source]
Initialize the TMscore class with optional jobstarter and application path.
This method sets up the TMscore class by configuring the jobstarter and the path to the TMscore executable. It ensures that the necessary components are ready for executing TMscore processes.
- Parameters:
Examples
Here is an example of how to initialize the TMscore class:
from protflow.jobstarters import LocalJobStarter
from tmscore import TMscore
# Initialize the TMscore class
tmscore = TMscore(
    jobstarter=LocalJobStarter(max_cores=4),
    application="/path/to/TMscore"
)
# Check the instance
print(tmscore)
- Further Details:
Jobstarter Configuration: This parameter allows setting up the jobstarter for managing job execution.
Application Path: This parameter sets the path to the TMscore executable, ensuring the correct executable is used for scoring processes.
This method is designed to prepare the TMscore class for executing TMscore processes, ensuring that all necessary configurations are in place.
- collect_scores(output_dir)[source]
Collect scores from TMscore output files.
This method collects and processes the scores from the output files generated by TMscore. It reads the scores, extracts relevant information, and organizes the data into a structured pandas DataFrame.
- Parameters:
output_dir (str) – The directory where TMscore output files are located.
- Returns:
A DataFrame containing the collected scores.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If no TM scores are found in the output files.
Examples
Here is an example of how to use the collect_scores method:
from tmscore import TMscore
# Initialize the TMscore class
tmscore = TMscore()
# Collect scores
scores_df = tmscore.collect_scores(
    output_dir="output/"
)
# Print the scores DataFrame
print(scores_df)
- Further Details:
Score Extraction: The method reads the output files, extracts relevant scores, and organizes them into a pandas DataFrame.
Validation: Ensures that the scores are correctly extracted and that no errors occurred during the process.
This method is designed to streamline the collection and processing of scores from TMscore output files, ensuring that all relevant data is accurately captured and organized.
- run(poses, prefix, ref_col, options=None, pose_options=None, overwrite=False, jobstarter=None)[source]
Execute the TMscore process with given poses and jobstarter configuration.
This method sets up and runs the TMscore process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.
- Parameters:
poses (
Poses) – The Poses object containing the protein structures.prefix (
str) – A prefix used to name and organize the output files.ref_col (
str) – Column containing paths to PDB files used as reference for TM score calculation.options (
str, optional) – Additional command-line options for the TMscore script. Defaults to None.pose_options (
str, optional) – Name of poses.df column containing options for TMscore. Defaults to None.overwrite (
bool, optional) – If True, overwrite existing output files. Defaults to False.jobstarter (
JobStarter, optional) – An instance of the JobStarter class, which manages job execution. Defaults to None.
- Returns:
An instance of the RunnerOutput class, containing the processed poses and results of the TMscore process.
- Return type:
- Raises:
FileNotFoundError – If the TMscore executable is not found in the specified environment.
ValueError – If invalid arguments are provided to the method or if required reference columns are missing.
RuntimeError – If no TM scores are found in the output files.
Examples
Here is an example of how to use the run method:
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from tmscore import TMscore
# Create instances of necessary classes
poses = Poses()
jobstarter = JobStarter()
# Initialize the TMscore class
tmscore = TMscore()
# Run the scoring process
results = tmscore.run(
    poses=poses,
    prefix="experiment_2",
    ref_col="reference_pdb",
    jobstarter=jobstarter,
    options="-c",
    pose_options=["-d"],
    overwrite=True
)
# Access and process the results
print(results)
- Further Details:
Setup and Execution: The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.
Reference Handling: The method validates the reference column and prepares the reference structures for scoring.
Output Management: The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.
Customization: Extensive customization options are provided through parameters, allowing users to tailor the scoring process to their specific needs.
This method is designed to streamline the execution of TMscore processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein structure comparisons.
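The reference-handling step can be illustrated with a short sketch. It is not taken from the ProtFlow source: the helper name pair_poses_with_references is hypothetical, rows are represented as plain dictionaries rather than a poses.df DataFrame, and the "poses" key holding the pose path is an assumption made for this example.

```python
def pair_poses_with_references(rows, ref_col, pose_col="poses"):
    """Hypothetical helper: validate ref_col and pair each pose with its reference."""
    pairs = []
    for index, row in enumerate(rows):
        if ref_col not in row:
            # A missing reference column is treated as an invalid argument.
            raise ValueError(f"Row {index} has no column '{ref_col}'")
        reference = row[ref_col]
        if not isinstance(reference, str) or not reference:
            raise ValueError(f"Row {index} has no usable reference path in '{ref_col}'")
        pairs.append((row[pose_col], reference))
    return pairs
```

Each resulting (pose, reference) pair corresponds to one TMscore invocation; checking the column before any job is started means a misspelled ref_col surfaces as a ValueError rather than as a failed job.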
- write_cmd(pose_path, ref_path, output_dir, options=None, pose_options=None)[source]
Write the command to run TMscore.
This method constructs the command to execute TMscore based on the provided parameters. It formats the options and flags correctly and sets up the command to be run in the environment.
- Parameters:
pose_path (str) – The path to the pose file.
ref_path (str) – The path to the reference file.
output_dir (str) – The directory where output files will be saved.
options (str, optional) – Additional command-line options for TMscore. Defaults to None.
pose_options (str, optional) – Pose-specific options for TMscore. Defaults to None.
- Returns:
The constructed command string to run TMscore.
- Return type:
str
Examples
Here is an example of how to use the write_cmd method:
from tmscore import TMscore
# Initialize the TMscore class
tmscore = TMscore()
# Write the command
cmd = tmscore.write_cmd(
    pose_path="pose.pdb",
    ref_path="reference.pdb",
    output_dir="output/",
    options="-a",
    pose_options="-b"
)
# Print the command
print(cmd)
- Further Details:
Command Construction: The method constructs the command string by parsing and formatting the provided options and pose-specific options.
Output Management: Ensures that the output files are correctly named and saved in the specified directory.
This method is designed to streamline the construction of commands for TMscore processes, ensuring that all necessary options are correctly formatted and included.
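As a rough illustration of the command construction described above, the sketch below assembles a shell command from a pose path, a reference path, and two option strings. It is not the ProtFlow implementation: the helper name build_tmscore_cmd, the "TMscore" executable name, the ".tmout" output suffix, and the redirection of standard output into a file are assumptions made for this example.

```python
import shlex
from pathlib import Path


def build_tmscore_cmd(pose_path, ref_path, output_dir,
                      options=None, pose_options=None, executable="TMscore"):
    """Hypothetical helper: assemble a shell command string for one pose."""
    # Name the output file after the pose so that results can be matched back later.
    out_file = Path(output_dir) / f"{Path(pose_path).stem}.tmout"
    parts = [executable, shlex.quote(pose_path), shlex.quote(ref_path)]
    # General options come first, pose-specific options second.
    for opts in (options, pose_options):
        if opts:
            parts.extend(shlex.quote(token) for token in shlex.split(opts))
    return " ".join(parts) + f" > {shlex.quote(str(out_file))}"
```

Quoting each token with shlex.quote keeps paths containing spaces intact when the string is later handed to a shell.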
Module contents
protflow.metrics subpackage init