ProtFlow documentation
Subpackages
- protflow.metrics package
- protflow.tools package
- Submodules
- protflow.tools.boltz module
- protflow.tools.attnpacker module
- protflow.tools.colabfold module
- protflow.tools.esmfold module
- protflow.tools.ligandmpnn module
- protflow.tools.protein_edits module
- protflow.tools.protein_generator module
- protflow.tools.residue_selectors module
- protflow.tools.rfdiffusion module
- protflow.tools.rosetta module
- Module contents
- protflow.utils package
protflow.config_template module
This module contains the paths to all tools integrated in ProtFlow. PRE_CMD entries are commands that are run before the runner is executed (e.g., if a specific module must be imported for the environment to work).
protflow.jobstarters module
jobstarters
This module, jobstarters, provides a set of classes and methods to facilitate the submission and management of computing jobs on various job scheduling systems. JobStarters are passed to Runner objects in their .run() methods to facilitate a standardized execution of commands generated by the Runner. JobStarters can also be executed outside of Runner classes as is shown in the examples.
The JobStarter class defines a base JobStarter class with methods that need to be implemented by subclasses to start jobs and wait for their completion.
Overview
The module includes the following classes and methods:
Classes
JobStarter: An abstract base class that defines the interface for all jobstarters.
SbatchArrayJobstarter: A concrete implementation of JobStarter for managing SLURM job arrays.
LocalJobStarter: A concrete implementation of JobStarter for managing local jobs.
Usage
To use a jobstarter, instantiate an appropriate subclass (e.g., SbatchArrayJobstarter) and call its start method with the desired commands and options. Use the wait_for_job method if you need to wait for job completion.
Example
>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")
Note
This module is designed to be extended with additional jobstarters for other scheduling systems as needed. If you want to implement your own JobStarter and need assistance, please contact any of the authors of ProtFlow. Contributions are welcome!
- class protflow.jobstarters.JobStarter(max_cores=None)[source]
Bases: object
Abstract base class for job starters.
This class defines the interface for all job starters. Subclasses should implement methods to start jobs and wait for their completion. It also includes a method to set the maximum number of cores available for the jobs.
Examples
This class is designed to be extended by other classes that implement specific job scheduling systems.
Example subclass implementation:
class CustomJobStarter(JobStarter):
    def start(self, cmds, jobname, wait, output_path):
        # Implementation for starting jobs
        pass

    def wait_for_job(self, jobname, interval):
        # Implementation for waiting for job completion
        pass
- Parameters:
max_cores (int)
- __init__(max_cores=None)[source]
Initializes the JobStarter with an optional maximum number of cores.
- Parameters:
max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is None.
- set_max_cores(cores)[source]
Sets the maximum number of cores available for the jobs.
- Parameters:
cores (int) – The maximum number of cores to set.
- Return type:
None
- start(cmds, jobname, wait, output_path)[source]
Submits a list of commands as jobs to the scheduling system.
- Parameters:
- Raises:
NotImplementedError – If this method is not implemented in a subclass.
- Return type:
None
- wait_for_job(jobname, interval)[source]
Waits for a job to complete before proceeding.
- Parameters:
- Raises:
NotImplementedError – If this method is not implemented in a subclass.
- Return type:
None
- class protflow.jobstarters.LocalJobStarter(max_cores=1)[source]
Bases: JobStarter
Jobstarter that runs jobs locally using subprocess.run().
This class extends the JobStarter base class to provide functionality for running jobs locally on the machine. It handles the execution of commands using subprocesses, manages the maximum number of concurrent processes, and captures the output and error logs for each command.
- Parameters:
max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.
- Raises:
ProcessError – If a subprocess crashes during execution.
Examples
Example usage:
>>> from protflow.jobstarters import LocalJobStarter
>>> job_starter = LocalJobStarter(max_cores=2)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")
- __init__(max_cores=1)[source]
Initializes the LocalJobStarter with an optional parameter for maximum cores.
- Parameters:
max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.
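The idea of running commands locally while capping the number of concurrent processes can be sketched as follows. This is a minimal illustration, not ProtFlow's implementation: the helper name run_local is hypothetical, it uses a thread pool to bound concurrency, and it relies on subprocess.CalledProcessError instead of the ProcessError mentioned above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_local(cmds, max_cores=1):
    """Run shell commands with at most max_cores of them active at once.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    def _run(cmd):
        # check=True raises subprocess.CalledProcessError on a non-zero exit code.
        return subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)

    # The pool size bounds how many subprocesses run concurrently.
    with ThreadPoolExecutor(max_workers=max_cores) as pool:
        # pool.map returns results in the same order as the input commands.
        return [result.stdout for result in pool.map(_run, cmds)]


outputs = run_local(["echo one", "echo two", "echo three"], max_cores=2)
```

Capturing stdout per command mirrors the description above of collecting output for each command; writing those outputs to log files under output_path is left out of the sketch.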
- class protflow.jobstarters.SbatchArrayJobstarter(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]
Bases: JobStarter
Jobstarter that manages the submission of job arrays to SLURM clusters.
This class extends the JobStarter base class to provide functionality specific to SLURM job arrays. It handles tasks such as generating command files, submitting jobs using sbatch, and waiting for job completion. It also supports options for GPU usage and automatic cleanup of command files after job completion.
- Parameters:
max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.
remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.
options (str, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.
gpus (bool, optional) – Whether to use GPUs for the job. Default is False.
batch_cmds (int)
- Raises:
TypeError – If the options parameter is not a string or list.
Examples
Example usage:
>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True, options="--time=10:00", gpus=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")
- __init__(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]
Initializes the SbatchArrayJobstarter with optional parameters.
- Parameters:
max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.
remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.
options (str, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.
gpus (bool, optional) – Whether to use GPUs for the job. Default is False.
batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.
Note
The options parameter must be set when the Jobstarter is created, not when the .start function is executed.
- start(cmds, jobname, wait=True, output_path='./', batch_cmds=None)[source]
Writes commands into a command file and starts an SBATCH job running the command file.
- Parameters:
cmds (list) – List of commands to be executed as part of the job array.
jobname (str) – Name of the job.
wait (bool, optional) – Whether to wait for the job to complete before returning. Default is True.
output_path (str, optional) – Path where output files should be stored. Default is “./”.
batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.
- Raises:
RuntimeError – If the SLURM submission fails.
- Return type:
None
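The pattern of writing commands into a command file and running them as a SLURM job array can be sketched as below. This is an assumption-laden illustration rather than the code ProtFlow uses: build_sbatch_array, the command-file name, and the sed-based line lookup are hypothetical choices, and the sketch only builds the sbatch argument list without submitting anything. The flags used (--job-name, --array, --output, --wrap) are standard sbatch options.

```python
import os
import tempfile


def build_sbatch_array(cmds, jobname, output_path, max_cores=100, options=None):
    """Write cmds to a command file and return (cmdfile, sbatch argument list).

    Hypothetical helper for illustration; nothing is submitted here.
    """
    cmdfile = os.path.join(output_path, f"{jobname}_cmds")
    with open(cmdfile, "w", encoding="utf-8") as handle:
        handle.write("\n".join(cmds) + "\n")

    # Each array task reads the line matching its SLURM_ARRAY_TASK_ID and runs it.
    task = f'eval "$(sed -n "${{SLURM_ARRAY_TASK_ID}}p" {cmdfile})"'
    args = [
        "sbatch",
        f"--job-name={jobname}",
        # Tasks 1..N, with at most max_cores of them running at the same time.
        f"--array=1-{len(cmds)}%{max_cores}",
        f"--output={os.path.join(output_path, jobname)}_%A_%a.out",
    ]
    if options:
        # Extra SBATCH options are given as one string, e.g. "--time=10:00".
        args.extend(options.split())
    args.extend(["--wrap", task])
    return cmdfile, args


with tempfile.TemporaryDirectory() as tmp:
    cmdfile, sbatch_args = build_sbatch_array(
        ["echo one", "echo two", "echo three"], "test_job", tmp,
        max_cores=2, options="--time=10:00",
    )
    with open(cmdfile, encoding="utf-8") as handle:
        written = handle.read().splitlines()
```

The resulting list could be passed to subprocess.run on a host where sbatch is available; adding --wait would make sbatch block until the array finishes.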
- protflow.jobstarters.add_timestamp(x)[source]
Adds a unique timestamp to a string using the time library.
This function appends a unique timestamp to the given string. The timestamp is generated using the current time, which ensures that the resulting string is unique in most cases. The timestamp is added as a suffix, separated by an underscore.
- Parameters:
x (str) – The input string to which the timestamp will be added.
- Returns:
The input string with a unique timestamp appended.
- Return type:
str
Examples
>>> add_timestamp("jobname")
'jobname_1632417284'
Notes
The timestamp is derived from the current time in seconds since the epoch, with the fractional part of the seconds included to ensure higher precision and uniqueness.
- protflow.jobstarters.get_SLURM_stats(job_name, start_time=None)[source]
Query sacct and return aggregated resource statistics for a SLURM job array.
Shells out to the SLURM sacct command to retrieve per-task timing and CPU accounting data for all tasks in the array job identified by job_name. The raw per-task records are aggregated into a single summary dictionary, which is returned to the caller.
Warning
This function must be called from the cluster login node or another host that has the sacct binary in its PATH and access to SLURM’s accounting database. Calling it from within a compute-node job step (e.g. inside a running SLURM batch script) will fail because sacct is not available on compute nodes.
- Parameters:
job_name (str) – The SLURM job name to query (passed to sacct --name). This corresponds to the jobname argument supplied to start() and is stored in last_job_name after each submission.
start_time (str, optional) – ISO-8601 datetime string (YYYY-MM-DDTHH:MM:SS) passed to sacct --starttime to restrict results to jobs that began at or after this timestamp. When omitted, sacct returns all matching records regardless of age, which may cause false matches against stale jobs with the same name from earlier sessions. It is strongly recommended to pass the session_start attribute of the enclosing SbatchArrayRunnerTimer to avoid this.
- Returns:
A dictionary containing aggregated statistics.
On success, keys include:
- job_name (str)
The job_name argument echoed back.
- total_cpu_sec (int)
Sum of CPUTimeRaw across all tasks.
- avg_task_runtime_sec (float)
Mean wall-clock elapsed time per task in seconds (2 decimal places).
- max_task_runtime_sec (int)
Wall-clock elapsed time of the longest-running task.
- min_task_runtime_sec (int)
Wall-clock elapsed time of the shortest-running task.
- num_tasks (int)
Total number of task records returned.
- total_cpus_reserved (int)
Sum of AllocCPUS across all tasks.
- state (str)
"COMPLETED" or "MIXED (<states>)".
- queried_after (str or None)
The start_time argument echoed back.
On failure, keys include:
- job_name (str)
The job_name argument echoed back.
- error (str)
Human-readable description of the failure.
- Return type:
dict
- Raises:
None – This function does not propagate exceptions. All errors are caught and returned as a dictionary with an error key.
Notes
The sacct command is invoked with -X (suppress sub-step records), --format JobName,ElapsedRaw,CPUTimeRaw,AllocCPUS,State, -n (no header), and -P (pipe-delimited output). The resulting fields are parsed by position. ElapsedRaw is SLURM’s wall-clock elapsed time for each individual task in seconds; CPUTimeRaw is ElapsedRaw × AllocCPUS and reflects total CPU-core-seconds reserved (not necessarily consumed).
The command is executed as a shell string (shell=True) so that --starttime and other arguments with special characters are handled correctly by the system shell.
Empty or whitespace-only lines in sacct’s stdout are filtered before parsing.
The state aggregation logic is strict: "COMPLETED" is only returned when every task’s state is exactly "COMPLETED" (set equality). A single failed or cancelled task will produce a "MIXED" state.
Examples
Query statistics for a recently submitted job:
from protflow.jobstarters import get_SLURM_stats
stats = get_SLURM_stats("caliby_seqdes", start_time="2025-06-01T12:00:00")
print(stats)
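The parsing and aggregation described in the Notes can be sketched on a captured sacct output string. This is a minimal illustration under assumptions: aggregate_sacct is a hypothetical helper, the error message and the exact formatting of the "MIXED (...)" state are guesses, and the sample text stands in for real sacct -X -n -P output with the fields JobName, ElapsedRaw, CPUTimeRaw, AllocCPUS, State.

```python
def aggregate_sacct(stdout, job_name, start_time=None):
    """Aggregate pipe-delimited sacct records into one summary dictionary.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    # Drop empty or whitespace-only lines, then split fields by position.
    rows = [line.split("|") for line in stdout.splitlines() if line.strip()]
    if not rows:
        return {"job_name": job_name, "error": "no sacct records found"}

    elapsed = [int(row[1]) for row in rows]   # ElapsedRaw
    cpu_time = [int(row[2]) for row in rows]  # CPUTimeRaw
    alloc = [int(row[3]) for row in rows]     # AllocCPUS
    states = {row[4] for row in rows}         # State

    # Strict check: COMPLETED only if every task reports exactly COMPLETED.
    if states == {"COMPLETED"}:
        state = "COMPLETED"
    else:
        state = f"MIXED ({', '.join(sorted(states))})"

    return {
        "job_name": job_name,
        "total_cpu_sec": sum(cpu_time),
        "avg_task_runtime_sec": round(sum(elapsed) / len(elapsed), 2),
        "max_task_runtime_sec": max(elapsed),
        "min_task_runtime_sec": min(elapsed),
        "num_tasks": len(rows),
        "total_cpus_reserved": sum(alloc),
        "state": state,
        "queried_after": start_time,
    }


sample = "job|120|480|4|COMPLETED\n\njob|60|240|4|FAILED\n"
stats = aggregate_sacct(sample, "job", start_time="2025-06-01T12:00:00")
empty = aggregate_sacct("", "job")
```

Keeping the aggregation separate from the subprocess call makes it easy to test without access to a SLURM accounting database.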
- protflow.jobstarters.split_list(input_list, element_length=None, n_sublists=None)[source]
Splits a list into nested sublists with specified lengths or number of sublists.
This function divides the input list into a nested list of sublists. The division can be based on the maximum length of each sublist or the desired number of sublists. Only one of the parameters, element_length or n_sublists, should be specified at a time.
- Parameters:
input_list (list) – The list to be split into sublists.
element_length (int, optional) – The maximum length of each sublist. If specified, the input list will be split into sublists each having up to element_length elements.
n_sublists (int, optional) – The desired number of sublists. If specified, the input list will be divided into n_sublists sublists.
- Returns:
A nested list containing the sublists.
- Return type:
list
- Raises:
ValueError – If both element_length and n_sublists are specified or if neither is specified.
Examples
Splitting a list into sublists of a specified maximum length:
>>> split_list([1, 2, 3, 4, 5, 6], element_length=2)
[[1, 2], [3, 4], [5, 6]]
Splitting a list into a specified number of sublists:
>>> split_list([1, 2, 3, 4, 5, 6], n_sublists=3)
[[1, 2], [3, 4], [5, 6]]
Notes
If n_sublists is specified and is greater than the length of the input list, the number of sublists will be equal to the length of the input list.
If neither element_length nor n_sublists is provided, or if both are provided, a ValueError will be raised.
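The two splitting modes can be sketched as follows. This is an illustrative reimplementation, not the ProtFlow source: split_list_sketch is a hypothetical name, and how leftover elements are distributed when the list does not divide evenly into n_sublists (here, the first sublists receive one extra element) is an assumption.

```python
def split_list_sketch(input_list, element_length=None, n_sublists=None):
    """Split a list by maximum sublist length or by number of sublists.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    # Exactly one of the two options must be given.
    if (element_length is None) == (n_sublists is None):
        raise ValueError("Specify exactly one of element_length or n_sublists.")
    if not input_list:
        return []

    if element_length is not None:
        # Consecutive chunks of up to element_length items.
        return [
            input_list[i:i + element_length]
            for i in range(0, len(input_list), element_length)
        ]

    # Never create more sublists than there are elements.
    n = min(n_sublists, len(input_list))
    base, extra = divmod(len(input_list), n)
    sublists, start = [], 0
    for index in range(n):
        # Assumption: the first `extra` sublists take one additional element.
        size = base + (1 if index < extra else 0)
        sublists.append(input_list[start:start + size])
        start += size
    return sublists


by_length = split_list_sketch([1, 2, 3, 4, 5, 6], element_length=2)
by_count = split_list_sketch([1, 2, 3, 4, 5, 6], n_sublists=3)
capped = split_list_sketch([1, 2], n_sublists=5)
```

The last call illustrates the note above: requesting more sublists than elements yields one sublist per element.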
protflow.poses module
poses Module
This module provides functionalities for handling and manipulating protein data within the ProtFlow framework. It focuses on managing protein data represented as Pandas DataFrames, allowing for efficient parsing, storage, and manipulation of protein data across various file formats. The module facilitates complex protein study workflows and integrates seamlessly with other components of the ProtFlow package.
Detailed Description
The poses module offers a robust class, Poses, designed to encapsulate the functionality necessary to manage protein data. It supports various operations such as setting up work directories, parsing protein data, and integrating outputs from different computational processes. The module ensures that the results are organized and accessible for further analysis within the ProtFlow ecosystem.
Key Features
Parsing Protein Data: Supports reading protein data from various file formats like JSON, CSV, Pickle, Feather, and Parquet.
Data Storage and Retrieval: Allows storing and retrieving protein data in multiple formats, facilitating easy data management.
Integration with ProtFlow: Seamlessly integrates with ProtFlow’s job management components, enhancing its utility in distributed computing environments.
Advanced Data Manipulation: Provides functionalities to merge and prefix data from various sources, making it easier to handle complex datasets.
Flexible and Customizable: Users can customize the data handling processes through various parameters, enabling tailored data management solutions.
Usage
To use this module, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:
from protflow.poses import Poses
# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')
# Further operations using poses_instance
poses_instance.save_scores('path/to/save/scores')
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')
Examples
Here is an example of how to initialize and use the Poses class for managing protein data:
from protflow.poses import Poses
# Create an instance of the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')
# Perform various operations using the instance
poses_instance.set_work_dir('new/work/dir')
poses_instance.save_scores('path/to/save/scores', out_format='csv')
filtered_poses = poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>')
Further Details
Edge Cases: The module handles various edge cases such as empty pose lists and the need to overwrite previous results. It includes robust error handling and logging for easier debugging and verification.
Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.
Integration: The module integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
This module is intended for researchers and developers who need to manage protein data within their computational workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Version
0.1.0
- class protflow.poses.Poses(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]
Bases: object
Poses Class
The Poses class within the ProtFlow package is designed for handling protein data, enabling the parsing, storage, and manipulation of protein data represented as Pandas DataFrames. This class facilitates the management of complex protein study workflows and integrates seamlessly with other components of the ProtFlow framework.
Detailed Description
The Poses class encapsulates the functionality necessary for comprehensive management of protein data. It supports various operations, including setting up work directories, parsing protein data from different sources, integrating outputs from different runners, and handling protein data in multiple file formats. This class is essential for users looking to streamline their protein data management within computational workflows.
Key Features
Work Directory Setup: Easily sets up and manages work directories for storing intermediate and final results.
Data Parsing: Parses protein data from various sources and formats, including JSON, CSV, Pickle, Feather, and Parquet.
Data Storage and Retrieval: Stores and retrieves protein data in multiple file formats, ensuring flexibility in data management.
Job Management Integration: Integrates with ProtFlow’s job management components, facilitating the handling of protein data in distributed computing environments.
Advanced Data Manipulation: Supports operations like merging, prefixing, and duplicating data, providing robust data manipulation capabilities.
Filtering and Scoring: Offers methods to filter protein data based on various criteria and calculate composite scores for better data analysis.
Pose Handling: Manages protein poses, including loading, saving, and converting between different formats (e.g., PDB to FASTA).
Usage
To use this class, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:
from protflow.poses import Poses
# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')
# Set up the work directory
poses_instance.set_work_dir('path/to/new_work_dir')
# Parse and manipulate poses
poses_instance.set_poses(poses=my_protein_data)
poses_instance.save_scores('path/to/save/scores', out_format='csv')
# Filter poses
filtered_poses = poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')
# Calculate a composite score
poses_instance.calculate_composite_score(name='composite_score', scoreterms=['score1', 'score2'], weights=[0.5, 0.5], plot=True)
Further Details
Edge Cases: The class handles various edge cases, such as empty pose lists, the need to overwrite previous results, and handling multiline FASTA inputs.
Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.
Integration: The class integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.
Error Handling: Includes robust error handling and logging for easier debugging and verification of data processing steps.
- df
A DataFrame to store protein data.
- Type:
pd.DataFrame
- work_dir
The working directory for storing data and results.
- Type:
str
- storage_format
The format for storing protein data (e.g., ‘json’, ‘csv’).
- Type:
str
- default_jobstarter
The default job starter for managing jobs.
- Type:
JobStarter
Notes
This class is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Version
0.1.0
- __init__(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]
Initializes the Poses class with optional parameters for poses, working directory, storage format, glob suffix, and job starter.
- Parameters:
poses (list, optional) – A list of paths to the protein data files to be managed. If not provided, an empty DataFrame is initialized.
work_dir (str, optional) – The working directory where intermediate and final results will be stored. If not provided, the current directory is used.
storage_format (str, optional) – The format used for storing protein data (default is ‘json’). Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.
glob_suffix (str, optional) – A suffix used for globbing multiple files. This allows for batch processing of files matching the given pattern.
jobstarter (JobStarter, optional) – An instance of the JobStarter class used to manage job submissions. The default is an instance of SbatchArrayJobstarter from the jobstarters module.
- df
A DataFrame to store protein data.
- Type:
pd.DataFrame
- default_jobstarter
The default job starter for managing jobs.
- Type:
JobStarter
Notes
This method initializes the Poses class and sets up various attributes required for managing protein data. It prepares the environment for subsequent data manipulation and analysis operations.
Example
from protflow.poses import Poses
# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')
- calculate_composite_score(name, scoreterms, weights, plot=False, scale_output=False)[source]
Calculates a composite score from specified score columns, applying weights and normalization, and optionally generates a plot.
- Parameters:
name (str) – The name of the new composite score column to be created.
scoreterms (list[str]) – The list of score columns to be included in the composite score.
weights (list[float]) – The list of weights corresponding to each score column.
plot (bool, optional) – If True, generates a plot of the composite score and the individual score terms (default is False).
scale_output (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).
- Returns:
The updated Poses instance with the new composite score column.
- Return type:
Poses
- Raises:
ValueError – If the number of scoreterms and weights do not match.
TypeError – If any score column contains non-numeric values.
Further Details
This method calculates a composite score from multiple score columns by applying the specified weights and normalizing the columns. The normalization process involves subtracting the median and dividing by the standard deviation for each score column. Optionally, the composite score can be scaled to a range between 0 and 1.
The method ensures that each score column contains numeric values and applies the normalization process as follows:
1. Calculate the median and standard deviation of each score column.
2. Normalize the column by subtracting the median and dividing by the standard deviation.
3. Optionally scale the normalized values to a range between 0 and 1.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate a composite score
poses_instance.calculate_composite_score(
    name='composite_score',
    scoreterms=['score1', 'score2'],
    weights=[0.5, 0.5],
    plot=True,
    scale_output=True
)
Notes
The method ensures that the number of scoreterms and weights match.
Normalization helps in making the scores comparable by removing scale differences.
Generates a violin plot if the plot parameter is set to True, showing the distribution of the composite score and individual score terms.
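The normalization and weighting steps can be illustrated on plain Python lists. This is a sketch under assumptions, not the ProtFlow implementation: composite_score_sketch is a hypothetical name, the sample standard deviation is used, the composite is taken to be the weighted sum of the normalized columns, and the optional scaling is a min-max rescaling.

```python
from statistics import median, stdev


def normalize(values):
    """Subtract the median and divide by the (sample) standard deviation."""
    center, spread = median(values), stdev(values)
    return [(value - center) / spread for value in values]


def composite_score_sketch(columns, weights, scale_output=False):
    """Combine score columns into one composite score per row.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    if len(columns) != len(weights):
        raise ValueError("Number of scoreterms and weights must match.")
    normalized = [normalize(column) for column in columns]
    n_rows = len(columns[0])
    # Assumption: the composite is the weighted sum of the normalized columns.
    scores = [
        sum(weight * column[row] for weight, column in zip(weights, normalized))
        for row in range(n_rows)
    ]
    if scale_output:
        # Assumption: min-max scaling maps the scores onto the range 0..1.
        low, high = min(scores), max(scores)
        scores = [(score - low) / (high - low) for score in scores]
    return scores


score1 = [1.0, 2.0, 3.0]
score2 = [10.0, 20.0, 30.0]
raw = composite_score_sketch([score1, score2], [0.5, 0.5])
scaled = composite_score_sketch([score1, score2], [0.5, 0.5], scale_output=True)
```

Although score2 is ten times larger than score1, both columns normalize to the same values, which is the scale-removal effect described in the Notes.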
- calculate_max_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]
Calculate the maximum value of the selected score column. If remove_layers is set, calculates the maximum value over poses grouped by the description column with the set number of index layers removed.
- Parameters:
name (str) – The name of the new column where the maximum values will be stored.
score_col (str) – The name of the column from which to calculate the maximum value.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.
- Returns:
The instance of the class with the maximum values added to the DataFrame.
- Return type:
self
- Raises:
TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate the maximum values
poses_instance.calculate_max_score(
    name='max_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
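Grouping by description with index layers removed can be sketched as below. The sketch rests on assumptions not stated in the text: that a layer is a sep-delimited suffix of the pose description, that the group maximum is written back to every row of the group, and that without remove_layers the maximum is taken over all rows. The names strip_layers and grouped_max are hypothetical.

```python
from collections import defaultdict


def strip_layers(description, remove_layers, sep="_"):
    """Drop the last remove_layers sep-delimited parts of a description."""
    parts = description.split(sep)
    return sep.join(parts[:len(parts) - remove_layers])


def grouped_max(descriptions, scores, remove_layers=None, sep="_"):
    """Return, for each row, the maximum score within its group.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    if remove_layers is None:
        # No grouping: every row gets the overall maximum.
        return [max(scores)] * len(scores)
    if not isinstance(remove_layers, int):
        raise TypeError("remove_layers must be an integer.")

    keys = [strip_layers(desc, remove_layers, sep) for desc in descriptions]
    groups = defaultdict(list)
    for key, score in zip(keys, scores):
        groups[key].append(score)
    # Broadcast each group's maximum back to the rows of that group.
    return [max(groups[key]) for key in keys]


descriptions = ["a_0001_0001", "a_0001_0002", "a_0002_0001"]
scores = [1.0, 3.0, 2.0]
per_group = grouped_max(descriptions, scores, remove_layers=1)
overall = grouped_max(descriptions, scores)
```

The same grouping idea would apply to the mean, median, minimum, and standard deviation variants by swapping the aggregation function.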
- calculate_mean_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]
Calculate the mean score of the selected score column. If remove_layers is set, calculates mean scores over poses grouped by the description column with the set number of index layers removed.
- Parameters:
name (str) – The name of the new column where the mean scores will be stored.
score_col (str) – The name of the column from which to calculate the mean scores.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.
- Returns:
The instance of the class with the mean scores added to the DataFrame.
- Return type:
self
- Raises:
TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate the mean score
poses_instance.calculate_mean_score(
    name='mean_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
- calculate_median_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]
Calculate the median score of the selected score column. If remove_layers is set, calculates median scores over poses grouped by the description column with the set number of index layers removed.
- Parameters:
name (str) – The name of the new column where the median scores will be stored.
score_col (str) – The name of the column from which to calculate the median scores.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.
- Returns:
The instance of the class with the median scores added to the DataFrame.
- Return type:
self
- Raises:
TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate the median score
poses_instance.calculate_median_score(
    name='median_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
- calculate_min_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]
Calculate the minimum value of the selected score column. If remove_layers is set, calculates the minimum value over poses grouped by the description column with the set number of index layers removed.
- Parameters:
name (str) – The name of the new column where the minimum values will be stored.
score_col (str) – The name of the column from which to calculate the minimum value.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.
- Returns:
The instance of the class with the minimum values added to the DataFrame.
- Return type:
self
- Raises:
TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate the minimum values
poses_instance.calculate_min_score(
    name='min_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
- calculate_std_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]
Calculate the standard deviation of the selected score column. If remove_layers is set, calculates standard deviations over poses grouped by the description column with the set number of index layers removed.
- Parameters:
name (str) – The name of the new column where the standard deviations will be stored.
score_col (str) – The name of the column from which to calculate the standard deviation.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.
- Returns:
The instance of the class with the standard deviations added to the DataFrame.
- Return type:
self
- Raises:
TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.
Example
from protflow.poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Calculate the standard deviation
poses_instance.calculate_std_score(
    name='std_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
- change_poses_dir(poses_dir, copy=False, overwrite=False)[source]
Changes the directory of the stored poses, with options to copy or overwrite existing poses.
- Parameters:
- Returns:
The updated Poses instance with poses located in the new directory.
- Return type:
Poses
Further Details
This method updates the paths of the stored poses to a new directory. If the copy parameter is set to True, the poses are copied to the new directory. The overwrite parameter controls whether existing files in the new directory are overwritten.
Example
from protflow.poses import Poses
# Initialize the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')
# Change the directory of the poses
poses_instance.change_poses_dir('path/to/new_poses_dir', copy=True, overwrite=True)
Notes
If copy is set to False, the method only updates the paths in the DataFrame without moving the files.
Raises a ValueError if the new directory does not exist or if the poses do not exist in the specified directory (when copy is False).
Ensures the integrity of the poses by verifying their existence in the new directory.
- check_poses_df_integrity(df)[source]
Checks the integrity of the poses DataFrame, ensuring it contains necessary columns.
- Parameters:
df (pd.DataFrame) – The DataFrame to be checked for integrity.
- Returns:
The validated poses DataFrame.
- Return type:
pd.DataFrame
- Raises:
KeyError – If the DataFrame does not contain the mandatory columns ‘input_poses’, ‘poses’, and ‘poses_description’.
Further Details
This method verifies that the poses DataFrame contains the necessary columns required for proper functioning. It ensures that the DataFrame has ‘input_poses’, ‘poses’, and ‘poses_description’ columns, which are essential for various operations.
Example
from poses import Poses
import pandas as pd
# Initialize the Poses class
poses_instance = Poses()
# Create a sample DataFrame
sample_df = pd.DataFrame({
    'input_poses': ['path/to/pose1.pdb'],
    'poses': ['path/to/pose1.pdb'],
    'poses_description': ['pose1']
})
# Check the integrity of the DataFrame
validated_df = poses_instance.check_poses_df_integrity(sample_df)
Notes
The method raises a KeyError if any of the mandatory columns are missing.
Ensures that the DataFrame is properly structured for further data manipulation and analysis.
- check_prefix(prefix)[source]
Checks if the given prefix is already used in the poses DataFrame.
- Parameters:
prefix (str) – The prefix to be checked in the poses DataFrame.
- Raises:
KeyError – If the prefix is already used in the poses DataFrame.
- Return type:
None
Further Details
This method verifies whether the specified prefix is already in use within the poses DataFrame. It is useful for ensuring that new prefixes do not conflict with existing ones, maintaining data integrity.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Check if a prefix is already used
poses_instance.check_prefix('new_prefix')
Notes
The method raises a KeyError if the prefix is found in the DataFrame, indicating a conflict.
Ensures that new prefixes are unique and can be safely used for new columns or attributes.
- convert_pdb_to_fasta(prefix, update_poses=False, chain_sep=':')[source]
Converts PDB pose files to FASTA format and optionally updates the poses. Paths to fasta location are saved in poses dataframe under column <prefix>_fasta_location.
- Parameters:
- Raises:
RuntimeError – If the poses are not of type PDB.
- Return type:
None
Further Details
This method converts PDB pose files to FASTA format and stores them in a directory named with the given prefix. It can also update the poses DataFrame to use the new FASTA files if specified.
Example
from poses import Poses
# Initialize the Poses class with some PDB poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Convert the PDB files to FASTA format
poses_instance.convert_pdb_to_fasta(prefix='converted', update_poses=True)
Notes
The method checks that the poses are of type PDB before conversion.
Creates a new directory within the working directory to store the FASTA files.
Logs the conversion process and verifies the creation of FASTA files.
- convert_resselection_cols(resselection_col='import_resselection_cols')[source]
Converts per-row residue selection descriptors into ResidueSelection objects for the columns listed in a list-like selector column, mutating the DataFrame in place.
- Parameters:
resselection_col (str, optional) – Name of the column that, for each row, contains a list/tuple of target column names to convert (default is import_resselection_cols). When reading from CSV, this field may be a stringified list (e.g., ['a','b']), which will be parsed automatically.
- Returns:
This method modifies self.df in place and returns None. If resselection_col is not present in self.df, the method exits early.
- Return type:
None
- Raises:
KeyError – If a row’s value in resselection_col exists but is not a list or tuple (after optional string-to-list parsing).
ValueError – If parsing a stringified list with ast.literal_eval fails due to an invalid literal.
SyntaxError – If parsing a malformed stringified list triggers a syntax error.
TypeError – If constructing a ResidueSelection from a cell value raises a type error.
Further Details
For each row, the method reads the list of target column names from resselection_col and attempts to convert the corresponding cells:
If a target column listed for a row does not exist in self.df, a warning is logged and that column is skipped for the row.
If the target cell is already a ResidueSelection instance, it is left unchanged.
If the target cell is a str, it is converted via ResidueSelection(value) (useful for CSV imports).
If the target cell is a dict, it is converted via ResidueSelection(value, from_scorefile=True) (useful for JSON imports).
Empty selector lists are allowed and simply result in no action for that row.
Cells that are falsy (e.g., None, empty string, empty dict) are skipped.
Example
import pandas as pd
from protflow.poses import Poses
# Sample DataFrame where each row specifies which columns to convert
df = pd.DataFrame({
    "import_resselection_cols": [
        ["fixed_residues", "motif_residues"],  # row 0: convert two columns
        "['motif_residues']",                  # row 1: stringified list (from CSV)
        []                                     # row 2: nothing to convert
    ],
    "fixed_residues": [
        "A12,A34,A56",  # str -> ResidueSelection(str)
        None,           # skipped
        "A1"
    ],
    "motif_residues": [
        {"residues": [["A", 164], ["A", 165], ["A", 166], ["A", 167]]},  # dict -> ResidueSelection(dict, from_scorefile=True)
        "B5-B9",  # str -> ResidueSelection(str)
        {}
    ]
})
poses = Poses(df)
poses.convert_resselection_cols()  # mutates poses.df in place
# After this call:
# - poses.df.loc[0, "fixed_residues"] is a ResidueSelection instance
# - poses.df.loc[0, "motif_residues"] is a ResidueSelection instance (from dict)
# - poses.df.loc[1, "motif_residues"] is a ResidueSelection instance
# - Row 2 remains unchanged because its selector list is empty
Notes
Missing target columns are not fatal; a warning is logged and processing continues.
When importing from CSV, stringified lists in resselection_col are parsed with ast.literal_eval; malformed strings will raise ValueError or SyntaxError.
ResidueSelection construction is delegated; any errors it raises will propagate.
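The selector parsing described above can be sketched with the standard library alone. This is an illustrative approximation, not ProtFlow's code: parse_selector is a hypothetical helper that accepts a list, a tuple, or a stringified list and mirrors the documented error behaviour.

```python
import ast


def parse_selector(value):
    """Return a list of column names from a list, tuple, or stringified list."""
    if isinstance(value, str):
        # ast.literal_eval raises ValueError or SyntaxError on malformed input.
        value = ast.literal_eval(value)
    if not isinstance(value, (list, tuple)):
        raise KeyError(f"Selector must be a list or tuple, got {type(value).__name__}")
    return list(value)


from_csv = parse_selector("['fixed_residues', 'motif_residues']")
from_memory = parse_selector(("motif_residues",))
```

ast.literal_eval only evaluates Python literals, so a selector string read from a CSV file cannot execute arbitrary code.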
- determine_pose_type(pose_col=None)[source]
Determines the file types of the poses based on their extensions.
- Parameters:
pose_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).
- Returns:
A list of unique file extensions found in the pose file paths.
- Return type:
list
Further Details
This method extracts and identifies the file extensions of the pose file paths in the specified column. It returns a list of unique file extensions, which helps in understanding the types of files being managed.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Determine the pose file types
pose_types = poses_instance.determine_pose_type()
Notes
The method logs a warning if multiple file extensions are found.
If no file extensions are found, it logs a warning indicating the inability to determine file types.
Ensures that the returned list contains only unique file extensions.
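As a rough illustration of extension-based type detection, the sketch below collects the distinct extensions from a list of paths. unique_extensions is a hypothetical helper; whether ProtFlow keeps the leading dot or preserves order is an assumption here.

```python
import os


def unique_extensions(paths):
    """Return the distinct file extensions in first-seen order."""
    seen = []
    for path in paths:
        ext = os.path.splitext(path)[1]  # includes the leading dot, e.g. ".pdb"
        if ext and ext not in seen:
            seen.append(ext)
    return seen


pose_types = unique_extensions(["path/to/pose1.pdb", "path/to/pose2.pdb", "seqs/pose3.fa"])
```

A result with more than one entry corresponds to the mixed-type case in which the method logs a warning.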
- duplicate_poses(output_dir, n_duplicates, overwrite=False)[source]
Duplicates poses a specified number of times and saves them to an output directory.
- Parameters:
output_dir (str) – The directory where the duplicated poses will be saved.
n_duplicates (int) – The number of duplicates to create for each pose.
overwrite (bool)
- Return type:
None
Further Details
This method creates multiple copies of each pose file and saves them to the specified output directory. The duplicated files are named with an incremented index to distinguish them.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Duplicate the poses
poses_instance.duplicate_poses(output_dir='path/to/duplicates', n_duplicates=3)
Notes
The method creates the output directory if it does not exist.
Ensures that the duplicated files have unique names by appending an index.
Logs the duplication process and verifies the creation of duplicate files.
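The indexed naming can be pictured as follows. This is only a sketch: duplicate_names is a hypothetical helper, and the zero-padded four-digit suffix is an assumption borrowed from the _0001 style index layers mentioned for reindex_poses.

```python
import os


def duplicate_names(pose_path, output_dir, n_duplicates, sep="_"):
    """Build target paths for n_duplicates copies of one pose file."""
    stem, ext = os.path.splitext(os.path.basename(pose_path))
    return [
        os.path.join(output_dir, f"{stem}{sep}{index:04d}{ext}")
        for index in range(1, n_duplicates + 1)
    ]


names = duplicate_names("path/to/pose1.pdb", "path/to/duplicates", 3)
```

Each returned path could then be used as the destination of a file copy, for example with shutil.copy.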
- filter_poses_by_rank(n, score_col, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None)[source]
Filters poses based on their rank in a specified score column, with options to handle layers and generate plots.
- Parameters:
n (float) – The number of top-ranked poses to keep. If n < 1, it represents a fraction of the total poses.
score_col (str) – The column in the DataFrame containing the scores used for ranking.
group_col (str, optional) – Group dataframe by this column and filter individual groups.
remove_layers (int, optional) – The number of layers to remove from the pose descriptions before ranking. This helps in grouping similar poses.
layer_col (str, optional) – The column used for layer-based grouping of poses (default is “poses_description”).
sep (str, optional) – The separator used in the layer descriptions (default is “_”).
ascending (bool, optional) – If True, ranks poses in ascending order of scores; otherwise, in descending order (default is True).
prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.
plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).
plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.
overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).
storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).
- Returns:
The updated Poses instance with filtered poses.
- Return type:
Poses
Further Details
This method filters the poses DataFrame to retain only the top-ranked poses based on their scores. It supports fractional ranking, layer-based grouping, and optional plot generation for visualizing the filtering process. The filtered poses can be saved to a file with a specified prefix and storage format.
Example
from poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Filter poses by rank
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='top_poses', plot=True)
Notes
The method creates a filtered poses file and an optional plot in the specified working directory.
Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.
Logs the filtering process, including any errors or warnings related to the ranking criteria.
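The two meanings of n (an absolute count, or a fraction when n < 1) can be sketched in plain Python. top_n is a hypothetical helper, and rounding the fractional count up with math.ceil is an assumption, since the documentation does not state how fractions are rounded.

```python
import math


def top_n(rows, n, score_col, ascending=True):
    """Keep the n best rows; if n < 1, keep that fraction of the rows."""
    ranked = sorted(rows, key=lambda row: row[score_col], reverse=not ascending)
    keep = math.ceil(len(ranked) * n) if n < 1 else int(n)
    return ranked[:keep]


rows = [
    {"poses_description": f"pose_{i}", "score": score}
    for i, score in enumerate([5.0, 1.0, 4.0, 2.0, 3.0])
]
best_two = top_n(rows, 2, "score")                             # two lowest scores
best_fraction = top_n(rows, 0.4, "score", ascending=False)     # top 40 %, highest first
```

With ascending=True the lowest scores rank first, which suits energy-like metrics; ascending=False suits metrics where higher is better.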
- filter_poses_by_value(score_col, value, operator, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None, fail_on_empty=True)[source]
Filters poses based on a specified value in a score column, with options to generate plots.
- Parameters:
score_col (str) – The column in the DataFrame containing the scores used for filtering.
value (float or int) – The value used as the threshold for filtering poses.
operator (str) – The comparison operator used for filtering (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’, ‘!=’).
prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.
plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).
plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.
overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).
storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).
fail_on_empty (bool)
- Returns:
The updated Poses instance with filtered poses.
- Return type:
- Raises:
ValueError – If all poses are removed based on the filtering criteria.
Further Details
This method filters the poses DataFrame based on a specified value in a score column, using the provided comparison operator. It supports optional plot generation for visualizing the filtering process and allows saving the filtered poses to a file with a specified prefix and storage format.
Example
from poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Filter poses by value
poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>', prefix='filtered_poses', plot=True)
Notes
The method creates a filtered poses file and an optional plot in the specified working directory.
Ensures that the DataFrame is properly filtered based on the provided criteria.
Logs the filtering process, including any errors or warnings related to the filtering criteria.
Raises a ValueError if the filtering criteria remove all poses, ensuring that the Poses instance retains valid data.
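One common way to support string operators like those listed above is a lookup table into the standard operator module. The sketch below assumes that approach; COMPARATORS and filter_by_value are hypothetical names, and tying the ValueError to fail_on_empty is an assumption based on the parameter name.

```python
import operator

# Map the documented operator strings to comparison functions.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}


def filter_by_value(rows, score_col, value, op, fail_on_empty=True):
    """Keep rows whose score satisfies `score <op> value`."""
    if op not in COMPARATORS:
        raise KeyError(f"Unsupported operator: {op}")
    kept = [row for row in rows if COMPARATORS[op](row[score_col], value)]
    if fail_on_empty and not kept:
        raise ValueError("Filter removed all poses.")
    return kept


rows = [{"score": 0.2}, {"score": 0.7}, {"score": 0.9}]
kept = filter_by_value(rows, "score", 0.5, ">")
```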
- get_pose(pose_description, all_models=False)[source]
Retrieves a pose structure based on its description.
- Parameters:
- Returns:
The Bio.PDB Model or Structure object corresponding to the specified pose description.
- Return type:
Bio.PDB.Model.Model or Bio.PDB.Structure.Structure
- Raises:
KeyError – If the pose description is not found in the poses DataFrame.
Further Details
This method locates the pose file based on its description and loads it as a Bio.PDB Structure object. It is useful for accessing specific pose structures for further analysis or manipulation.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Retrieve a specific pose structure
pose_structure = poses_instance.get_pose('pose1')
Notes
The method uses the ‘poses_description’ column to locate the specified pose.
Ensures that the returned pose is loaded as a Bio.PDB Structure object for further processing.
- load_poses(poses_path)[source]
Loads poses from a specified file and updates the Poses instance.
- Parameters:
poses_path (str) – The path to the file containing the poses to be loaded.
- Returns:
The updated Poses instance with poses loaded from the specified file.
- Return type:
Poses
Further Details
This method reads a file containing poses and updates the Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Load poses from a file
poses_instance.load_poses('path/to/poses.json')
Notes
The method supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.
- parse_descriptions(poses=None)[source]
Parses descriptions from the provided pose file paths.
- Parameters:
poses (list, optional) – A list of pose file paths from which descriptions are extracted.
- Returns:
A list of descriptions parsed from the pose file paths.
- Return type:
list
Further Details
This method extracts descriptions from the provided list of pose file paths. Descriptions are derived from the file names by stripping the directory path and file extension.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Parse descriptions from pose file paths
descriptions = poses_instance.parse_descriptions(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
Notes
This method is useful for generating a list of concise descriptions based on file names.
Ensures that descriptions are derived in a consistent format, suitable for use in data management and analysis.
- parse_poses(poses=None, glob_suffix=None)[source]
Parses the input poses, which can be provided as a list or a directory with a glob suffix.
- Parameters:
poses (Union[list, str], optional) – A list of file paths or a directory containing the protein data files. If not provided, an empty list is returned.
glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.
- Returns:
A list of parsed pose file paths.
- Return type:
list
Further Details
This method handles various input types for parsing poses. It can parse a list of file paths directly or glob files in a specified directory using a suffix. The method ensures that all specified files exist and raises appropriate errors if they do not.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Parse poses from a directory with a glob suffix
parsed_poses = poses_instance.parse_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')
Notes
Raises FileNotFoundError if any specified files do not exist.
Supports both single file and multiple file (via globbing) inputs.
Ensures that the returned list contains valid file paths.
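The input handling described for parse_poses (a list, a single path, or a directory plus glob suffix) can be approximated with the standard library. collect_poses is a hypothetical stand-in written for illustration; sorting the globbed paths is an added assumption.

```python
import glob
import os
import tempfile


def collect_poses(poses=None, glob_suffix=None):
    """Resolve a list, a single path, or a directory plus glob pattern to file paths."""
    if poses is None:
        return []
    if isinstance(poses, str):
        if glob_suffix is not None:
            paths = sorted(glob.glob(os.path.join(poses, glob_suffix)))
        else:
            paths = [poses]
    else:
        paths = list(poses)
    missing = [path for path in paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(f"Pose files not found: {missing}")
    return paths


# Demonstrate on a throwaway directory containing two PDB files and one other file.
with tempfile.TemporaryDirectory() as pose_dir:
    for name in ("pose1.pdb", "pose2.pdb", "notes.txt"):
        open(os.path.join(pose_dir, name), "w").close()
    found = [os.path.basename(p) for p in collect_poses(pose_dir, "*.pdb")]
```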
- poses_list()[source]
Returns a list of pose file paths from the DataFrame.
- Returns:
A list of pose file paths.
- Return type:
list
Further Details
This method extracts the pose file paths from the 'poses' column of the DataFrame and returns them as a list. It provides a convenient way to access the stored pose file paths.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Get the list of pose file paths
pose_paths = poses_instance.poses_list()
Notes
The method assumes that the ‘poses’ column exists in the DataFrame.
Provides a simple way to retrieve all pose file paths managed by the Poses instance.
- reindex_poses(prefix, group_col=None, remove_layers=None, force_reindex=False, sep='_', overwrite=False)[source]
Removes index layers from poses. Saves reindexed poses to an output directory.
- Parameters:
prefix (str) – The directory where the reindexed poses will be saved and the prefix for the DataFrame columns containing the original paths and descriptions.
group_col (str, optional) – The poses dataframe column on which to group to create new descriptions. Must be a column in ‘poses_description’ or ‘poses’ format (e.g. from a previous state, before runners appended index layers).
remove_layers (int, optional) – The number of index layers to remove.
force_reindex (bool, optional) – Add a new index layer to all poses.
sep (str, optional) – The separator used to split the description column into layers.
overwrite (bool)
- Return type:
None
Further Details
This method removes index layers from poses (_0001, _0002, etc). If a group column is provided, the poses are assigned names according to the group. If remove_layers is above 0, the method subtracts the set number of layers from the description column and groups the poses accordingly. If force_reindex is True, it adds one index layer to all poses.
Notes
The method creates the output directory if it does not exist.
Raises a KeyError if both group_col and remove_layers are set.
Raises a RuntimeError if multiple poses with identical description after index layer removal are found and force_reindex is False.
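The interaction between layer removal, name collisions, and force_reindex can be sketched as follows. This is an approximation under stated assumptions rather than ProtFlow's code: reindex is a hypothetical helper, and the four-digit index format is inferred from the _0001 example above.

```python
from collections import Counter


def reindex(descriptions, remove_layers=None, sep="_", force_reindex=False):
    """Strip index layers; optionally append a fresh per-name index layer."""
    stripped = [
        d.rsplit(sep, remove_layers)[0] if remove_layers else d
        for d in descriptions
    ]
    has_duplicates = len(set(stripped)) < len(stripped)
    if has_duplicates and not force_reindex:
        raise RuntimeError("Identical descriptions after layer removal; set force_reindex.")
    if not force_reindex:
        return stripped
    # Number each occurrence of a stripped name, starting at 0001.
    counts = Counter()
    reindexed = []
    for name in stripped:
        counts[name] += 1
        reindexed.append(f"{name}{sep}{counts[name]:04d}")
    return reindexed


unique_case = reindex(["a_0001", "b_0001"], remove_layers=1)
forced_case = reindex(
    ["a_0001_0001", "a_0001_0002", "b_0001_0001"], remove_layers=2, force_reindex=True
)
```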
- reset_poses(new_poses_col='input_poses', force_reset_df=False)[source]
Resets the poses DataFrame to the original input poses, with an option to force reset.
- Parameters:
new_poses_col (str, optional) – The column in the DataFrame containing the new pose file paths (default is ‘input_poses’).
force_reset_df (bool, optional) – If True, forces a reset of the DataFrame even if the number of new poses does not match the original (default is False).
Further Details
This method resets the poses DataFrame to use the original input poses. It handles multiline FASTA inputs and ensures that the DataFrame structure is preserved or reset based on the force_reset_df parameter.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Reset the poses to the original input poses
poses_instance.reset_poses()
Notes
The method ensures that the new poses are unique and properly formatted.
Raises a RuntimeError if the number of new poses does not match the original and force_reset_df is False.
Logs warnings and information about the reset process, ensuring data integrity.
- save_poses(out_path, poses_col='poses', overwrite=True)[source]
Saves the poses to a specified directory, with an option to overwrite existing files.
- Parameters:
out_path (str) – The directory where the poses will be saved.
poses_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).
overwrite (bool, optional) – If True, existing files in the target directory will be overwritten (default is True).
- Return type:
None
Further Details
This method saves the pose files to the specified directory. It copies the pose files from their current locations to the new directory, ensuring that the directory structure is maintained. The overwrite parameter controls whether existing files in the target directory are overwritten.
Example
from poses import Poses
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Save poses to a new directory
poses_instance.save_poses(out_path='path/to/new_poses_dir', overwrite=False)
Notes
The method ensures that the target directory exists, creating it if necessary.
If overwrite is set to False, the method skips saving poses that already exist in the target directory.
Logs the saving process, including any skipped files due to the overwrite setting.
- save_scores(out_path=None, out_format=None)[source]
Saves the scores DataFrame to a specified file path in the desired format.
- Parameters:
out_path (str, optional) – The file path where the scores will be saved. If not provided, the default scorefile path is used.
out_format (str, optional) – The format in which to save the scores. If not provided, the default storage format is used.
- Return type:
None
Further Details
This method saves the scores DataFrame to the specified file path in the desired format. It ensures that the file name conforms to the specified format by appending the correct file extension if necessary.
Example
from poses import Poses
# Initialize the Poses class with some scores
poses_instance = Poses()
# Save scores to a specific path in CSV format
poses_instance.save_scores(out_path='path/to/scores.csv', out_format='csv')
Notes
Supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
The method automatically appends the correct file extension if it is not already present in the out_path.
Ensures that the scores are saved in a format suitable for further analysis and processing.
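The extension handling can be pictured with a small helper. This sketch assumes the file extension is simply the format name (for example .csv for 'csv'); with_extension is a hypothetical function and the real mapping may differ for some formats.

```python
def with_extension(out_path, out_format):
    """Append '.<out_format>' unless the path already ends with it."""
    suffix = f".{out_format}"
    return out_path if out_path.endswith(suffix) else out_path + suffix


csv_path = with_extension("path/to/scores", "csv")
unchanged_path = with_extension("path/to/scores.csv", "csv")
```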
- set_jobstarter(jobstarter)[source]
Configures the job starter for managing job submissions.
- Parameters:
jobstarter (JobStarter) – An instance of the JobStarter class used to manage job submissions.
- Return type:
None
Further Details
This method sets the job starter for the Poses class, which is used to manage job submissions in distributed computing environments. It allows the user to specify a custom job starter for handling computational tasks.
Example
from poses import Poses
from protflow.jobstarters import CustomJobStarter
# Initialize the Poses class
poses_instance = Poses()
# Set a custom job starter
custom_jobstarter = CustomJobStarter()
poses_instance.set_jobstarter(custom_jobstarter)
Notes
The job starter must be an instance of the JobStarter class or a subclass thereof.
This method enables customization of job management to suit specific computational workflows.
- set_logger()[source]
Configures the logger for the Poses class.
Further Details
This method sets up the logging configuration for the Poses class. It creates a logger that outputs log messages to both the console and a log file in the working directory (if set). This aids in debugging and tracking the progress of data processing operations.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses(work_dir='path/to/work_dir')
# Set up the logger
poses_instance.set_logger()
Notes
The log file is named after the working directory and stored within it.
The logging level is set to INFO, and log messages include timestamps, logger names, log levels, and messages.
- Return type:
None
- set_motif(motif_col)[source]
Sets a motif column in the poses DataFrame for further analysis.
- Parameters:
motif_col (str) – The column in the DataFrame containing the motifs to be set.
- Raises:
- Return type:
None
Further Details
This method sets a column in the poses DataFrame to be used as motifs for further analysis. The motifs must be instances of the ResidueSelection class.
Example
from poses import Poses
from protflow.residues import ResidueSelection
# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
# Assume we have a column 'motifs' with ResidueSelection objects
poses_instance.set_motif('motifs')
Notes
The method ensures that the specified column exists and contains ResidueSelection objects.
Logs any errors encountered during the process for easier debugging and verification.
- set_poses(poses=None, glob_suffix=None)[source]
Sets the poses for the Poses instance, parsing the input if necessary.
- Parameters:
poses (Union[list, str, pd.DataFrame], optional) – A list of file paths, a directory containing the protein data files, or a DataFrame containing the poses. If not provided, an empty DataFrame is initialized.
glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.
- Return type:
None
Further Details
This method initializes the poses for the Poses instance. It can accept various input types, including a list of file paths, a directory for globbing files, or a DataFrame. The method ensures that the poses are correctly parsed and set up for further processing.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Set poses from a directory with a glob suffix
poses_instance.set_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')
# Set poses from a list of file paths
poses_instance.set_poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])
Notes
If a DataFrame is provided, it is directly used as the poses DataFrame after integrity checks.
The method supports parsing multiline FASTA inputs and handles them appropriately.
Ensures that the poses DataFrame contains necessary columns for subsequent operations.
- set_scorefile(work_dir)[source]
Sets the scorefile path for storing protein scores.
- Parameters:
work_dir (str) – The working directory where the scorefile will be stored. If the work directory is not set, the scorefile is stored in the current directory.
- Return type:
None
Notes
This method configures the path for the scorefile based on the provided working directory. If no working directory is specified, the scorefile is stored in the current directory.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Set the scorefile path
poses_instance.set_scorefile(work_dir='path/to/work_dir')
- set_storage_format(storage_format)[source]
Sets the storage format for storing protein data.
- Parameters:
storage_format (str) – The format used for storing protein data. Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.
- Raises:
KeyError – If the provided storage format is not supported.
- Return type:
None
Notes
This method configures the storage format for protein data. It ensures that the format is one of the supported formats and raises an error if the format is invalid.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Set the storage format to 'csv'
poses_instance.set_storage_format('csv')
- set_work_dir(work_dir, set_scorefile=True)[source]
Sets up and configures the working directory for storing data and results.
- Parameters:
- Return type:
None
Further Details
This method creates the necessary subdirectories within the specified working directory to organize score files, filter results, and plots. It ensures that the required directory structure is in place for subsequent data management operations.
Example
from poses import Poses
# Initialize the Poses class
poses_instance = Poses()
# Set the working directory
poses_instance.set_work_dir('path/to/new_work_dir')
Notes
The method will log the creation of directories if they do not already exist.
If set_scorefile is set to True, the scorefile path will be configured within the working directory.
- split_multiline_fasta(path, encoding='UTF-8')[source]
Splits a multiline FASTA file into individual FASTA files, each containing a single sequence.
- Parameters:
- Returns:
A list of file paths to the individual FASTA files.
- Return type:
list[str]
Further Details
This method reads a multiline FASTA file and splits it into individual FASTA files, each containing a single sequence. The individual FASTA files are stored in a subdirectory named 'input_fastas_split' within the working directory.
Example
from poses import Poses
# Initialize the Poses class with a working directory
poses_instance = Poses(work_dir='path/to/work_dir')
# Split a multiline FASTA file
individual_fasta_paths = poses_instance.split_multiline_fasta('path/to/multiline.fasta')
Notes
The method creates a subdirectory named ‘input_fastas_split’ within the working directory to store the individual FASTA files.
The descriptions in the FASTA file are sanitized to replace special characters with underscores.
Raises an AttributeError if the working directory is not set.
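The splitting and sanitizing steps can be sketched with the standard library. This is illustrative only: split_fasta_text and write_split_fastas are hypothetical helpers, and both the set of characters replaced by underscores and the .fa extension are assumptions, since the documentation does not list them.

```python
import os
import re
import tempfile


def split_fasta_text(text):
    """Split multi-record FASTA text into {sanitized_description: sequence}."""
    records = {}
    for block in text.split(">")[1:]:
        header, *seq_lines = block.strip().splitlines()
        # Replace anything outside a conservative character set with "_".
        name = re.sub(r"[^A-Za-z0-9_.-]", "_", header.strip())
        records[name] = "".join(line.strip() for line in seq_lines)
    return records


def write_split_fastas(text, work_dir):
    """Write one FASTA file per record into <work_dir>/input_fastas_split."""
    out_dir = os.path.join(work_dir, "input_fastas_split")
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, sequence in split_fasta_text(text).items():
        path = os.path.join(out_dir, f"{name}.fa")
        with open(path, "w", encoding="UTF-8") as handle:
            handle.write(f">{name}\n{sequence}\n")
        paths.append(path)
    return paths


multiline = ">design 1|chainA\nMKT\nAYI\n>design-2\nGSS\n"
records = split_fasta_text(multiline)
with tempfile.TemporaryDirectory() as work_dir:
    written = [os.path.basename(p) for p in write_split_fastas(multiline, work_dir)]
```

Sanitizing the header before using it as a file name avoids characters such as spaces and pipes that are awkward in paths.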
- Parameters:
poses (list)
work_dir (str)
storage_format (str)
glob_suffix (str)
jobstarter (JobStarter)
- protflow.poses.class_in_df(df, cls, out_col)[source]
Return a copy of df with a column listing, for each row, the names of columns whose values are instances of a given class (or classes).
If no cells in the DataFrame match cls, the function returns a copy of df without adding out_col. Empty DataFrames are returned unchanged. Elementwise checks use pandas.DataFrame.map() (pandas ≥ 2.2).
- Parameters:
df (pandas.DataFrame) – Input DataFrame to inspect.
cls (type or tuple[type,]) – Class (or tuple of classes) to test against, as in isinstance(). Examples: dict or (dict, list).
out_col (str) – Name of the output column to add. Each entry will be a list[str] of column names whose values in that row are instances of cls. The column is only created if at least one match exists anywhere in df.
- Returns:
A copy of df. If any matches are found, the copy contains an added column out_col with per-row lists of matching column names. If no matches are found (or df is empty), the copy is returned unchanged.
- Return type:
pandas.DataFrame
Notes
This function does not mutate df; it returns a modified copy.
cls behaves exactly like the second argument to isinstance().
To convert the list results to a delimiter-separated string, you can post-process with: out[out_col] = out[out_col].apply('|'.join).
Examples
import pandas as pd
from protflow.poses import class_in_df
df = pd.DataFrame({
    'a': [1, {'x': 1}, 3],
    'b': [{'y': 2}, 5, [1, 2]],
    'c': ['hi', 'there', 'world'],
})
out = class_in_df(df, dict, 'resselector_cols')
- protflow.poses.col_in_df(df, column)[source]
Checks if the specified column(s) exist in the DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame to be checked.
column (str or list[str]) – The column name or list of column names to check for existence in the DataFrame.
- Raises:
KeyError – If any of the specified columns are not found in the DataFrame.
- Return type:
None
Further Details
This function checks whether the specified column or list of columns exist in the given DataFrame. It is useful for ensuring that the DataFrame contains the necessary columns before performing further operations.
Example
import pandas as pd
from protflow.poses import col_in_df

# Create a sample DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6]
})

# Check if a column exists
col_in_df(df, 'col1')

# Check if multiple columns exist
col_in_df(df, ['col1', 'col2'])
Notes
The function raises a KeyError if any of the specified columns are not found in the DataFrame.
Ensures that the DataFrame contains the necessary columns for subsequent operations.
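The check described above amounts to a membership test against df.columns. A minimal sketch, assuming a hypothetical helper named require_columns (it is not part of ProtFlow and only mirrors the documented behavior):

```python
import pandas as pd


def require_columns(df: pd.DataFrame, column) -> None:
    """Hypothetical stand-in for col_in_df: raise KeyError on missing columns."""
    # Accept either a single column name or a list of names.
    columns = [column] if isinstance(column, str) else list(column)
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")


df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
require_columns(df, ["col1", "col2"])  # passes silently and returns None

try:
    require_columns(df, "col3")
    missing_detected = False
except KeyError:
    missing_detected = True
```

Failing early like this keeps a missing column from surfacing later as a less readable pandas error.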
- protflow.poses.combine_dataframe_score_columns(df, scoreterms, weights, scale=False)[source]
Combines multiple score columns in a DataFrame into a single composite score, applying weights and normalization.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the score columns.
scoreterms (list[str]) – The list of score columns to be combined.
weights (list[float]) – The list of weights corresponding to each score column.
scale (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).
- Returns:
The composite score as a pandas Series.
- Return type:
pd.Series
- Raises:
ValueError – If the number of scoreterms and weights do not match.
TypeError – If any score column contains non-numeric values.
Further Details
This function combines multiple score columns in a DataFrame into a single composite score. Each score column is normalized by subtracting the median and dividing by the standard deviation. The normalized scores are then weighted according to the specified weights and summed to create the composite score. Optionally, the composite score can be scaled to a range between 0 and 1.
Example
import pandas as pd
from protflow.poses import combine_dataframe_score_columns

# Create a sample DataFrame
data = {
    'score1': [10, 20, 30, 40, 50],
    'score2': [15, 25, 35, 45, 55]
}
df = pd.DataFrame(data)

# Combine score columns into a composite score
composite_score = combine_dataframe_score_columns(df, scoreterms=['score1', 'score2'], weights=[0.5, 0.5], scale=True)
Notes
Raises a ValueError if the number of scoreterms and weights do not match.
Normalization makes the scores comparable by removing scale differences.
The optional scaling step ensures that the composite score remains within a standardized range.
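The arithmetic described under Further Details can be sketched as follows. This is an illustrative re-implementation, not ProtFlow's code: the helper names normalize, min_max and composite_score are hypothetical, and the handling of constant columns and the use of the sample standard deviation (the pandas default) are assumptions.

```python
import pandas as pd


def normalize(ser: pd.Series) -> pd.Series:
    """Subtract the median and divide by the standard deviation."""
    std = ser.std()  # pandas default is the sample standard deviation (ddof=1)
    if pd.isna(std) or std == 0:
        # Assumption: a constant column carries no ranking signal, so use zeros.
        return pd.Series(0.0, index=ser.index)
    return (ser - ser.median()) / std


def min_max(ser: pd.Series) -> pd.Series:
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    span = ser.max() - ser.min()
    if span == 0:
        return pd.Series(0.0, index=ser.index)
    return (ser - ser.min()) / span


def composite_score(df, scoreterms, weights, scale=False):
    """Weighted sum of normalized score columns (hypothetical sketch)."""
    if len(scoreterms) != len(weights):
        raise ValueError("scoreterms and weights must have the same length")
    total = pd.Series(0.0, index=df.index)
    for term, weight in zip(scoreterms, weights):
        total = total + weight * normalize(df[term])
    return min_max(total) if scale else total


df = pd.DataFrame({"score1": [10, 20, 30, 40, 50], "score2": [15, 25, 35, 45, 55]})
composite = composite_score(df, ["score1", "score2"], [0.5, 0.5], scale=True)
print(composite.round(2).tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because both columns are evenly spaced, their normalized values coincide and the scaled composite is evenly spaced between 0 and 1.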
- protflow.poses.filter_dataframe_by_rank(df, col, n, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True)[source]
Filters the DataFrame to retain only the top-ranked rows based on a specified column.
- Parameters:
df (pd.DataFrame) – The DataFrame to be filtered.
col (str) – The column in the DataFrame used for ranking.
n (Union[float, int]) – The number of top-ranked rows to retain. If n < 1, it represents a fraction of the total rows.
group_col (str, optional) – Group dataframe by this column, then filter individual groups.
remove_layers (int, optional) – The number of layers to remove from the column values before ranking. This helps in grouping similar rows.
layer_col (str, optional) – The column used for layer-based grouping of rows (default is "poses_description").
sep (str, optional) – The separator used in the layer descriptions (default is "_").
ascending (bool, optional) – If True, ranks rows in ascending order; otherwise, in descending order (default is True).
- Returns:
pd.DataFrame – The filtered DataFrame containing only the top-ranked rows.
- Return type:
pd.DataFrame
Further Details
This function filters the DataFrame to retain only the top-ranked rows based on the values in a specified column. It supports fractional ranking, layer-based grouping, and sorting in ascending or descending order. The function also allows for removing layers from column values before ranking to handle grouped data.
Example
import pandas as pd
from protflow.poses import filter_dataframe_by_rank

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain the top 3 rows based on the score column
filtered_df = filter_dataframe_by_rank(df, col='score', n=3)
Notes
The function raises a KeyError if the specified column is not found in the DataFrame.
Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.
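Fractional ranking and layer-based grouping can be sketched roughly as below. The helpers strip_layers and top_n are hypothetical. Two things are assumptions rather than documented behavior: that a "layer" is a trailing separator-delimited suffix of the description, and that a fractional n is truncated to a row count.

```python
import pandas as pd


def strip_layers(description: str, layers: int, sep: str = "_") -> str:
    """Drop the last `layers` sep-delimited suffixes (assumed meaning of a layer)."""
    return sep.join(description.split(sep)[:-layers])


def top_n(df, col, n, group_col=None, ascending=True):
    """Keep the n best rows, or the best fraction of rows if n < 1."""
    def pick(frame):
        # Assumption: a fractional n is converted to a row count by truncation.
        count = int(len(frame) * n) if n < 1 else int(n)
        return frame.sort_values(col, ascending=ascending).head(count)

    if group_col is None:
        return pick(df)
    # Filter every group separately, then stitch the groups back together.
    return pd.concat([pick(group) for _, group in df.groupby(group_col)])


df = pd.DataFrame({
    "poses_description": ["a_0001_0001", "a_0001_0002", "b_0001_0001", "b_0001_0002"],
    "score": [3.0, 1.0, 4.0, 2.0],
})
# Removing one layer maps each pose back to its parent description.
df["parent"] = df["poses_description"].map(lambda d: strip_layers(d, 1))

best_per_parent = top_n(df, "score", 1, group_col="parent")
best_half = top_n(df, "score", 0.5)
```

Here best_per_parent keeps the lowest-scoring pose for each parent, while best_half keeps the best 50 % of all rows.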
- protflow.poses.filter_dataframe_by_value(df, col, value, operator)[source]
Filters the DataFrame based on a specified value in a column using the provided comparison operator.
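- Parameters:
df (pd.DataFrame) – The DataFrame to be filtered.
col (str) – The column whose values are compared.
value – The value to compare against.
operator (str) – The comparison operator to apply, e.g. '>'.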
- Returns:
pd.DataFrame – The filtered DataFrame containing only the rows that meet the filtering criteria.
- Return type:
pd.DataFrame
Further Details
This function filters the DataFrame based on a specified value in a column, using the provided comparison operator. It supports various comparison operators such as greater than, less than, equal to, and not equal to.
Example
import pandas as pd
from protflow.poses import filter_dataframe_by_value

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain rows where the score is greater than 30
filtered_df = filter_dataframe_by_value(df, col='score', value=30, operator='>')
Notes
The function raises a KeyError if the specified column is not found in the DataFrame.
Ensures that the DataFrame is properly filtered based on the provided criteria.
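One common way to implement operator-based filtering is a lookup table from operator strings to functions in Python's operator module. The sketch below uses a hypothetical filter_by_value helper. Only '>' is confirmed by the example above; the other accepted operator strings are assumptions.

```python
import operator as op

import pandas as pd

# Assumed spellings for the supported comparisons.
COMPARATORS = {
    ">": op.gt,
    ">=": op.ge,
    "<": op.lt,
    "<=": op.le,
    "==": op.eq,
    "!=": op.ne,
}


def filter_by_value(df: pd.DataFrame, col: str, value, operator: str) -> pd.DataFrame:
    """Keep the rows where `df[col] <operator> value` holds (hypothetical sketch)."""
    if col not in df.columns:
        raise KeyError(f"Column {col!r} not found in DataFrame")
    if operator not in COMPARATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    mask = COMPARATORS[operator](df[col], value)  # elementwise boolean mask
    return df[mask]


df = pd.DataFrame({
    "poses_description": ["pose1", "pose2", "pose3", "pose4", "pose5"],
    "score": [10, 20, 30, 40, 50],
})
filtered = filter_by_value(df, col="score", value=30, operator=">")
```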
- protflow.poses.get_format(path)[source]
Returns the appropriate pandas function to load a file based on its extension.
- Parameters:
path (str) – The path to the file whose format needs to be determined.
- Returns:
function – The pandas function corresponding to the file format (e.g., pd.read_json, pd.read_csv).
Further Details
This function determines the appropriate pandas function to use for loading a file based on its extension. It supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
Example
import pandas as pd
from protflow.poses import get_format

# Determine the format function for a JSON file
load_function = get_format('path/to/data.json')

# Use the function to load the data
df = load_function('path/to/data.json')
Notes
Raises a KeyError if the file format is not supported.
Ensures that the appropriate pandas function is returned based on the file extension.
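Extension-based dispatch of this kind is usually a dictionary lookup. In this sketch, the name loader_for and the exact extension spellings are assumptions; the pandas readers themselves are real functions.

```python
from pathlib import Path

import pandas as pd

# Assumed extension spellings; the real module may accept others.
LOADERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".pickle": pd.read_pickle,
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,
}


def loader_for(path: str):
    """Return the pandas reader for `path`; unsupported extensions raise KeyError."""
    return LOADERS[Path(path).suffix.lower()]


csv_loader = loader_for("results/scores.csv")
```

Returning the function instead of calling it lets the caller pass reader-specific keyword arguments when loading.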
- protflow.poses.load_poses(poses_path)[source]
Loads poses from a specified file and returns a Poses instance.
- Parameters:
poses_path (str) – The path to the file containing the poses to be loaded.
- Returns:
Poses – A Poses instance with poses loaded from the specified file.
- Return type:
Poses
Further Details
This function reads a file containing poses and returns a Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.
Example
from protflow.poses import Poses, load_poses

# Load poses from a file
poses_instance = load_poses('path/to/poses.json')
Notes
The function supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.
- protflow.poses.normalize_series(ser, scale=False)[source]
Normalizes a pandas Series by subtracting the median and dividing by the standard deviation, with an option to scale the values.
- Parameters:
ser (pd.Series) – The pandas Series to be normalized.
scale (bool, optional) – If True, scales the normalized values to a range between 0 and 1 (default is False).
- Returns:
pd.Series – The normalized (and optionally scaled) Series.
- Return type:
pd.Series
Further Details
This function normalizes a pandas Series by first subtracting the median and then dividing by the standard deviation. If the scale parameter is set to True, the normalized values are further scaled to a range between 0 and 1. This normalization process centers the data around zero and adjusts for variability, making the values comparable.
Example
import pandas as pd
from protflow.poses import normalize_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Normalize the Series
normalized_series = normalize_series(sample_series, scale=True)
Notes
If all values in the Series are the same, the function returns a Series of zeros.
The optional scaling step ensures that the values are adjusted to a standardized range.
- protflow.poses.scale_series(ser)[source]
Scales a pandas Series to a range between 0 and 1.
- Parameters:
ser (pd.Series) – The pandas Series to be scaled.
- Returns:
pd.Series – The scaled Series with values between 0 and 1.
- Return type:
pd.Series
Further Details
This function scales a pandas Series to a range between 0 and 1. It ensures that the minimum value in the Series becomes 0 and the maximum value becomes 1, with all other values adjusted proportionately.
Example
import pandas as pd
from protflow.poses import scale_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Scale the Series
scaled_series = scale_series(sample_series)
Notes
If all values in the Series are the same, the function returns a Series of zeros.
The scaling process adjusts the values to fit within a standardized range, making them comparable.
protflow.residues module
residues
The residues module is a part of the protflow package and is designed to handle residue selection and related operations in protein structures. This module provides functionality to parse, manipulate, and convert residue selections in various formats, making it an essential tool for bioinformatics and computational biology workflows.
The module includes the ResidueSelection class for representing and manipulating selections of residues, as well as various functions for parsing and converting residue selections.
Classes
- ResidueSelection
Represents a selection of residues with functionality for parsing, converting, and manipulating selections.
- AtomSelection
Represents an ordered selection of atoms for atom-level operations.
Functions
- fast_parse_selection
Fast parser for selections already in ResidueSelection format.
- parse_selection
Parses a selection into ResidueSelection formatted selection.
- parse_residue
Parses a single residue identifier into a tuple (chain, residue_index).
- residue_selection
Creates a ResidueSelection from a selection of residues.
- from_dict
Creates a ResidueSelection object from a dictionary specifying a motif.
- from_contig
Creates a ResidueSelection object from a contig string.
- reduce_to_unique
Reduces an input array to its unique elements while preserving order.
Example Usage
Creating and manipulating ResidueSelection objects:
from protflow.residues import ResidueSelection, from_dict, from_contig
# Create a ResidueSelection from a list
selection = ResidueSelection(["A1", "A2", "B3"])
# Convert to string
selection_str = selection.to_string()
print(selection_str)
# Output: A1, A2, B3
# Convert to dictionary
selection_dict = selection.to_dict()
print(selection_dict)
# Output: {'A': [1, 2], 'B': [3]}
# Create a ResidueSelection from a dictionary
selection_from_dict = from_dict({"A": [1, 2], "B": [3]})
print(selection_from_dict.to_string())
# Output: A1, A2, B3
# Create a ResidueSelection from a contig string
selection_from_contig = from_contig("A1-A3, B5")
print(selection_from_contig.to_string())
# Output: A1, A2, A3, B5
This module simplifies the process of handling residue selections in bioinformatics workflows, providing a consistent interface for different types of input and output formats.
- class protflow.residues.AtomSelection(atoms)[source]
Bases: object
Represent an ordered selection of atoms in a protein structure.
Atom IDs can be compact IDs (chain_id, res_id, atom_name) using model 0 implicitly, or full BioPython-style IDs with model and structure IDs. Atom ordering is preserved because RMSD calculation pairs atoms by position.
- Parameters:
atoms (AtomSelection, dict, list, or tuple) – Ordered atom selection to normalize. Supported atom ID forms are:
(chain_id, residue_id, atom_name)
(model_id, chain_id, residue_id, atom_name)
(structure_id, model_id, chain_id, residue_id, atom_name)
(structure_id, model_id, chain_id, residue_id, atom_name, altloc)
residue_id can be a compact integer-like value or a BioPython residue ID tuple (hetero_flag, residue_number, insertion_code). atom_name can be a string or a BioPython disordered atom ID tuple (atom_name, altloc). A scorefile-style dictionary with an "atoms" key is also accepted.
- atoms
Tuple of normalized atom IDs. Nested lists are converted to tuples so selections can be compared and used in set-like operations.
- Type:
tuple
- Raises:
TypeError – If atoms is not an AtomSelection, scorefile dictionary, or ordered sequence of atom IDs.
ValueError – If any atom ID has an unsupported shape or invalid chain, residue, or atom-name component.
- Parameters:
atoms (Any)
Notes
AtomSelection preserves order deliberately. Many atom-level operations, such as RMSD or geometry calculations, pair atoms by position rather than treating the selection as an unordered set.
Examples
Create a compact atom selection:
atoms = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])
Create the same selection from scorefile-compatible data:
atoms = AtomSelection({"atoms": [["A", 1, "N"], ["A", 1, "CA"]]})
- __add__(other)[source]
Combine two AtomSelections while preserving order and uniqueness.
- Parameters:
other (AtomSelection) – Selection to append to self. Atoms already present in self are skipped, matching the behavior of ResidueSelection.__add__().
- Returns:
AtomSelection – New selection containing all atoms from self followed by atoms from other that were not already present.
NotImplemented – Returned when other is not an AtomSelection, allowing Python's binary operator fallback behavior.
Examples
a = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])
b = AtomSelection([("A", 1, "CA"), ("A", 1, "C")])
(a + b).to_tuple()
# (("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"))
- __init__(atoms)[source]
Normalize and store an ordered atom selection.
- Parameters:
atoms (Any)
- Return type:
None
- __sub__(other)[source]
Remove atoms in another AtomSelection from this selection.
- Parameters:
other (AtomSelection) – Selection whose atoms should be removed from self.
- Returns:
AtomSelection – New selection containing atoms from self whose normalized atom IDs are absent from other. Original order is preserved.
NotImplemented – Returned when other is not an AtomSelection.
Examples
a = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])
b = AtomSelection([("A", 1, "CA")])
(a - b).to_tuple()
# (("A", 1, "N"),)
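The order-preserving semantics documented for __add__ and __sub__ can be illustrated on plain tuples of compact atom IDs. The functions ordered_union and ordered_difference are hypothetical stand-ins, not the class's implementation:

```python
def ordered_union(first, second):
    """All atoms of `first`, then atoms of `second` not already present."""
    seen = set(first)
    result = list(first)
    for atom in second:
        if atom not in seen:
            seen.add(atom)
            result.append(atom)
    return tuple(result)


def ordered_difference(first, second):
    """Atoms of `first` that are absent from `second`, in original order."""
    removed = set(second)
    return tuple(atom for atom in first if atom not in removed)


a = (("A", 1, "N"), ("A", 1, "CA"))
b = (("A", 1, "CA"), ("A", 1, "C"))

union = ordered_union(a, b)
difference = ordered_difference(a, b)
```

A set is used only for membership tests; the result is built from the input sequences so positional pairing (for example in RMSD calculations) stays meaningful.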
- static from_dict(input_dict, pose=None, residue_id_format='auto')[source]
Create an AtomSelection from a scorefile dict, nested atom dict, or RFD3 dict.
This is the dictionary-oriented constructor for AtomSelection. It supports three dictionary dialects:
{"atoms": [...]} for ProtFlow scorefile-compatible atom selections.
{"A": {1: ["N", "CA"]}} for explicit chain/residue/atom-name mappings.
RFD3 InputSelection dictionaries such as {"A1-2": "BKBN", "LIG": "C1,O1"}.
- Parameters:
input_dict (dict) – Dictionary describing an atom selection in one of the supported forms listed above.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to expand RFD3 aliases or residue-name selectors. A pose is required when values use ALL or TIP, when keys select ligands/residue names, or when exact atom names should be checked against the input structure.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls how residue IDs are written when atoms are read from pose. "auto" uses compact integer residue IDs for standard residues and BioPython residue IDs for hetero residues. "compact" always writes integer residue IDs. "biopython" always writes BioPython residue IDs.
- Returns:
Normalized atom selection described by input_dict.
- Return type:
AtomSelection
- Raises:
TypeError – If input_dict is not a dictionary or if atom-name values have an unsupported type.
ValueError – If the dictionary uses structure-dependent syntax but no pose is provided, or if requested atoms/components cannot be resolved.
Examples
Parse scorefile-compatible data:
AtomSelection.from_dict({"atoms": [["A", 1, "N"], ["A", 1, "CA"]]})
Parse a nested chain/residue mapping:
AtomSelection.from_dict({"A": {1: ["N", "CA"], 2: "C,O"}})
Parse an RFD3 InputSelection dictionary against a PDB file:
AtomSelection.from_dict({"A1-2": "BKBN", "LIG": "C1,O1"}, pose="input.pdb")
- static from_list(atoms)[source]
Create an AtomSelection from an ordered list or tuple of atom IDs.
- Parameters:
atoms (list or tuple) – Ordered atom IDs in any format accepted by AtomSelection. Passing a single atom ID such as ("A", 1, "N") is also supported.
- Returns:
Normalized atom selection preserving the order supplied in atoms.
- Return type:
AtomSelection
- Raises:
TypeError – If atoms is not sequence-like.
ValueError – If any atom ID is malformed.
Examples
AtomSelection.from_list([("A", 1, "N"), ("A", 1, "CA")])
- static from_rfd3_contig(input_contig, pose=None, atom_names='ALL', model_id=0, residue_id_format='auto')[source]
Create an AtomSelection from indexed parts of an RFD3 contig string.
Generated-length components such as 10/10-20 and chain breaks like /0 are skipped. With pose provided, atom_names="ALL" expands to the atoms present in the structure and ligand/residue-name components can be resolved. Without a pose, atom_names must be an explicit atom list or an alias that does not require structure context such as BKBN.
- Parameters:
input_contig (str) – RFD3 contig string. Indexed residue components such as "A1", "A1-5", and "A1-A5" are converted to atom IDs. Diffused length components and chain breaks are ignored because they do not refer to atoms in the input structure.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to expand ALL atoms, validate explicit atom names, and resolve ligand/residue-name components. If omitted, only indexed residue components with explicit atom-name values can be parsed.
atom_names (str, list, or tuple, optional) – Atom names to select from every indexed component. Supported RFD3 aliases are "ALL", "BKBN", and "TIP". Explicit names can be supplied as comma-separated strings such as "N,CA,C,O" or as lists/tuples of strings.
model_id (int or str, optional) – BioPython model identifier used when pose is a Structure object or a path to a multi-model file. Defaults to 0.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.
- Returns:
Atom selection for the indexed input components in input_contig.
- Return type:
AtomSelection
- Raises:
TypeError – If input_contig is not a string.
ValueError – If a selected component or requested atom cannot be resolved, or if structure-dependent syntax is used without pose.
Examples
Select backbone atoms from indexed residues without loading a pose:
AtomSelection.from_rfd3_contig("10,A1-2,/0,B5", atom_names="BKBN")
Select all atoms present in an input structure:
AtomSelection.from_rfd3_contig("A1-2,/0,Z9", pose="input.pdb")
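The skipping rule for contig components can be sketched with a small regular expression. This is not ProtFlow's parser: indexed_atoms is a hypothetical helper, it only handles explicit atom names (no aliases such as BKBN, no ligand names, no pose), and it assumes both ends of a range lie on the same chain.

```python
import re

# Matches "A1", "A1-5" and "A1-A5"; bare lengths ("10", "10-20") and
# chain breaks ("/0") do not start with a chain letter and are rejected.
INDEXED = re.compile(r"([A-Za-z])(\d+)(?:-[A-Za-z]?(\d+))?")


def indexed_atoms(contig: str, atom_names):
    """Expand indexed contig components into (chain, residue, atom_name) IDs."""
    atoms = []
    for part in contig.split(","):
        match = INDEXED.fullmatch(part.strip())
        if match is None:
            continue  # generated-length component, chain break, or ligand name
        chain, start, end = match.group(1), int(match.group(2)), match.group(3)
        stop = int(end) if end is not None else start
        for residue in range(start, stop + 1):
            atoms.extend((chain, residue, name) for name in atom_names)
    return atoms


atoms = indexed_atoms("10,A1-2,/0,B5", ["N", "CA"])
```

Only A1, A2 and B5 contribute atoms here; the leading 10 and the /0 chain break are dropped because they do not refer to residues in the input structure.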
- static from_rfd3_input_selection(input_selection, pose=None, model_id=0, residue_id_format='auto')[source]
Create an AtomSelection from an RFD3 InputSelection value.
Supported RFD3 forms are booleans, contig-style strings, and dictionaries whose keys are residue/ligand selections and whose values are atom names, ALL, BKBN, TIP, or explicit atom-name lists. A pose is required for booleans, ALL, TIP, and ligand/residue name selection because those cases need the actual atoms in the input structure.
- Parameters:
input_selection (None, bool, str, dict, AtomSelection, list, or tuple) – RFD3 InputSelection-like value to parse. Supported forms are:
None – Returns an empty AtomSelection.
True / False – Select all atoms in pose or no atoms, respectively.
str – Parses a contig-style selector such as "A1-10,B5" or a ligand/residue name such as "LIG". String selections imply ALL atoms for matching components.
dict – Parses RFD3 dictionary syntax where keys are components and values are atom selectors, e.g. {"A1": "BKBN"}.
AtomSelection or atom-ID list/tuple – Normalizes the existing atom selection directly.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used for syntax that depends on actual atoms or residue names.
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.
- Returns:
Normalized atom selection represented by input_selection.
- Return type:
AtomSelection
- Raises:
TypeError – If input_selection has an unsupported type.
ValueError – If the selection requires a structure but pose is absent, or if selected residues/atoms cannot be found.
Notes
This parser mirrors the user-facing RFD3 InputSelection grammar without importing RFD3 or Foundry at runtime. It intentionally returns concrete atom IDs rather than RFD3 masks.
Examples
Parse explicit atoms without a pose:
AtomSelection.from_rfd3_input_selection({"A1-2": "BKBN"})
Parse ligand atoms and TIP atoms from a structure:
AtomSelection.from_rfd3_input_selection({"LIG": "ALL", "A20": "TIP"}, pose="input.pdb")
- static from_rfd3_input_spec(input_spec, pose=None, fields=None, include_ligand=True, model_id=0, residue_id_format='auto')[source]
Parse RFD3 InputSelection fields from one InputSpecification.
Returns a dictionary mapping each parsed field name to an AtomSelection. If pose is not provided, input_spec["input"] is used when present. The RFD3 ligand field is included by default even though it is not typed as InputSelection in RFD3 itself.
- Parameters:
input_spec (dict) – One RFD3 InputSpecification dictionary, for example one value from an RFD3Params object.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to resolve InputSelection fields. When omitted, input_spec["input"] is used if present.
fields (list or tuple of str, optional) – InputSelection field names to parse. Defaults to all RFD3 InputSelection fields known to ProtFlow: contig, unindex, select_fixed_atoms, select_unfixed_sequence, select_buried, select_partially_buried, select_exposed, select_hbond_donor, select_hbond_acceptor, and select_hotspots.
include_ligand (bool, optional) – If True (default), parse the RFD3 ligand field into an AtomSelection under the key "ligand".
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.
- Returns:
Mapping from each parsed input-specification field to the corresponding AtomSelection. Fields absent from input_spec or set to None are omitted.
- Return type:
dict[str, AtomSelection]
- Raises:
TypeError – If input_spec is not a dictionary.
ValueError – If any requested field cannot be resolved to atoms.
Examples
Parse all atom-level selections from an RFD3 spec:
spec = {
    "input": "input.pdb",
    "contig": "A1-20,/0,50-80",
    "select_fixed_atoms": {"A10": "BKBN", "LIG": "C1,O1"},
    "ligand": "LIG",
}
selections = AtomSelection.from_rfd3_input_spec(spec)
fixed_atoms = selections["select_fixed_atoms"]
- static from_rfd3_ligand(ligand, pose, model_id=0, residue_id_format='auto')[source]
Create an AtomSelection from an RFD3 ligand specification.
Ligands can be selected by residue name ("LIG" or "LIG,ACT") or by indexed residue components such as "Z9".
- Parameters:
ligand (str) – RFD3 ligand selector. Comma-separated residue names select all matching non-protein residues in the input structure. Indexed residue components such as "Z9" can also be used.
pose (str, os.PathLike, Bio.PDB entity) – Input structure containing the ligand atoms. This argument is required because ligand names must be resolved against the actual structure.
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.
- Returns:
Selection containing all atoms selected by the ligand specification.
- Return type:
AtomSelection
- Raises:
ValueError – If pose is omitted or if the ligand selector does not match the input structure.
Examples
Select all atoms in ligands named LIG and ACT:
AtomSelection.from_rfd3_ligand("LIG,ACT", pose="input.pdb")
- class protflow.residues.ResidueSelection(selection=None, delim=',', fast=False, from_scorefile=False)[source]
Bases: object
Represent a selection of residues in a protein structure.
A selection of residues is represented as a tuple with the hierarchy ((chain, residue_idx), …).
- Parameters:
selection (list, optional) – A list of residues in string format, e.g., ["A1", "A2", "B3"]. Default is None.
delim (str, optional) – The delimiter used to parse the selection string. Default is ",".
fast (bool, optional) – If True, parses the selection without any type checking. Use when selection is already in ResidueSelection format. Default is False.
from_scorefile (bool)
Examples
>>> from protflow.residues import ResidueSelection
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}
- from_selection(selection)[source]
Constructs a ResidueSelection instance from the provided selection.
- to_dict()[source]
Converts the ResidueSelection to a dictionary.
Note
Converting to a dictionary destroys the ordering of specific residues on the same chain in a motif.
- Returns:
A dictionary representation of the ResidueSelection with chains as keys and lists of residue indices as values.
- Return type:
dict
Examples
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}
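The ordering caveat in the note above can be illustrated with plain tuples. The sketch shows one way ordering information is lost, namely that grouping by chain discards how residues from different chains were interleaved. The helpers to_chain_dict and from_chain_dict are hypothetical, and the real to_dict may reorder residues further.

```python
def to_chain_dict(residues):
    """Group (chain, index) pairs by chain, keeping per-chain encounter order."""
    grouped = {}
    for chain, index in residues:
        grouped.setdefault(chain, []).append(index)
    return grouped


def from_chain_dict(grouped):
    """Flatten a {chain: [indices]} mapping back into (chain, index) pairs."""
    return tuple((chain, index) for chain, indices in grouped.items() for index in indices)


motif = (("A", 5), ("B", 3), ("A", 1))
as_dict = to_chain_dict(motif)
round_trip = from_chain_dict(as_dict)
```

After the round trip, the residue on chain B has moved from the middle of the motif to the end, so code that relies on motif order should keep the tuple form.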
- to_list(ordering=None)[source]
Converts the ResidueSelection to a list of strings.
- Parameters:
ordering (str, optional) – Specifies the ordering of the residues in the output list. Options are "rosetta" or "pymol". Default is None.
- Returns:
The list representation of the ResidueSelection.
- Return type:
list
Examples
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_list())
['A1', 'A2', 'B3']
>>> print(selection.to_list(ordering="rosetta"))
['1A', '2A', '3B']
- to_rfdiffusion_contig()[source]
Parses ResidueSelection object to contig string for RFdiffusion.
Example
If self.residues = ((“A”, 1), (“A”, 2), (“A”, 3), (“C”, 4), (“C”, 6)), the output will be “A1-3,C4,C6”.
- Return type:
str
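The collapsing of consecutive residues into ranges shown in the example can be sketched as follows. The function to_contig is a hypothetical illustration that operates on a plain tuple of (chain, index) pairs, not the method's actual implementation:

```python
def to_contig(residues):
    """Collapse runs of consecutive residues on one chain into 'A1-3' style ranges."""
    runs = []  # each run is [chain, start, end]
    for chain, index in residues:
        if runs and runs[-1][0] == chain and runs[-1][2] + 1 == index:
            runs[-1][2] = index  # extend the current run by one residue
        else:
            runs.append([chain, index, index])  # start a new run
    return ",".join(
        f"{chain}{start}" if start == end else f"{chain}{start}-{end}"
        for chain, start, end in runs
    )


contig = to_contig((("A", 1), ("A", 2), ("A", 3), ("C", 4), ("C", 6)))
print(contig)  # A1-3,C4,C6
```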
- to_string(delim=',', ordering=None)[source]
Converts the ResidueSelection to a string.
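- Parameters:
delim (str, optional) – The delimiter placed between residues in the output string. Default is ",".
ordering (str, optional) – Specifies the ordering of the residues in the output string. Options are "rosetta" or "pymol". Default is None.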
- Returns:
ResidueSelection object formatted as a string, with residues separated by delim.
- Return type:
str
Examples
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_string(ordering="rosetta"))
1A, 2A, 3B
- protflow.residues.fast_parse_selection(input_selection)[source]
Fast selection parser for pre-formatted selections.
This function is a fast parser for residue selections that are already in the ResidueSelection format. It bypasses any additional type checking or parsing to improve performance when the input is guaranteed to be correctly formatted.
- Parameters:
input_selection (tuple of tuple of (str, int)) – A tuple of tuples where each inner tuple represents a residue with the format (chain, residue_index).
- Returns:
The input selection, unchanged.
- Return type:
tuple
Examples
>>> input_selection = (("A", 1), ("B", 2), ("C", 3))
>>> fast_parse_selection(input_selection)
(('A', 1), ('B', 2), ('C', 3))
- protflow.residues.from_contig(input_contig)[source]
Creates a ResidueSelection object from a contig string.
This function constructs a ResidueSelection instance from a contig string. The contig string can specify ranges of residues using a hyphen (-) to denote the range, with residues separated by commas (,). For example, “A1-A3, B5” specifies residues A1, A2, A3, and B5.
- Parameters:
input_contig (str) – A contig string specifying the residues. Ranges can be denoted using hyphens, and residues are separated by commas.
- Returns:
An instance of the ResidueSelection class representing the parsed selection of residues.
- Return type:
ResidueSelection
Examples
>>> from_contig("A1-A3, B5")
<ResidueSelection object representing ('A', 1), ('A', 2), ('A', 3), ('B', 5)>
>>> from_contig("C1, C3-C5, D2")
<ResidueSelection object representing ('C', 1), ('C', 3), ('C', 4), ('C', 5), ('D', 2)>
- protflow.residues.from_dict(input_dict)[source]
Creates a ResidueSelection object from a dictionary.
This function constructs a ResidueSelection instance from a dictionary where the keys represent chain identifiers and the values are lists of residue indices. This format specifies a motif in the following way: {chain: [residues], …}.
- Parameters:
input_dict (dict) – A dictionary specifying the motif. The keys are chain identifiers (str) and the values are lists of residue indices (int).
- Returns:
An instance of the ResidueSelection class representing the parsed selection of residues.
- Return type:
ResidueSelection
Examples
>>> input_dict = {"A": [1, 2], "B": [3, 4]}
>>> from_dict(input_dict)
<ResidueSelection object representing ('A', 1), ('A', 2), ('B', 3), ('B', 4)>
- protflow.residues.parse_from_scorefile(input_selection)[source]
Helper to parse ResidueSelection object from ProtFlow scorefile format.
- protflow.residues.parse_residue(residue_identifier)[source]
Parses a single residue identifier into a tuple (chain, residue_index).
This function takes a residue identifier string and parses it into a tuple containing the chain identifier and the residue index. It currently only supports single-letter chain identifiers.
- Parameters:
residue_identifier (str) – A string representing the residue identifier. The format is expected to be either "chain+residue_index" or "residue_index+chain", where "chain" is a single letter and "residue_index" is an integer.
- Returns:
A tuple containing the chain identifier and the residue index.
- Return type:
tuple of (str, int)
Examples
>>> parse_residue("A123")
('A', 123)
>>> parse_residue("123A")
('A', 123)
Notes
The function determines whether the chain identifier is at the beginning or the end of the string based on whether the first character is a digit.
Only single-letter chain identifiers are supported.
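The rule in the notes above (look at whether the first character is a digit) can be sketched as below. The name parse_residue_sketch is hypothetical; this is an illustration of the described logic, not the library source.

```python
def parse_residue_sketch(residue_identifier: str):
    """Split 'A123' or '123A' into (chain, residue_index)."""
    text = residue_identifier.strip()
    if text[0].isdigit():
        # Index first, chain letter last (e.g. "123A").
        chain, index = text[-1], text[:-1]
    else:
        # Chain letter first, index after it (e.g. "A123").
        chain, index = text[0], text[1:]
    return chain, int(index)


chain_first = parse_residue_sketch("A123")
index_first = parse_residue_sketch("123A")
```

Both orderings yield the same tuple, which is why downstream code can treat the two notations interchangeably.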
- protflow.residues.parse_selection(input_selection, delim=',', fast=False, from_scorefile=False)[source]
Parses a selection into ResidueSelection formatted selection.
This function takes a selection of residues in various formats and parses it into the ResidueSelection format, which is a tuple of tuples. Each inner tuple represents a residue with the format (chain, residue_index).
- Parameters:
input_selection (str, list, or tuple) – The selection of residues to be parsed. This can be:
A string with residues separated by a delimiter.
A list or tuple of residue strings.
A list or tuple of lists/tuples, where each inner list/tuple represents a residue.
delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is ",".
fast (bool, optional) – If True, uses fast_parse_selection to bypass type checking and parsing for performance reasons. Use when input_selection is already in the correct format. Default is False.
from_scorefile (bool, optional) – If True, parses a residue selection that was read in from a scorefile (in the form {'residues': [['A', 1], ['B', 3]]}). Default is False.
- Returns:
A tuple of tuples where each inner tuple represents a residue in the format (chain, residue_index).
- Return type:
tuple
- Raises:
TypeError – If input_selection is not a supported type (str, list, or tuple).
Examples
>>> parse_selection("A1, B2, C3")
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection(["A1", "B2", "C3"])
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection([["A", 1], ["B", 2], ["C", 3]])
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection([("A", 1), ("B", 2), ("C", 3)], fast=True)
(('A', 1), ('B', 2), ('C', 3))
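How the three accepted input shapes could be normalised into one tuple of (chain, residue_index) pairs is sketched below. It is an assumption-laden illustration rather than the library code: parse_selection_sketch and _parse_residue are hypothetical names, and the fast and from_scorefile paths are left out.

```python
def _parse_residue(residue_id):
    """Split 'A12' or '12A' into (chain, residue_index)."""
    residue_id = residue_id.strip()
    if residue_id[0].isdigit():
        return residue_id[-1], int(residue_id[:-1])
    return residue_id[0], int(residue_id[1:])


def parse_selection_sketch(input_selection, delim=","):
    """Normalise a string, or a list/tuple of residues, into a tuple of tuples."""
    if isinstance(input_selection, str):
        # "A1, B2" -> ["A1", " B2"]; empty fragments are dropped
        items = [part for part in input_selection.split(delim) if part.strip()]
    elif isinstance(input_selection, (list, tuple)):
        items = input_selection
    else:
        raise TypeError(f"Unsupported selection type: {type(input_selection).__name__}")

    parsed = []
    for item in items:
        if isinstance(item, str):
            parsed.append(_parse_residue(item))
        else:
            # Already a (chain, index) pair given as list or tuple
            chain, index = item
            parsed.append((str(chain), int(index)))
    return tuple(parsed)
```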
- protflow.residues.reduce_to_unique(input_array)[source]
Reduces an input array to its unique elements while preserving order.
This function takes a list or tuple and returns a new list or tuple containing only the unique elements from the input, with their original order preserved. The type of the returned collection matches the type of the input.
- Parameters:
input_array (list or tuple) – The input array from which to remove duplicate elements. The order of the elements is preserved.
- Returns:
A new list or tuple containing only the unique elements from the input array, with the original order preserved.
- Return type:
list or tuple
Examples
>>> reduce_to_unique([1, 2, 2, 3, 1])
[1, 2, 3]
>>> reduce_to_unique(("a", "b", "a", "c", "b"))
('a', 'b', 'c')
Notes
The function uses OrderedDict.fromkeys to remove duplicates while preserving order.
The returned collection is of the same type as the input (list or tuple).
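The OrderedDict.fromkeys technique mentioned in the notes can be illustrated as follows. This is a minimal sketch with a hypothetical function name, and it assumes the elements are hashable.

```python
from collections import OrderedDict


def reduce_to_unique_sketch(input_array):
    """Drop duplicates, keep first-seen order, and return the input's type."""
    # fromkeys keeps one key per distinct element, in insertion order
    unique_keys = OrderedDict.fromkeys(input_array)
    # Rebuild a list or tuple, matching the type that was passed in
    return type(input_array)(unique_keys)
```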
- protflow.residues.residue_selection(input_selection, delim=',')[source]
Creates a ResidueSelection from a selection of residues.
This function takes an input selection of residues in various formats and creates a ResidueSelection object. The selection can be provided as a string, list, or tuple.
- Parameters:
input_selection (str, list, or tuple) – The selection of residues to be parsed. This can be:
A string with residues separated by a delimiter.
A list or tuple of residue strings.
A list or tuple of lists/tuples, where each inner list/tuple represents a residue.
delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is “,”.
- Returns:
An instance of the ResidueSelection class representing the parsed selection of residues.
- Return type:
ResidueSelection
Examples
>>> residue_selection("A1, B2, C3")
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>
>>> residue_selection(["A1", "B2", "C3"])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>
>>> residue_selection([["A", 1], ["B", 2], ["C", 3]])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>
protflow.runners module
runners module
This module provides functionality for handling the interaction between runners and poses in protein data processing workflows.
It includes classes and utility functions to:
Manage the output from runner processes.
Define abstract runner interfaces.
Parse and manage command-line options and flags for runner processes.
Dependencies:
builtins: logging, os, re
pandas
protflow.poses: Poses, get_format, FORMAT_STORAGE_DICT
protflow.jobstarters: JobStarter
Overview:
The runners module is designed to facilitate the integration of various runner processes with protein pose data, ensuring consistent data formatting, error handling, and integration of results into the Poses class. Utility functions provided in this module support the parsing and handling of command-line options and flags, making it easier to configure and execute runner processes in a flexible manner.
Notes
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.
Author
Markus Braun, Adrian Tripp
Version
0.1.0
- class protflow.runners.Runner[source]
Bases: object
Abstract Runner base class
The Runner class provides an abstract base for defining runners that handle the interface between runner processes and the Poses class. It includes methods for running jobs, checking paths, verifying prefixes, preparing pose options, and managing job setup and score files.
Examples
To create a custom runner, subclass Runner and implement the abstract methods:
>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"
...
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass
Example usage:
>>> my_runner = MyRunner()
>>> poses = Poses()
>>> jobstarter = JobStarter()
>>> runner_output = my_runner.run(poses, "example_prefix", jobstarter)
- exception CrashError[source]
Bases: RuntimeError
Re-raised error with job stderr context when collect_scores fails.
- __str__()[source]
Abstract method to provide the name of the runner.
This method should be overridden in subclasses to return the name of the runner.
- Raises:
NotImplementedError – If the method is not overridden in the subclass.
Examples
>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"
- check_for_existing_scorefile(scorefile, overwrite=False)[source]
Checks if a scorefile exists and returns it as a DataFrame if overwrite is False.
- Parameters:
- Returns:
The scorefile as a DataFrame if it exists and overwrite is False. None otherwise.
- Return type:
Examples
>>> runner = MyRunner()
>>> scores_df = runner.check_for_existing_scorefile("/path/to/scorefile.csv")
- check_for_prefix(prefix, poses)[source]
Checks if a column with the given prefix already exists in the Poses DataFrame.
- Parameters:
prefix (str) – The prefix to be checked.
poses (Poses) – An instance of the Poses class whose DataFrame will be checked.
- Raises:
KeyError – If a column with the given prefix already exists in the Poses DataFrame.
- Return type:
None
Examples
>>> runner = MyRunner()
>>> poses = Poses()
>>> runner.check_for_prefix("example_prefix", poses)
- generic_run_setup(poses, prefix, jobstarters, make_work_dir=True)[source]
Sets up the runner’s working directory and jobstarter.
Checks if the prefix exists in poses.df, sets up a jobstarter, and creates the working directory if necessary.
Returns absolute path to working directory and the jobstarter that will be used for the runner.
- Parameters:
poses (Poses) – An instance of the Poses class.
prefix (str) – The prefix to be used for the setup.
jobstarters (list[JobStarter]) – A list of JobStarter instances to choose from.
make_work_dir (bool, optional) – Whether to create the working directory if it does not exist (default is True).
Note: The order of the entries in the jobstarters parameter is [Runner.run(jobstarter), Runner.jobstarter, poses.default_jobstarter].
- Returns:
A tuple containing the path to the working directory and the selected JobStarter instance.
- Return type:
tuple[str, JobStarter]
- Raises:
ValueError – If no valid JobStarter is set.
Examples
>>> runner = MyRunner()
>>> poses = Poses()
>>> jobstarters = [JobStarter(), JobStarter(), JobStarter()]
>>> work_dir, jobstarter = runner.generic_run_setup(poses, "example_prefix", jobstarters)
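The priority order among the three jobstarter candidates can be pictured as a first-match search. The sketch below is a simplified illustration under the assumption that unset candidates are represented by None; pick_jobstarter is a hypothetical helper, not a ProtFlow function.

```python
def pick_jobstarter(candidates):
    """Return the first candidate that is set, in priority order.

    Assumed order, as described for generic_run_setup:
    [Runner.run(jobstarter), Runner.jobstarter, poses.default_jobstarter]
    """
    for candidate in candidates:
        if candidate is not None:
            return candidate
    # Mirrors the documented ValueError when no valid JobStarter is set
    raise ValueError("No valid JobStarter was set.")
```

For example, pick_jobstarter([None, runner_js, poses_js]) would select runner_js because no jobstarter was passed to run() directly.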
- prep_pose_options(poses, pose_options=None)[source]
Prepares pose options, ensuring they are of the same length as the poses.
- Parameters:
poses (Poses) – An instance of the Poses class.
pose_options (list[str], optional) – A list of pose options to be prepared. If not provided, an empty list will be used.
- Returns:
A list of prepared pose options.
- Return type:
- Raises:
ValueError – If the length of pose_options does not match the length of poses.
Examples
>>> runner = MyRunner()
>>> poses = Poses()
>>> prepared_options = runner.prep_pose_options(poses, ["option1", "option2"])
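The length check can be sketched as below. This is one plausible reading rather than the library code: the function name is hypothetical, it takes the number of poses instead of a Poses object, and it assumes that a missing pose_options argument expands to one empty placeholder per pose.

```python
def prep_pose_options_sketch(n_poses, pose_options=None):
    """Return one option entry per pose, or raise if the lengths disagree."""
    if pose_options is None:
        # Assumption: no options given means "nothing extra" for every pose
        pose_options = [None] * n_poses
    if len(pose_options) != n_poses:
        raise ValueError(
            f"Got {len(pose_options)} pose options for {n_poses} poses."
        )
    return list(pose_options)
```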
- run(poses, prefix, jobstarter)[source]
Abstract method to run jobs and send scores to Poses.
This method should be overridden in subclasses to define the job execution logic and integrate the results into the Poses class.
- Parameters:
poses (Poses) – An instance of the Poses class to be processed.
prefix (str) – Prefix to be added to the results columns.
jobstarter (JobStarter) – An instance of the JobStarter class to handle job execution.
- Returns:
An instance of the RunnerOutput class containing the processed results.
- Return type:
RunnerOutput
- Raises:
NotImplementedError – If the method is not overridden in the subclass.
Examples
>>> class MyRunner(Runner):
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass
- save_runner_scorefile(scores, scorefile)[source]
Saves the runner’s scorefile based on the file extension format.
- Parameters:
scores (pandas.DataFrame) – The DataFrame containing the scores to be saved.
scorefile (str) – The path to the scorefile to be saved.
- Raises:
KeyError – If the file extension format is not recognized.
- Return type:
None
Examples
>>> runner = MyRunner()
>>> scores_df = pd.DataFrame({'score': [1, 2, 3]})
>>> runner.save_runner_scorefile(scores_df, "/path/to/scorefile.csv")
- search_path(input_path, path_name, is_dir=False)[source]
Checks if a given path exists and is valid.
- Parameters:
- Returns:
The validated path.
- Return type:
- Raises:
ValueError – If the path is not set or does not exist on the local filesystem.
Examples
>>> runner = MyRunner()
>>> valid_path = runner.search_path("/path/to/file", "example_path")
- class protflow.runners.RunnerOutput(poses, results, prefix, index_layers=0, index_sep='_')[source]
Bases: object
RunnerOutput class
The RunnerOutput class handles how protein data is passed between Runner and Poses classes. It ensures the correct formatting of results and facilitates the integration of runner outputs into the Poses data structure.
- Parameters:
poses (Poses) – An instance of the Poses class.
results (pandas.DataFrame) – A DataFrame containing the results to be checked and formatted. The DataFrame must contain ‘description’ and ‘location’ columns.
prefix (str) – A prefix to be added to the results columns.
index_layers (int, optional) – Number of index layers to remove from the ‘description’ column (default is 0).
index_sep (str, optional) – Separator used in the index (default is “_”).
- check_data_formatting(results)[source]
Checks if the input DataFrame has the correct format.
- Parameters:
results (pandas.DataFrame) – The input DataFrame to be checked. It must contain ‘description’ and ‘location’ columns.
- Returns:
The validated and formatted DataFrame.
- Return type:
pandas.DataFrame
ValueError – If the input DataFrame does not contain the required columns or if the ‘description’ column does not match the ‘location’ column.
- return_poses()[source]
Integrates the output of a runner into a Poses class.
This method adds the output of a Runner class formatted in RunnerOutput into Poses.df and returns the updated Poses instance.
- Returns:
The updated Poses instance with the integrated runner output.
- Return type:
Poses
- Raises:
ValueError – If merging DataFrames fails due to no overlap between Poses.df[‘poses_description’] and results[new_df_col] or if some rows in results[new_df_col] were not found in Poses.df[‘poses_description’].
- class protflow.runners.SbatchArrayRunnerTimer(runner)[source]
Bases: Runner
SbatchArrayRunnerTimer Class
Instrumentation wrapper that profiles any ProtFlow Runner on SLURM.
SbatchArrayRunnerTimer wraps an arbitrary Runner instance and, after each call to run(), queries SLURM’s accounting database via get_SLURM_stats() to collect per-job resource statistics. All timing and statistics records are accumulated in history and can be exported at any time via report().
The class inherits from Runner and uses __getattr__() to transparently proxy attribute lookups that are not defined on the wrapper itself to the wrapped runner, so it can serve as a drop-in replacement in any ProtFlow pipeline without modifying the surrounding code.
Warning
Profiling relies on get_SLURM_stats(), which calls sacct and therefore requires the process to be running on the cluster login node. See get_SLURM_stats() for details.
- Parameters:
runner (Runner) – Any instantiated ProtFlow Runner (e.g. CalibySequenceDesign, LigandMPNN, etc.) whose run() calls should be timed and profiled.
- history
Accumulated statistics records. Each entry corresponds to one successfully profiled run() call and contains all keys returned by get_SLURM_stats() plus the four keys added by run() (runner_class, prefix, total_python_wall_sec, overhead_plus_queue_sec). Empty until the first successful profiled run completes.
- job_ids
SLURM job names recorded for each run() call, in call order. An entry of None indicates that last_job_name could not be retrieved (e.g. because a non-SLURM jobstarter was used and the guard did not fire before the append).
- session_start
ISO-8601 timestamp (YYYY-MM-DDTHH:MM:SS) set at construction time to one minute before instantiation. Passed as start_time to every get_SLURM_stats() call so that only jobs from the current session are returned by sacct, preventing false matches against stale jobs with the same name from earlier sessions.
- Type:
str
Notes
__init__ calls super().__init__() to satisfy the Runner base-class contract, making all base-class utilities (e.g. scorefile helpers) available on self in addition to the wrapped self.runner.
session_start is backdated by one minute to guard against off-by-one errors on clusters with coarse sacct timestamp resolution.
history grows unboundedly across run() calls within the same Python session. For very long pipelines, consider calling report() periodically and resetting self.history = [] if memory usage is a concern.
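The two mechanisms described above, attribute proxying through __getattr__ and a session start time backdated by one minute, can be sketched in isolation. The class below is a hypothetical stand-in that does not inherit from Runner and does not talk to SLURM; it only illustrates the pattern.

```python
from datetime import datetime, timedelta


class TimerWrapperSketch:
    """Hypothetical wrapper illustrating proxying and a backdated timestamp."""

    def __init__(self, runner):
        self.runner = runner
        self.history = []
        self.job_ids = []
        # Backdate by one minute so jobs submitted right away are not missed
        backdated = datetime.now() - timedelta(minutes=1)
        self.session_start = backdated.strftime("%Y-%m-%dT%H:%M:%S")

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so the
        # wrapper's own attributes (history, job_ids, ...) are never proxied.
        if name == "runner":
            # Avoid infinite recursion if 'runner' itself is not set yet
            raise AttributeError(name)
        return getattr(self.runner, name)
```

With this pattern, wrapper.some_option resolves to wrapper.runner.some_option whenever the wrapper does not define some_option itself.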
Examples
Wrap a LigandMPNN runner and time three sequential design rounds:
from protflow.tools.ligandmpnn import LigandMPNN
from protflow.runners import SbatchArrayRunnerTimer

timed_runner = SbatchArrayRunnerTimer(LigandMPNN())
for prefix in ["round1", "round2", "round3"]:
    poses = timed_runner.run(poses, prefix=prefix, nseq=20)
summary = timed_runner.report(prefix="full_pipeline")
print(summary[["prefix", "total_python_wall_sec", "avg_task_runtime_sec"]])
- report(prefix=None)[source]
Export accumulated timing and SLURM statistics to disk and return as a DataFrame.
Converts history to a DataFrame and, when prefix is provided, writes two files to the current working directory:
<prefix>_stats.csv: the full statistics table, one row per profiled run() call, written with to_csv() (index column included).
<prefix>_job_ids.txt: a newline-delimited list of all SLURM job names from job_ids, in the order the runs were performed.
- Parameters:
prefix (str, optional) – Filename stem for the output files. When None, no files are written and only the in-memory DataFrame is returned. When provided, both output files are created or overwritten in the current working directory.
- Returns:
DataFrame built from history, with one row per profiled run() call. Columns are the union of all keys present in history entries. Guaranteed columns (when at least one profiled run has completed) include:
runner_class (str) – Class name of the wrapped runner for that run.
prefix (str) – The prefix used in that run() call.
total_python_wall_sec (float) – Total Python wall-clock time for that run (seconds).
overhead_plus_queue_sec (float) – Estimated overhead + queue-wait time (seconds).
job_name (str) – SLURM job name queried by get_SLURM_stats().
total_cpu_sec (int) – Total CPU-core-seconds reserved across all tasks.
avg_task_runtime_sec (float) – Mean per-task wall-clock elapsed time (seconds).
max_task_runtime_sec (int) – Longest per-task wall-clock elapsed time (seconds).
min_task_runtime_sec (int) – Shortest per-task wall-clock elapsed time (seconds).
num_tasks (int) – Number of SLURM array tasks.
total_cpus_reserved (int) – Total CPU cores allocated across all tasks.
state (str) – Aggregated job-array completion state.
queried_after (str or None) – The sacct start-time filter used for that query.
Returns an empty DataFrame when history is empty (i.e. before any profiled run has completed, or when all runs used a non-SLURM jobstarter).
- Return type:
pandas.DataFrame
Notes
report() is called automatically at the end of every successful profiled run() call using that run’s prefix, so the CSV and job-ID files are always up to date after each run. Manual calls to report() are useful for retrieving an in-memory summary or writing a consolidated report under a different prefix after multiple runs.
The job-ID file is written from job_ids (not from the job_name column of history), which means it includes entries from runs where last_job_name was None or where the non-SLURM guard fired before the append. None values will appear as the literal string "None" in the file.
Output files are written with UTF-8 encoding and will overwrite existing files of the same name without prompting.
Examples
Inspect stats after two runs and write a combined report:
timed = SbatchArrayRunnerTimer(CalibySequenceDesign())
poses = timed.run(poses, prefix="round1", nseq=5)
poses = timed.run(poses, prefix="round2", nseq=10)
df = timed.report(prefix="pipeline_summary")
# Writes:
#   pipeline_summary_stats.csv
#   pipeline_summary_job_ids.txt
print(df[["prefix", "total_python_wall_sec", "avg_task_runtime_sec"]])
#    prefix  total_python_wall_sec  avg_task_runtime_sec
# 0  round1                 245.12                228.40
# 1  round2                 510.87                491.33
In-memory summary without writing files:
df = timed.report()  # prefix=None: no files written
print(df[["state", "num_tasks", "total_cpu_sec"]].to_string())
- run(poses, prefix, jobstarter=None, **kwargs)[source]
Execute the wrapped runner and collect timing and SLURM statistics.
Delegates the actual computation to runner via self.runner.run(poses, prefix, jobstarter, **kwargs) and then, if a SbatchArrayJobstarter was used, queries SLURM’s accounting database for per-job resource statistics using get_SLURM_stats(). The combined timing and cluster stats record is appended to history and report() is called automatically to persist an up-to-date CSV and job-ID file.
The method measures time across three consecutive phases:
Phase 1 (wrapper start): time.perf_counter() is captured immediately before delegating to the wrapped runner.
Phase 2 (runner execution): the full body of self.runner.run(), which internally performs ProtFlow setup, submits the SLURM array job, blocks until all tasks complete (wait=True), and post-processes the results.
Phase 3 (wrapper end): time.perf_counter() is captured immediately after the wrapped runner returns.
- Parameters:
poses (Poses) – Input pose collection, forwarded verbatim to self.runner.run.
prefix (str) – Column prefix and working-directory identifier forwarded to self.runner.run and used to name the output CSV and job-ID files written by report().
jobstarter (JobStarter, optional) – Job submission backend. When provided, this value is passed to the wrapped runner and is also used to determine whether SLURM accounting can be queried. When omitted, the jobstarter is resolved from self.runner.jobstarter and then from poses.default_jobstarter for the purpose of stat collection.
**kwargs – All additional keyword arguments are forwarded unchanged to self.runner.run, making the timer compatible with any runner regardless of its specific signature.
- Returns:
Poses – The Poses object returned by the wrapped runner, unchanged. Timing and statistics are stored in history and written to disk by report(); they do not alter the returned poses.
Side Effects
When profiling succeeds (SLURM jobstarter detected and last_job_name is set), the following side effects occur:
A 5-second sleep is inserted via time.sleep(5) to allow the SLURM accounting database to synchronise before sacct is queried.
A statistics dictionary is appended to history. The dictionary contains all keys from get_SLURM_stats() (see its return-value documentation) plus the following four keys added by this method:
runner_class (str) – __class__.__name__ of the wrapped runner (e.g. "CalibySequenceDesign").
prefix (str) – The prefix argument passed to this call.
total_python_wall_sec (float) – Total elapsed wall-clock time in seconds from Python’s perspective (Phase 1 → Phase 3), rounded to 2 decimal places. Encompasses ProtFlow setup, SLURM queue wait, cluster execution, and result post-processing.
overhead_plus_queue_sec (float) – total_python_wall_sec minus runtime_sec from SLURM, rounded to 2 decimal places. Approximates the combined cost of ProtFlow overhead and scheduler queue wait. May be negative in rare cases due to clock skew between the login node and compute nodes, or rounding in sacct.
The SLURM job name is appended to job_ids.
<prefix>_stats.csv and <prefix>_job_ids.txt are written (or overwritten) in the current working directory via report().
- Warns:
logging.WARNING – Emitted when the resolved jobstarter is not an instance of SbatchArrayJobstarter. Message format: "Stats skipped: <type> does not support SLURM accounting.". Profiling is skipped entirely and the unmodified poses are returned immediately.
- Return type:
Poses
Notes
The jobstarter resolution priority is: argument → self.runner.jobstarter → poses.default_jobstarter. This mirrors the fallback chain used by most ProtFlow runners and ensures that the correct jobstarter is identified for stat collection even when it was set on the runner at construction time.
total_python_wall_sec includes SLURM queue wait time because the wrapped runner calls start() with wait=True, blocking until all array tasks complete before returning.
If last_job_name is None (e.g. the jobstarter was never used to submit a job), the stats-collection block is skipped entirely and history is not updated, even though the jobstarter type check passes.
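The three-phase timing and the derivation of overhead_plus_queue_sec can be illustrated without SLURM. In the sketch below, timed_call is a hypothetical helper and slurm_runtime_sec is a stand-in for the runtime_sec value that get_SLURM_stats() is described as returning; nothing here queries sacct.

```python
import time


def timed_call(run_fn, slurm_runtime_sec, *args, **kwargs):
    """Time run_fn and derive the two wall-clock keys described above."""
    t_start = time.perf_counter()        # Phase 1: wrapper start
    result = run_fn(*args, **kwargs)     # Phase 2: wrapped runner executes
    t_end = time.perf_counter()          # Phase 3: wrapper end

    total = round(t_end - t_start, 2)
    record = {
        "total_python_wall_sec": total,
        # Everything that SLURM did not account for as task runtime
        "overhead_plus_queue_sec": round(total - slurm_runtime_sec, 2),
    }
    return result, record
```

Because the two clocks are independent, the difference can come out slightly negative, which matches the caveat given for overhead_plus_queue_sec.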
Examples
Basic timed run:
timed = SbatchArrayRunnerTimer(CalibySequenceDesign())
poses = timed.run(
    poses,
    prefix="sd_round1",
    nseq=10,
    jobstarter=SbatchArrayJobstarter(max_cores=50),
)
print(timed.history[-1]["total_python_wall_sec"])    # e.g. 312.45
print(timed.history[-1]["overhead_plus_queue_sec"])  # e.g. 18.72
print(timed.history[-1]["runner_class"])             # "CalibySequenceDesign"
print(timed.history[-1]["state"])                    # "COMPLETED"
Passing runner-specific kwargs transparently:
timed = SbatchArrayRunnerTimer(LigandMPNN())
poses = timed.run(
    poses,
    prefix="mpnn",
    nseq=20,
    model_type="ligand_mpnn",
    fixed_residues_col="binding_site",
)
Non-SLURM jobstarter (profiling skipped, poses still returned):
from protflow.jobstarters import LocalJobStarter

poses = timed.run(poses, prefix="local_test", jobstarter=LocalJobStarter())
# Logs: WARNING - Stats skipped: <class 'LocalJobStarter'>
#       does not support SLURM accounting.
# timed.history is unchanged.
- Parameters:
runner (Runner)
- protflow.runners.col_in_df(df, column)[source]
Checks if a column exists in a DataFrame.
This function verifies whether a specified column is present in the given DataFrame. If the column is not found, it raises a KeyError.
- Parameters:
df (pandas.DataFrame) – The DataFrame to be checked.
column (str) – The name of the column to be verified.
- Raises:
KeyError – If the specified column is not found in the DataFrame.
- Return type:
None
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_in_df(df, 'A')  # No error raised
>>> col_in_df(df, 'C')  # Raises KeyError
Traceback (most recent call last):
...
KeyError: 'Could not find C in poses dataframe! Are you sure you provided the right column name?'
- protflow.runners.expand_options_flags(options_str, sep='--')[source]
Simple parsing function to parse options and flags from an input string.
Splits an input string into options and flags only based on a specified separator! If your command has more complex patterns in its options, then switch to “regex_expand_options_flags”. Options are key-value pairs, while flags are standalone keys without values.
- Parameters:
- Returns:
A tuple containing a dictionary of options and a set of flags.
- Return type:
tuple[dict,set]
Examples
>>> options_str = "--width 800 --height 600 --verbose"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}

>>> options_str = "--color=blue --debug --timeout=30"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'color': 'blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
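Separator-only splitting, as described for this function, might look roughly like the sketch below. It is not the ProtFlow source: expand_options_flags_sketch is a hypothetical name, and the choice to split key and value at the first "=" or whitespace is an assumption inferred from the examples.

```python
import re


def expand_options_flags_sketch(options_str, sep="--"):
    """Split a command-line style string into (options dict, flags set)."""
    options, flags = {}, set()
    for chunk in options_str.split(sep):
        chunk = chunk.strip()
        if not chunk:
            continue  # text before the first separator, or repeated separators
        # Key and value are divided by the first '=' or run of whitespace
        parts = re.split(r"[=\s]+", chunk, maxsplit=1)
        if len(parts) == 1:
            flags.add(parts[0])           # standalone key: a flag
        else:
            options[parts[0]] = parts[1]  # key with value: an option
    return options, flags
```

Because the string is split on every occurrence of the separator, a value that itself contains "--" would be cut apart, which is the case the regex-based variant is meant to handle.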
- protflow.runners.options_flags_to_string(options, flags, sep='--', no_quotes=False)[source]
Converts options dictionary and flags list into a single string.
This function combines a dictionary of options and a list of flags into a single command-line style string.
- Parameters:
options (dict) – A dictionary of options, where keys are option names and values are option values.
flags (list) – A list of flags (standalone options without values).
sep (str, optional) – The separator used to distinguish different options and flags (default is “--”).
no_quotes (bool, optional) – If True, disables the quoting of command-line arguments that contain whitespace (default is False). For example, an option such as “--my_list='1 4 6 14'” should keep its value quoted; with no_quotes=True it is written as “--my_list=1 4 6 14”, which can cause errors.
- Returns:
A string representation of the combined options and flags.
- Return type:
Examples
>>> options = {'width': '800', 'height': '600'}
>>> flags = ['verbose', 'debug']
>>> options_flags_to_string(options, flags)
" --width=800 --height=600 --verbose --debug"

>>> options = {'color': 'dark blue', 'timeout': '30'}
>>> flags = ['force']
>>> options_flags_to_string(options, flags)
" --color='dark blue' --timeout=30 --force"
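The quoting behaviour controlled by no_quotes can be sketched as follows. This is an illustration with a hypothetical function name; the leading space and the key=value form are taken from the examples above, while the rule "quote any value that contains whitespace" is an assumption.

```python
def options_flags_to_string_sketch(options, flags, sep="--", no_quotes=False):
    """Join options and flags into one command-line style string."""

    def format_value(value):
        value = str(value)
        # Quote values with whitespace so the shell keeps them as one token
        if not no_quotes and any(ch.isspace() for ch in value):
            return f"'{value}'"
        return value

    parts = [f"{sep}{key}={format_value(val)}" for key, val in options.items()]
    parts += [f"{sep}{flag}" for flag in flags]
    # Each part is prefixed with a space, giving the leading space seen above
    return "".join(f" {part}" for part in parts)
```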
- protflow.runners.parse_generic_options(options, pose_options, sep='--')[source]
Parses generic options and pose-specific options from two input strings, combining them into a single dictionary of options and a list of flags. Pose-specific options overwrite generic options in case of conflicts. Options are expected to be separated by a specified separator within each input string, with options and their values separated by spaces.
- Parameters:
options (str) – A string of generic options, where different options are separated by the specified separator and each option’s value (if any) is separated by space.
pose_options (str) – A string of pose-specific options, formatted like the options parameter. These options take precedence over generic options.
sep (str, optional) – The separator used to distinguish between different options in both input strings. Defaults to “--”.
- Returns:
A 2-element tuple where the first element is a dictionary of merged options (key-value pairs) and the second element is a list of unique flags (options without values) from both input strings.
- Return type:
tuple
Examples
>>> parse_generic_options("--width 800 --height 600", "--color blue --verbose")
({'width': '800', 'height': '600', 'color': 'blue'}, ['verbose'])
This function internally utilizes a helper function expand_options_flags to process each input string separately before merging the results, ensuring that pose-specific options and flags are appropriately prioritized and duplicates are removed.
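The merge step can be sketched on its own, assuming both input strings have already been split into (options, flags) pairs. The helper name merge_options_sketch is hypothetical, and flag order is kept by first appearance here only to make the result deterministic.

```python
def merge_options_sketch(generic, pose_specific):
    """Merge two (options, flags) pairs; pose-specific options win."""
    generic_opts, generic_flags = generic
    pose_opts, pose_flags = pose_specific

    # Later keys overwrite earlier ones, so pose-specific values take priority
    merged_opts = {**generic_opts, **pose_opts}
    # dict.fromkeys removes duplicate flags while keeping first-seen order
    merged_flags = list(dict.fromkeys([*generic_flags, *pose_flags]))
    return merged_opts, merged_flags
```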
- protflow.runners.prepend_cmd(cmds, pre_cmd)[source]
Prepends a single command to all commands in a list.
- Parameters:
cmds (list[str]) – A list of commands, where all elements are strings.
pre_cmd (str) – A string containing a command, which should be prepended to all commands in the commands list.
- Returns:
A list of all commands with the additional command prepended to each.
- Return type:
list[str]
Examples
>>> cmds = ["run_inference.sh pose_0001.pdb", "run_inference.sh pose_0002.pdb"]
>>> pre_cmd = "conda init"
>>> prepend_cmd(cmds, pre_cmd)
['conda init; run_inference.sh pose_0001.pdb', 'conda init; run_inference.sh pose_0002.pdb']
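A one-line list comprehension is enough to reproduce the behaviour shown in the example. The sketch below uses a hypothetical function name and assumes the "; " joiner that appears in the example output.

```python
def prepend_cmd_sketch(cmds, pre_cmd):
    """Return a new list with pre_cmd chained in front of every command."""
    # "; " runs the second command regardless of the first one's exit status
    return [f"{pre_cmd}; {cmd}" for cmd in cmds]
```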
- protflow.runners.regex_expand_options_flags(options_str, sep='--')[source]
Parses options and flags from an input string using regular expressions.
This function uses regular expressions to split an input string into options and flags. It ensures that separators within quotes are not split.
- Parameters:
- Returns:
A tuple containing a dictionary of options and a set of flags.
- Return type:
tuple[dict,set]
Examples
>>> options_str = '--width 800 --height 600 --verbose'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}

>>> options_str = '--color="dark blue" --debug --timeout=30'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'color': 'dark blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
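One way to avoid splitting on separators inside quotes is a lookahead that only accepts a separator when the rest of the string contains balanced quotes. The sketch below shows that idea; it is not the regular expression ProtFlow uses, and regex_expand_sketch is a hypothetical name.

```python
import re


def regex_expand_sketch(options_str, sep="--"):
    """Split on sep outside quotes and return (options dict, flags set)."""
    # The lookahead requires that everything up to the end of the string is
    # made of unquoted characters or complete '...' / "..." groups, which
    # fails for a separator that sits inside an open quote.
    pattern = re.escape(sep) + r"""(?=(?:[^'"]|'[^']*'|"[^"]*")*$)"""
    options, flags = {}, set()
    for chunk in re.split(pattern, options_str):
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = re.split(r"[=\s]+", chunk, maxsplit=1)
        if len(parts) == 1:
            flags.add(parts[0])
            continue
        value = parts[1].strip()
        # Strip one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        options[parts[0]] = value
    return options, flags
```

With this approach an input such as --name="a--b" --verbose keeps a--b intact as the value of name, whereas plain separator splitting would break it apart.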
Module contents
Package initialization