ProtFlow documentation

Subpackages

protflow.config_template module

This module contains the paths to all tools integrated in ProtFlow. PRE_CMD entries are commands that are run before the runner is executed (e.g., when a specific module must be imported for the environment to work).

protflow.jobstarters module

jobstarters

This module, jobstarters, provides a set of classes and methods to facilitate the submission and management of computing jobs on various job scheduling systems. JobStarters are passed to Runner objects in their .run() methods to facilitate a standardized execution of commands generated by the Runner. JobStarters can also be executed outside of Runner classes as is shown in the examples.

The JobStarter base class defines the methods that subclasses must implement to start jobs and wait for their completion.

Overview

The module includes the following classes and methods:

Classes

JobStarter: An abstract base class that defines the interface for all jobstarters.
SbatchArrayJobstarter: A concrete implementation of JobStarter for managing SLURM job arrays.
LocalJobStarter: A concrete implementation of JobStarter for managing local jobs.

Usage

To use a jobstarter, instantiate an appropriate subclass (e.g., SbatchArrayJobstarter) and call its start method with the desired commands and options. Use the wait_for_job method if you need to wait for job completion.

Example

>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

Note

This module is designed to be extended with additional jobstarters for different scheduling systems as needed. If you want to implement your own JobStarter and need assistance, please contact any of the authors of ProtFlow. We are happy about every contribution!

class protflow.jobstarters.JobStarter(max_cores=None)[source]

Bases: object

Abstract base class for job starters.

This class defines the interface for all job starters. Subclasses should implement methods to start jobs and wait for their completion. It also includes a method to set the maximum number of cores available for the jobs.

Examples

This class is designed to be extended by other classes that implement specific job scheduling systems.

Example subclass implementation:

class CustomJobStarter(JobStarter):
    def start(self, cmds, jobname, wait, output_path):
        # Implementation for starting jobs
        pass

    def wait_for_job(self, jobname, interval):
        # Implementation for waiting for job completion
        pass

Parameters:: max_cores (int)

__init__(max_cores=None)[source]

Initializes the JobStarter with an optional maximum number of cores.

Parameters:: max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is None.

set_last_error_message(error_path, read_bytes=16384)[source]

Saves content of an error logfile.

Parameters:

error_path (str) – The path to the error logfile.
read_bytes (int, optional) – Defines how many bytes of the log file should be read (starting from the back). Default is 16384.
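Reading a log file "starting from the back" can be sketched as follows. This is a minimal illustration and not ProtFlow's implementation: the helper name read_log_tail and the decision to decode with errors="replace" are assumptions made for this example.

```python
import os


def read_log_tail(error_path: str, read_bytes: int = 16384) -> str:
    """Return up to the last `read_bytes` bytes of a file as text (hypothetical helper)."""
    with open(error_path, "rb") as handle:
        # Find the file size, then jump back at most `read_bytes` from the end.
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(size - read_bytes, 0))
        # Assumption: undecodable bytes are replaced rather than raising an error.
        return handle.read().decode("utf-8", errors="replace")
```

Reading only the tail keeps memory use bounded even when an error log has grown very large.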

set_max_cores(cores)[source]

Sets the maximum number of cores available for the jobs.

Parameters:: cores (int) – The maximum number of cores to set.
Return type:: None

start(cmds, jobname, wait, output_path)[source]

Submits a list of commands as jobs to the scheduling system.

Parameters:

cmds (list) – A list of commands to be submitted as jobs.
jobname (str) – The name of the job.
wait (bool) – Whether to wait for the job to complete before proceeding.
output_path (str) – The path where output files should be stored.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

Return type:

None

wait_for_job(jobname, interval)[source]

Waits for a job to complete before proceeding.

Parameters:

jobname (str) – The name of the job to wait for.
interval (float) – The interval (in seconds) at which to check the job status.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

Return type:

None
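The interface described above can be illustrated with a self-contained sketch. JobStarterSketch is a stand-in written for this example (it is not the ProtFlow class), and SerialJobStarter is a hypothetical subclass that simply runs the commands one after another with subprocess.run().

```python
import subprocess


class JobStarterSketch:
    """Stand-in for the abstract base class described above (not the ProtFlow class)."""

    def __init__(self, max_cores=None):
        self.max_cores = max_cores

    def set_max_cores(self, cores):
        self.max_cores = cores

    def start(self, cmds, jobname, wait, output_path):
        raise NotImplementedError("start() must be implemented by a subclass.")

    def wait_for_job(self, jobname, interval):
        raise NotImplementedError("wait_for_job() must be implemented by a subclass.")


class SerialJobStarter(JobStarterSketch):
    """Hypothetical subclass that runs commands one after another."""

    def start(self, cmds, jobname, wait=True, output_path="./"):
        # Each command runs to completion before the next one starts.
        self.returncodes = [
            subprocess.run(cmd, shell=True, check=False).returncode for cmd in cmds
        ]

    def wait_for_job(self, jobname, interval=0):
        # Nothing to wait for: start() already blocks until all commands finish.
        return None
```

Because both methods raise NotImplementedError in the base class, a subclass that forgets to override one of them fails loudly on first use.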

class protflow.jobstarters.LocalJobStarter(max_cores=1)[source]

Bases: JobStarter

Jobstarter that runs jobs locally using subprocess.run().

This class extends the JobStarter base class to provide functionality for running jobs locally on the machine. It handles the execution of commands using subprocesses, manages the maximum number of concurrent processes, and captures the output and error logs for each command.

Parameters:: max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.
Raises:: ProcessError – If a subprocess crashes during execution.

Examples

Example usage:

>>> from protflow.jobstarters import LocalJobStarter
>>> job_starter = LocalJobStarter(max_cores=2)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

__init__(max_cores=1)[source]

Initializes the LocalJobStarter with an optional parameter for maximum cores.

Parameters:: max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.

start(cmds, jobname, wait=True, output_path='./')[source]

Submits a list of commands to be run locally, managing the execution and logging of each command.

Parameters:

cmds (list) – List of commands to be executed locally.
jobname (str) – Name of the job.
wait (bool, optional) – Whether to wait for all commands to complete before returning. Default is True.
output_path (str, optional) – Path where output files should be stored. Default is “./”.

Raises:

ProcessError – If a subprocess crashes during execution.

Return type:

None
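Running commands locally while capping the number of concurrent processes can be sketched as below. This is an illustration under stated assumptions, not the LocalJobStarter implementation: the function name run_local, the log file naming scheme, and the poll_interval parameter are hypothetical, and the sketch returns exit codes instead of raising an error on failure.

```python
import os
import subprocess
import time


def run_local(cmds, jobname, output_path="./", max_cores=1, poll_interval=0.05):
    """Run shell commands with at most `max_cores` running at once (hypothetical helper).

    Returns the exit codes in the order of `cmds`.
    """
    os.makedirs(output_path, exist_ok=True)
    pending = list(enumerate(cmds))
    running = []  # tuples of (index, process, stdout handle, stderr handle)
    returncodes = {}

    while pending or running:
        # Fill free slots with new subprocesses, one log pair per command.
        while pending and len(running) < max_cores:
            idx, cmd = pending.pop(0)
            out = open(os.path.join(output_path, f"{jobname}_{idx}.out"), "w")
            err = open(os.path.join(output_path, f"{jobname}_{idx}.err"), "w")
            proc = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)
            running.append((idx, proc, out, err))

        # Collect finished subprocesses and close their log files.
        still_running = []
        for idx, proc, out, err in running:
            if proc.poll() is None:
                still_running.append((idx, proc, out, err))
            else:
                out.close()
                err.close()
                returncodes[idx] = proc.returncode
        running = still_running

        if running:
            time.sleep(poll_interval)

    return [returncodes[i] for i in range(len(cmds))]
```

A caller could inspect the returned exit codes and raise an exception for any non-zero value to mirror the error behavior described above.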

wait_for_job(jobname, interval)[source]

(No-op) Method for waiting for started jobs.

Parameters:

jobname (str) – Name of the job to wait for.
interval (float) – Interval (in seconds) at which to check the job status.

Return type:

None

class protflow.jobstarters.SbatchArrayJobstarter(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]

Bases: JobStarter

Jobstarter that manages the submission of job arrays to SLURM clusters.

This class extends the JobStarter base class to provide functionality specific to SLURM job arrays. It handles tasks such as generating command files, submitting jobs using sbatch, and waiting for job completion. It also supports options for GPU usage and automatic cleanup of command files after job completion.

Parameters:

max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.
remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.
options (str or list, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.
gpus (bool, optional) – Whether to use GPUs for the job. Default is False.
batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Raises:

TypeError – If the options parameter is not a string or list.

Examples

Example usage:

>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True, options="--time=10:00", gpus=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

__init__(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]

Initializes the SbatchArrayJobstarter with optional parameters.

Parameters:

max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.
remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.
options (str or list, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.
gpus (bool, optional) – Whether to use GPUs for the job. Default is False.
batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Note

The options parameter must be set when the Jobstarter is created, not when the .start function is executed.

parse_options(options)[source]

Parses the SBATCH options.

Parameters:: options (object) – SBATCH options in string or list format.
Returns:: Parsed SBATCH options.
Return type:: str
Raises:: TypeError – If the options parameter is not a string or list.
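The documented contract (string or list in, string out, TypeError otherwise) can be sketched as follows. The helper name parse_sbatch_options is hypothetical, and returning an empty string for None is an assumption of this example rather than documented behavior.

```python
def parse_sbatch_options(options):
    """Normalize SBATCH options to a single string (hypothetical helper)."""
    if options is None:
        # Assumption: no options means an empty option string.
        return ""
    if isinstance(options, str):
        return options.strip()
    if isinstance(options, list):
        # Join list entries with single spaces.
        return " ".join(str(opt).strip() for opt in options)
    raise TypeError(
        f"options must be a string or a list, got {type(options).__name__}."
    )
```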

set_options(options, gpus)[source]

Sets the SBATCH options.

Parameters:

options (object) – SBATCH options in string or list format.
gpus (int) – Number of GPUs to be used per node.

Return type:

None

start(cmds, jobname, wait=True, output_path='./', batch_cmds=None)[source]

Writes commands into a command file and starts an SBATCH job running the command file.

Parameters:

cmds (list) – List of commands to be executed as part of the job array.
jobname (str) – Name of the job.
wait (bool, optional) – Whether to wait for the job to complete before returning. Default is True.
output_path (str, optional) – Path where output files should be stored. Default is “./”.
batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Raises:

RuntimeError – If the SLURM submission fails.

Return type:

None
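The idea of writing commands into a command file and running them as a SLURM job array can be sketched as below. This is only an illustration: the helper build_array_submission, the command file name, and the exact sbatch invocation are assumptions and may differ from what SbatchArrayJobstarter does. The sketch builds the command string and submits nothing.

```python
import os
import shlex


def build_array_submission(cmds, jobname, output_path="./", max_cores=100, options=""):
    """Write a command file and return (cmdfile path, sbatch command string).

    Hypothetical helper: nothing is submitted here.
    """
    os.makedirs(output_path, exist_ok=True)
    cmdfile = os.path.join(output_path, f"{jobname}_cmds")
    with open(cmdfile, "w", encoding="utf-8") as handle:
        handle.write("\n".join(cmds) + "\n")

    # Array task i executes line i of the command file.
    wrapped = f'sed -n "${{SLURM_ARRAY_TASK_ID}}p" {shlex.quote(cmdfile)} | bash'
    log_prefix = os.path.join(output_path, jobname + "_%A_%a")
    parts = [
        "sbatch",
        f"--job-name={shlex.quote(jobname)}",
        # "%<n>" limits how many array tasks run at the same time.
        f"--array=1-{len(cmds)}%{max_cores}",
        f"--output={shlex.quote(log_prefix + '.out')}",
        f"--error={shlex.quote(log_prefix + '.err')}",
    ]
    if options:
        parts.append(options)
    parts.append(f"--wrap={shlex.quote(wrapped)}")
    return cmdfile, " ".join(parts)
```

Keeping the construction of the command separate from its execution makes the submission string easy to inspect or log before anything reaches the scheduler.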

wait_for_job(jobname, interval=5)[source]

Waits for SLURM jobs to be finished.

Parameters:

jobname (str) – Name of the job to wait for.
interval (float, optional) – Interval (in seconds) at which to check the job status. Default is 5.

Return type:

None
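Waiting for a job by polling at a fixed interval can be sketched as follows. The function names are hypothetical, and checking the queue with squeue is an assumption about how such a check could be done, not a description of the ProtFlow implementation. The check is injectable so the loop can be exercised without a SLURM installation.

```python
import subprocess
import time


def slurm_job_is_queued(jobname):
    """Return True while squeue still lists a job with this name (requires SLURM)."""
    result = subprocess.run(
        ["squeue", f"--name={jobname}", "--noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return bool(result.stdout.strip())


def wait_until_done(jobname, interval=5, is_queued=slurm_job_is_queued):
    """Block until `is_queued(jobname)` returns False; return the number of checks made."""
    polls = 1
    while is_queued(jobname):
        time.sleep(interval)
        polls += 1
    return polls
```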

protflow.jobstarters.add_timestamp(x)[source]

Adds a unique timestamp to a string using the time library.

This function appends a unique timestamp to the given string. The timestamp is generated using the current time, which ensures that the resulting string is unique in most cases. The timestamp is added as a suffix, separated by an underscore.

Parameters:: x (str) – The input string to which the timestamp will be added.
Returns:: The input string with a unique timestamp appended.
Return type:: str

Examples

>>> add_timestamp("jobname")
'jobname_1632417284'

Notes

The timestamp is derived from the current time in seconds since the epoch, with the fractional part of the seconds included to ensure higher precision and uniqueness.

protflow.jobstarters.get_SLURM_stats(job_name, start_time=None)[source]

Query sacct and return aggregated resource statistics for a SLURM job array.

Shells out to the SLURM sacct command to retrieve per-task timing and CPU accounting data for all tasks in the array job identified by job_name. The raw per-task records are aggregated into a single summary dictionary, which is returned to the caller.

Warning

This function must be called from the cluster login node or another host that has the sacct binary in its PATH and access to SLURM’s accounting database. Calling it from within a compute-node job step (e.g. inside a running SLURM batch script) will fail because sacct is not available on compute nodes.

Parameters:

job_name (str) – The SLURM job name to query (passed to sacct --name). This corresponds to the jobname argument supplied to start() and is stored in last_job_name after each submission.
start_time (str, optional) – ISO-8601 datetime string (YYYY-MM-DDTHH:MM:SS) passed to sacct --starttime to restrict results to jobs that began at or after this timestamp. When omitted, sacct returns all matching records regardless of age, which may cause false matches against stale jobs with the same name from earlier sessions. It is strongly recommended to pass the session_start attribute of the enclosing SbatchArrayRunnerTimer to avoid this.

Returns:

A dictionary containing aggregated statistics.

On success, keys include:

job_name (str): The job_name argument echoed back.
total_cpu_sec (int): Sum of CPUTimeRaw across all tasks.
avg_task_runtime_sec (float): Mean wall-clock elapsed time per task in seconds (2 decimal places).
max_task_runtime_sec (int): Wall-clock elapsed time of the longest-running task.
min_task_runtime_sec (int): Wall-clock elapsed time of the shortest-running task.
num_tasks (int): Total number of task records returned.
total_cpus_reserved (int): Sum of AllocCPUS across all tasks.
state (str): "COMPLETED" or "MIXED (<states>)".
queried_after (str or None): The start_time argument echoed back.

On failure, keys include:

job_name (str): The job_name argument echoed back.
error (str): Human-readable description of the failure.

Return type:

dict

Raises:

None – This function does not propagate exceptions. All errors are caught and returned as a dictionary with an error key.

Notes

The sacct command is invoked with -X (suppress sub-step records), --format JobName,ElapsedRaw,CPUTimeRaw,AllocCPUS,State, -n (no header), and -P (pipe-delimited output). The resulting fields are parsed by position.
ElapsedRaw is SLURM’s wall-clock elapsed time for each individual task in seconds; CPUTimeRaw is ElapsedRaw × AllocCPUS and reflects total CPU-core-seconds reserved (not necessarily consumed).
The command is executed as a shell string (shell=True) so that --starttime and other arguments with special characters are handled correctly by the system shell.
Empty or whitespace-only lines in sacct’s stdout are filtered before parsing.
The state aggregation logic is strict: "COMPLETED" is only returned when every task’s state is exactly "COMPLETED" (set equality). A single failed or cancelled task will produce a "MIXED" state.
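The parsing and aggregation described in these notes can be sketched on already-captured sacct output. The helper aggregate_sacct_lines is hypothetical, and the exact formatting of the "MIXED (<states>)" string (sorted, comma-separated) and of the error message are assumptions of this example.

```python
def aggregate_sacct_lines(lines, job_name, start_time=None):
    """Aggregate pipe-delimited sacct records (hypothetical helper).

    Expected field order per line: JobName|ElapsedRaw|CPUTimeRaw|AllocCPUS|State
    """
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip empty or whitespace-only lines
        _, elapsed, cpu_time, alloc_cpus, state = line.strip().split("|")[:5]
        records.append((int(elapsed), int(cpu_time), int(alloc_cpus), state))

    if not records:
        return {"job_name": job_name, "error": "No sacct records found."}

    elapsed = [rec[0] for rec in records]
    states = {rec[3] for rec in records}
    # Strict rule: COMPLETED only if every task reports exactly COMPLETED.
    if states == {"COMPLETED"}:
        state = "COMPLETED"
    else:
        state = f"MIXED ({', '.join(sorted(states))})"

    return {
        "job_name": job_name,
        "total_cpu_sec": sum(rec[1] for rec in records),
        "avg_task_runtime_sec": round(sum(elapsed) / len(elapsed), 2),
        "max_task_runtime_sec": max(elapsed),
        "min_task_runtime_sec": min(elapsed),
        "num_tasks": len(records),
        "total_cpus_reserved": sum(rec[2] for rec in records),
        "state": state,
        "queried_after": start_time,
    }
```

Since failures are reported through an error key instead of an exception, callers should check for that key before reading the statistics.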

Examples

Query statistics for a recently submitted job:

from protflow.jobstarters import get_SLURM_stats

stats = get_SLURM_stats("caliby_seqdes", start_time="2025-06-01T12:00:00")
print(stats)

protflow.jobstarters.split_list(input_list, element_length=None, n_sublists=None)[source]

Splits a list into nested sublists with specified lengths or number of sublists.

This function divides the input list into a nested list of sublists. The division can be based on the maximum length of each sublist or the desired number of sublists. Only one of the parameters, element_length or n_sublists, should be specified at a time.

Parameters:

input_list (list) – The list to be split into sublists.
element_length (int, optional) – The maximum length of each sublist. If specified, the input list will be split into sublists each having up to element_length elements.
n_sublists (int, optional) – The desired number of sublists. If specified, the input list will be divided into n_sublists sublists.

Returns:

A nested list containing the sublists.

Return type:

list

Raises:

ValueError – If both element_length and n_sublists are specified or if neither is specified.

Examples

Splitting a list into sublists of a specified maximum length:

>>> split_list([1, 2, 3, 4, 5, 6], element_length=2)
[[1, 2], [3, 4], [5, 6]]

Splitting a list into a specified number of sublists:

>>> split_list([1, 2, 3, 4, 5, 6], n_sublists=3)
[[1, 2], [3, 4], [5, 6]]

Notes

If n_sublists is specified and is greater than the length of the input list, the number of sublists will be equal to the length of the input list.
If neither element_length nor n_sublists is provided, or if both are provided, a ValueError will be raised.
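The splitting rules above can be sketched as follows. This is not the ProtFlow implementation: split_list_sketch is a hypothetical name, and the way leftover elements are distributed when the list does not divide evenly (earlier sublists get one extra element) is an assumption of this example.

```python
def split_list_sketch(input_list, element_length=None, n_sublists=None):
    """Split a list by maximum sublist length or by number of sublists (hypothetical)."""
    # Exactly one of the two options must be given.
    if (element_length is None) == (n_sublists is None):
        raise ValueError("Specify exactly one of element_length or n_sublists.")

    if element_length is not None:
        return [
            input_list[i:i + element_length]
            for i in range(0, len(input_list), element_length)
        ]

    # Never create more sublists than there are elements.
    n = min(n_sublists, len(input_list))
    if n == 0:
        return []
    base, extra = divmod(len(input_list), n)
    result, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        result.append(input_list[start:start + size])
        start += size
    return result
```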

protflow.poses module

poses Module

This module provides functionalities for handling and manipulating protein data within the ProtFlow framework. It focuses on managing protein data represented as Pandas DataFrames, allowing for efficient parsing, storage, and manipulation of protein data across various file formats. The module facilitates complex protein study workflows and integrates seamlessly with other components of the ProtFlow package.

Detailed Description

The poses module offers a robust class, Poses, designed to encapsulate the functionality necessary to manage protein data. It supports various operations such as setting up work directories, parsing protein data, and integrating outputs from different computational processes. The module ensures that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

Key Features

Parsing Protein Data: Supports reading protein data from various file formats like JSON, CSV, Pickle, Feather, and Parquet.
Data Storage and Retrieval: Allows storing and retrieving protein data in multiple formats, facilitating easy data management.
Integration with ProtFlow: Seamlessly integrates with ProtFlow’s job management components, enhancing its utility in distributed computing environments.
Advanced Data Manipulation: Provides functionalities to merge and prefix data from various sources, making it easier to handle complex datasets.
Flexible and Customizable: Users can customize the data handling processes through various parameters, enabling tailored data management solutions.

Usage

To use this module, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Further operations using poses_instance
poses_instance.save_scores('path/to/save/scores')
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')

Examples

Here is an example of how to initialize and use the Poses class for managing protein data:

from protflow.poses import Poses

# Create an instance of the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Perform various operations using the instance
poses_instance.set_work_dir('new/work/dir')
poses_instance.save_scores('path/to/save/scores', out_format='csv')
filtered_poses = poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>')

Further Details

Edge Cases: The module handles various edge cases such as empty pose lists and the need to overwrite previous results. It includes robust error handling and logging for easier debugging and verification.

Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.

Integration: The module integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to manage protein data within their computational workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.poses.Poses(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]

Bases: object

Poses Class

The Poses class within the ProtFlow package is designed for handling protein data, enabling the parsing, storage, and manipulation of protein data represented as Pandas DataFrames. This class facilitates the management of complex protein study workflows and integrates seamlessly with other components of the ProtFlow framework.

Detailed Description

The Poses class encapsulates the functionality necessary for comprehensive management of protein data. It supports various operations, including setting up work directories, parsing protein data from different sources, integrating outputs from different runners, and handling protein data in multiple file formats. This class is essential for users looking to streamline their protein data management within computational workflows.

Key Features

Work Directory Setup: Easily sets up and manages work directories for storing intermediate and final results.
Data Parsing: Parses protein data from various sources and formats, including JSON, CSV, Pickle, Feather, and Parquet.
Data Storage and Retrieval: Stores and retrieves protein data in multiple file formats, ensuring flexibility in data management.
Job Management Integration: Integrates with ProtFlow’s job management components, facilitating the handling of protein data in distributed computing environments.
Advanced Data Manipulation: Supports operations like merging, prefixing, and duplicating data, providing robust data manipulation capabilities.
Filtering and Scoring: Offers methods to filter protein data based on various criteria and calculate composite scores for better data analysis.
Pose Handling: Manages protein poses, including loading, saving, and converting between different formats (e.g., PDB to FASTA).

Usage

To use this class, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Set up the work directory
poses_instance.set_work_dir('path/to/new_work_dir')

# Parse and manipulate poses
poses_instance.set_poses(poses=my_protein_data)
poses_instance.save_scores('path/to/save/scores', out_format='csv')

# Filter poses
filtered_poses = poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')

# Calculate a composite score
poses_instance.calculate_composite_score(name='composite_score', scoreterms=['score1', 'score2'], weights=[0.5, 0.5], plot=True)

Further Details

Edge Cases: The class handles various edge cases, such as empty pose lists, the need to overwrite previous results, and handling multiline FASTA inputs.

Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.

Integration: The class integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

Error Handling: Includes robust error handling and logging for easier debugging and verification of data processing steps.

- `df`

A DataFrame to store protein data.

Type:: pd.DataFrame

- `work_dir`

The working directory for storing data and results.

Type:: str

- `storage_format`

The format for storing protein data (e.g., ‘json’, ‘csv’).

Type:: str

- `default_jobstarter`

The default job starter for managing jobs.

Type:: JobStarter

Notes

This class is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

__init__(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]

Initializes the Poses class with optional parameters for poses, working directory, storage format, glob suffix, and job starter.

Parameters:

poses (list, optional) – A list of paths to the protein data files to be managed. If not provided, an empty DataFrame is initialized.
work_dir (str, optional) – The working directory where intermediate and final results will be stored. If not provided, the current directory is used.
storage_format (str, optional) – The format used for storing protein data (default is ‘json’). Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.
glob_suffix (str, optional) – A suffix used for globbing multiple files. This allows for batch processing of files matching the given pattern.
jobstarter (JobStarter, optional) – An instance of the JobStarter class used to manage job submissions. The default is an instance of SbatchArrayJobstarter from the jobstarters module.

df

A DataFrame to store protein data.

Type:: pd.DataFrame

work_dir

The working directory for storing data and results.

Type:: str

storage_format

The format for storing protein data.

Type:: str

default_jobstarter

The default job starter for managing jobs.

Type:: JobStarter

Notes

This method initializes the Poses class and sets up various attributes required for managing protein data. It prepares the environment for subsequent data manipulation and analysis operations.

Example

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

calculate_composite_score(name, scoreterms, weights, plot=False, scale_output=False)[source]

Calculates a composite score from specified score columns, applying weights and normalization, and optionally generates a plot.

Parameters:

name (str) – The name of the new composite score column to be created.
scoreterms (list[str]) – The list of score columns to be included in the composite score.
weights (list[float]) – The list of weights corresponding to each score column.
plot (bool, optional) – If True, generates a plot of the composite score and the individual score terms (default is False).
scale_output (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).

Returns:

The updated Poses instance with the new composite score column.

Return type:

Poses

Raises:

ValueError – If the number of scoreterms and weights do not match.
TypeError – If any score column contains non-numeric values.

Further Details

This method calculates a composite score from multiple score columns by applying the specified weights and normalizing the columns. The normalization process involves subtracting the median and dividing by the standard deviation for each score column. Optionally, the composite score can be scaled to a range between 0 and 1.

The method ensures that each score column contains numeric values and applies the normalization process as follows:

1. Calculate the median and standard deviation of each score column.
2. Normalize the column by subtracting the median and dividing by the standard deviation.
3. Optionally scale the normalized values to a range between 0 and 1.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate a composite score
poses_instance.calculate_composite_score(
    name='composite_score',
    scoreterms=['score1', 'score2'],
    weights=[0.5, 0.5],
    plot=True,
    scale_output=True
)

Notes

The method ensures that the number of scoreterms and weights match.
Normalization helps in making the scores comparable by removing scale differences.
Generates a violin plot if the plot parameter is set to True, showing the distribution of the composite score and individual score terms.
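The normalization and weighting described for this method can be sketched on plain Python lists. The function name composite_score is hypothetical, and several details are assumptions of this example: the composite is taken as the weighted sum of the normalized columns, the sample standard deviation is used, the optional scaling is min-max scaling, and every column is assumed to have non-zero spread.

```python
import statistics


def composite_score(columns, weights, scale_output=False):
    """Combine score columns into one composite score (hypothetical sketch).

    `columns` is a list of equally long lists of numeric scores.
    """
    if len(columns) != len(weights):
        raise ValueError("Number of score terms and weights must match.")

    normalized = []
    for values in columns:
        median = statistics.median(values)
        std = statistics.stdev(values)  # sample standard deviation (assumption)
        normalized.append([(v - median) / std for v in values])

    # Assumption: the composite is the weighted sum of the normalized columns.
    composite = [
        sum(w * col[i] for w, col in zip(weights, normalized))
        for i in range(len(columns[0]))
    ]

    if scale_output:
        # Min-max scaling to the range [0, 1].
        low, high = min(composite), max(composite)
        composite = [(c - low) / (high - low) for c in composite]
    return composite
```

Normalizing each column before weighting prevents a score term with a large numeric range from dominating the composite.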

calculate_max_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the maximum value of the selected score column. If remove_layers is set, calculates the maximum value over poses grouped by the description column with the set number of index layers removed.

Parameters:

name (str) – The name of the new column where the maximum values will be stored.
score_col (str) – The name of the column from which to calculate the maximum value.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the maximum values added to the DataFrame.

Return type:

self

Raises:

TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the maximum values
poses_instance.calculate_max_score(
    name='max_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
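Grouping by the description with index layers removed can be sketched as follows. The helper names are hypothetical, and it is an assumption of this example that layers are removed from the end of the description and that every pose receives the value of its group. The same pattern applies to the other calculate_*_score methods by swapping the aggregation function.

```python
from collections import defaultdict
from statistics import mean


def strip_layers(description, remove_layers, sep="_"):
    """Drop the last `remove_layers` separator-delimited layers (hypothetical helper)."""
    if not isinstance(remove_layers, int):
        raise TypeError("remove_layers must be an integer.")
    if remove_layers == 0:
        return description
    return sep.join(description.split(sep)[:-remove_layers])


def grouped_score(descriptions, scores, func, remove_layers=None, sep="_"):
    """Apply `func` per group and return one value per pose (hypothetical helper)."""
    if remove_layers is None:
        # No grouping: all poses form a single group.
        keys = [None] * len(descriptions)
    else:
        keys = [strip_layers(d, remove_layers, sep) for d in descriptions]

    groups = defaultdict(list)
    for key, score in zip(keys, scores):
        groups[key].append(score)
    per_group = {key: func(values) for key, values in groups.items()}
    return [per_group[key] for key in keys]
```

For example, grouped_score(descriptions, scores, max, remove_layers=1) assigns each pose the maximum score among all poses that share its description once the last layer is removed.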

calculate_mean_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the mean score of the selected score column. If remove_layers is set, calculates mean scores over poses grouped by the description column with the set number of index layers removed.

Parameters:

name (str) – The name of the new column where the mean scores will be stored.
score_col (str) – The name of the column from which to calculate the mean scores.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the mean scores added to the DataFrame.

Return type:

self

Raises:

TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the mean score
poses_instance.calculate_mean_score(
    name='mean_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_median_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the median score of the selected score column. If remove_layers is set, calculates median scores over poses grouped by the description column with the set number of index layers removed.

Parameters:

name (str) – The name of the new column where the median scores will be stored.
score_col (str) – The name of the column from which to calculate the median scores.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the median scores added to the DataFrame.

Return type:

self

Raises:

TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the median score
poses_instance.calculate_median_score(
    name='median_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_min_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the minimum value of the selected score column. If remove_layers is set, calculates the minimum value over poses grouped by the description column with the set number of index layers removed.

Parameters:

name (str) – The name of the new column where the minimum values will be stored.
score_col (str) – The name of the column from which to calculate the minimum value.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the minimum values added to the DataFrame.

Return type:

self

Raises:

TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the minimum values
poses_instance.calculate_min_score(
    name='min_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)
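
The grouping that remove_layers triggers can be pictured with plain pandas. The following is a minimal sketch under stated assumptions, not ProtFlow's implementation: the helper name grouped_min and the sample descriptions are hypothetical, and it assumes that without remove_layers the minimum is taken over the whole column.

```python
import pandas as pd


def grouped_min(df, name, score_col, remove_layers=None, sep="_", skipna=False):
    """Hypothetical helper mirroring the documented behaviour of calculate_min_score."""
    if remove_layers is None:
        # Assumption: without grouping, every row receives the column-wide minimum.
        df[name] = df[score_col].min(skipna=skipna)
        return df
    if not isinstance(remove_layers, int):
        raise TypeError("remove_layers must be an integer")
    # Drop the last `remove_layers` index layers from each description.
    group_key = (
        df["poses_description"].str.split(sep).str[:-remove_layers].str.join(sep)
    )
    # Broadcast each group's minimum back onto the rows of that group.
    df[name] = df.groupby(group_key)[score_col].transform(
        lambda s: s.min(skipna=skipna)
    )
    return df


df = pd.DataFrame({
    "poses_description": ["a_0001_0001", "a_0001_0002", "b_0001_0001"],
    "score1": [3.0, 1.0, 2.0],
})
df = grouped_min(df, name="min_score1", score_col="score1", remove_layers=1)
```

Here the two poses derived from a_0001 share one group, so both rows receive that group's minimum, while b_0001_0001 forms a group of its own.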

calculate_std_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the standard deviation of the selected score column. If remove_layers is set, calculates standard deviations over poses grouped by the description column with the set number of index layers removed.

Parameters:

name (str) – The name of the new column where the standard deviations will be stored.
score_col (str) – The name of the column from which to calculate the standard deviation.
skipna (bool, optional) – Whether to skip NA/null values. Default is False.
remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.
sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the standard deviations added to the DataFrame.

Return type:

self

Raises:

TypeError – If remove_layers is not an integer.
ValueError – If score_col does not exist in the DataFrame.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the standard deviation
poses_instance.calculate_std_score(
    name='std_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

change_poses_dir(poses_dir, copy=False, overwrite=False)[source]

Changes the directory of the stored poses, with options to copy or overwrite existing poses.

Parameters:

poses_dir (str) – The new directory where the poses will be located.
copy (bool, optional) – If True, the poses will be copied to the new directory (default is False).
overwrite (bool, optional) – If True, existing files in the new directory will be overwritten (default is False).

Returns:

The updated Poses instance with poses located in the new directory.

Return type:

Poses

Further Details

This method updates the paths of the stored poses to a new directory. If the copy parameter is set to True, the poses are copied to the new directory. The overwrite parameter controls whether existing files in the new directory are overwritten.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Change the directory of the poses
poses_instance.change_poses_dir('path/to/new_poses_dir', copy=True, overwrite=True)

Notes

If copy is set to False, the method only updates the paths in the DataFrame without moving the files.
Raises a ValueError if the new directory does not exist or if the poses do not exist in the specified directory (when copy is False).
Ensures the integrity of the poses by verifying their existence in the new directory.

check_poses_df_integrity(df)[source]

Checks the integrity of the poses DataFrame, ensuring it contains necessary columns.

Parameters: df (pd.DataFrame) – The DataFrame to be checked for integrity.
Returns: The validated poses DataFrame.
Return type: pd.DataFrame
Raises: KeyError – If the DataFrame does not contain the mandatory columns ‘input_poses’, ‘poses’, and ‘poses_description’.

Further Details

This method verifies that the poses DataFrame contains the necessary columns required for proper functioning. It ensures that the DataFrame has ‘input_poses’, ‘poses’, and ‘poses_description’ columns, which are essential for various operations.

Example

from poses import Poses
import pandas as pd

# Initialize the Poses class
poses_instance = Poses()

# Create a sample DataFrame
sample_df = pd.DataFrame({
    'input_poses': ['path/to/pose1.pdb'],
    'poses': ['path/to/pose1.pdb'],
    'poses_description': ['pose1']
})

# Check the integrity of the DataFrame
validated_df = poses_instance.check_poses_df_integrity(sample_df)

Notes

The method raises a KeyError if any of the mandatory columns are missing.
Ensures that the DataFrame is properly structured for further data manipulation and analysis.
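
The check itself amounts to comparing the DataFrame's columns against the three mandatory names. A minimal sketch of that idea, assuming nothing about ProtFlow's internals (the helper name check_columns and the constant MANDATORY_COLS are hypothetical):

```python
import pandas as pd

# The three mandatory columns named in the documentation above.
MANDATORY_COLS = ("input_poses", "poses", "poses_description")


def check_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the documented integrity check."""
    missing = [col for col in MANDATORY_COLS if col not in df.columns]
    if missing:
        raise KeyError(f"Missing mandatory columns: {missing}")
    return df


complete = pd.DataFrame({
    "input_poses": ["path/to/pose1.pdb"],
    "poses": ["path/to/pose1.pdb"],
    "poses_description": ["pose1"],
})
validated = check_columns(complete)  # returned unchanged
```

Passing a DataFrame that lacks any of the three columns raises a KeyError naming the missing ones.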

check_prefix(prefix)[source]

Checks if the given prefix is already used in the poses DataFrame.

Parameters: prefix (str) – The prefix to be checked in the poses DataFrame.
Raises: KeyError – If the prefix is already used in the poses DataFrame.
Return type: None

Further Details

This method verifies whether the specified prefix is already in use within the poses DataFrame. It is useful for ensuring that new prefixes do not conflict with existing ones, maintaining data integrity.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Check if a prefix is already used
poses_instance.check_prefix('new_prefix')

Notes

The method raises a KeyError if the prefix is found in the DataFrame, indicating a conflict.
Ensures that new prefixes are unique and can be safely used for new columns or attributes.

convert_pdb_to_fasta(prefix, update_poses=False, chain_sep=':')[source]

Converts PDB pose files to FASTA format and optionally updates the poses. The paths to the generated FASTA files are saved in the poses DataFrame under the column <prefix>_fasta_location.

Parameters:

prefix (str) – The prefix used for naming the output FASTA files.
update_poses (bool, optional) – If True, updates the poses DataFrame to use the new FASTA files (default is False).
chain_sep (str, optional) – The separator used for chain identifiers in the FASTA file (default is “:”).

Raises:

RuntimeError – If the poses are not of type PDB.

Return type:

None

Further Details

This method converts PDB pose files to FASTA format and stores them in a directory named with the given prefix. It can also update the poses DataFrame to use the new FASTA files if specified.

Example

from poses import Poses

# Initialize the Poses class with some PDB poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Convert the PDB files to FASTA format
poses_instance.convert_pdb_to_fasta(prefix='converted', update_poses=True)

Notes

The method checks that the poses are of type PDB before conversion.
Creates a new directory within the working directory to store the FASTA files.
Logs the conversion process and verifies the creation of FASTA files.

convert_resselection_cols(resselection_col='import_resselection_cols')[source]

Converts per-row residue selection descriptors into ResidueSelection objects for the columns listed in a list-like selector column, mutating the DataFrame in place.

Parameters:

resselection_col (str, optional) – Name of the column that, for each row, contains a list/tuple of target column names to convert (default is import_resselection_cols). When reading from CSV, this field may be a stringified list (e.g., ['a','b']), which will be parsed automatically.

Returns:

This method modifies self.df in place and returns None. If resselection_col is not present in self.df, the method exits early.

Return type:

None

Raises:

KeyError – If a row’s value in resselection_col exists but is not a list or tuple (after optional string-to-list parsing).
ValueError – If parsing a stringified list with ast.literal_eval fails due to an invalid literal.
SyntaxError – If parsing a malformed stringified list triggers a syntax error.
TypeError – If constructing a ResidueSelection from a cell value raises a type error.

Further Details

For each row, the method reads the list of target column names from resselection_col and attempts to convert the corresponding cells:

If a target column listed for a row does not exist in self.df, a warning is logged and that column is skipped for the row.
If the target cell is already a ResidueSelection instance, it is left unchanged.
If the target cell is a str, it is converted via ResidueSelection(value) (useful for CSV imports).
If the target cell is a dict, it is converted via ResidueSelection(value, from_scorefile=True) (useful for JSON imports).
Empty selector lists are allowed and simply result in no action for that row.
Cells that are falsy (e.g., None, empty string, empty dict) are skipped.

Example

import pandas as pd
from protflow.poses import Poses

# Sample DataFrame where each row specifies which columns to convert
df = pd.DataFrame({
    "import_resselection_cols": [
        ["fixed_residues", "motif_residues"],  # row 0: convert two columns
        "['motif_residues']",                  # row 1: stringified list (from CSV)
        []                                     # row 2: nothing to convert
    ],
    "fixed_residues": [
        "A12,A34,A56",        # str -> ResidueSelection(str)
        None,                 # skipped
        "A1"
    ],
    "motif_residues": [
        {"residues":[["A",164],["A",165],["A",166],["A",167]]},  # dict -> ResidueSelection(dict, from_scorefile=True)
        "B5-B9",                           # str -> ResidueSelection(str)
        {}
    ]
})

poses = Poses(df)
poses.convert_resselection_cols()  # mutates poses.df in place

# After this call:
# - poses.df.loc[0, "fixed_residues"] is a ResidueSelection instance
# - poses.df.loc[0, "motif_residues"] is a ResidueSelection instance (from dict)
# - poses.df.loc[1, "motif_residues"] is a ResidueSelection instance
# - Row 2 remains unchanged due to empty selector and falsy cells

Notes

Missing target columns are not fatal; a warning is logged and processing continues.
When importing from CSV, stringified lists in resselection_col are parsed with ast.literal_eval; malformed strings will raise ValueError or SyntaxError.
ResidueSelection construction is delegated; any errors it raises will propagate.
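
The parsing of the selector column can be illustrated with the standard library alone. This is a minimal sketch of the documented rules (a stringified list is parsed with ast.literal_eval, and anything that is not a list or tuple afterwards raises KeyError). The helper name parse_selector is hypothetical and is not part of ProtFlow.

```python
import ast


def parse_selector(value):
    """Hypothetical helper: normalise one cell of the selector column."""
    if isinstance(value, str):
        # Raises ValueError or SyntaxError if the string is not a valid literal.
        value = ast.literal_eval(value)
    if not isinstance(value, (list, tuple)):
        raise KeyError(
            f"Expected a list or tuple of column names, got {type(value).__name__}"
        )
    return list(value)


from_csv = parse_selector("['motif_residues']")                    # stringified list
from_memory = parse_selector(["fixed_residues", "motif_residues"])  # already a list
empty = parse_selector([])                                          # nothing to convert
```

A string such as "motif_residues" (a bare name rather than a list literal) makes ast.literal_eval raise ValueError, while an unterminated list such as "['a'" raises SyntaxError.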

determine_pose_type(pose_col=None)[source]

Determines the file types of the poses based on their extensions.

Parameters:

pose_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).

Returns:

A list of unique file extensions found in the pose file paths.

Return type:

list

Further Details

This method extracts and identifies the file extensions of the pose file paths in the specified column. It returns a list of unique file extensions, which helps in understanding the types of files being managed.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Determine the pose file types
pose_types = poses_instance.determine_pose_type()

Notes

The method logs a warning if multiple file extensions are found.
If no file extensions are found, it logs a warning indicating the inability to determine file types.
Ensures that the returned list contains only unique file extensions.
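
Extension-based type detection can be sketched with os.path.splitext. The helper below is hypothetical and only illustrates the idea. Whether ProtFlow returns extensions with the leading dot, and in which order, is an assumption here.

```python
import os


def unique_extensions(paths):
    """Hypothetical helper: collect the distinct file extensions of `paths`."""
    extensions = {os.path.splitext(path)[1] for path in paths}
    # Paths without an extension carry no type information.
    extensions.discard("")
    return sorted(extensions)


same_type = unique_extensions(["path/to/pose1.pdb", "path/to/pose2.pdb"])
mixed = unique_extensions(["path/to/pose1.pdb", "path/to/seq1.fa"])
```

A result with more than one entry corresponds to the situation in which the method logs a warning about multiple file extensions.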

duplicate_poses(output_dir, n_duplicates, overwrite=False)[source]

Duplicates poses a specified number of times and saves them to an output directory.

Parameters:

output_dir (str) – The directory where the duplicated poses will be saved.
n_duplicates (int) – The number of duplicates to create for each pose.
overwrite (bool, optional) – Default is False.

Return type:

None

Further Details

This method creates multiple copies of each pose file and saves them to the specified output directory. The duplicated files are named with an incremented index to distinguish them.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Duplicate the poses
poses_instance.duplicate_poses(output_dir='path/to/duplicates', n_duplicates=3)

Notes

The method creates the output directory if it does not exist.
Ensures that the duplicated files have unique names by appending an index.
Logs the duplication process and verifies the creation of duplicate files.

filter_poses_by_rank(n, score_col, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None)[source]

Filters poses based on their rank in a specified score column, with options to handle layers and generate plots.

Parameters:

n (float) – The number of top-ranked poses to keep. If n < 1, it represents a fraction of the total poses.
score_col (str) – The column in the DataFrame containing the scores used for ranking.
group_col (str, optional) – Group dataframe by this column and filter individual groups.
remove_layers (int, optional) – The number of layers to remove from the pose descriptions before ranking. This helps in grouping similar poses.
layer_col (str, optional) – The column used for layer-based grouping of poses (default is “poses_description”).
sep (str, optional) – The separator used in the layer descriptions (default is “_”).
ascending (bool, optional) – If True, ranks poses in ascending order of scores; otherwise, in descending order (default is True).
prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.
plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).
plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.
overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).
storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).

Returns:

The updated Poses instance with filtered poses.

Return type:

Poses

Further Details

This method filters the poses DataFrame to retain only the top-ranked poses based on their scores. It supports fractional ranking, layer-based grouping, and optional plot generation for visualizing the filtering process. The filtered poses can be saved to a file with a specified prefix and storage format.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Filter poses by rank
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='top_poses', plot=True)

Notes

The method creates a filtered poses file and an optional plot in the specified working directory.
Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.
Logs the filtering process, including any errors or warnings related to the ranking criteria.
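
The core of rank-based filtering (sort by score, keep the first n rows, treat n < 1 as a fraction) can be sketched in pandas. This is an illustration under stated assumptions rather than ProtFlow's code: the helper name top_ranked is hypothetical, and rounding a fractional n up to a whole number of poses is an assumption.

```python
import math

import pandas as pd


def top_ranked(df, n, score_col, ascending=True):
    """Hypothetical helper: keep the n best rows, or a fraction of them if n < 1."""
    if n < 1:
        # Assumption: a fractional n is rounded up to a whole number of poses.
        n = math.ceil(n * len(df))
    return df.sort_values(score_col, ascending=ascending).head(int(n))


df = pd.DataFrame({
    "poses_description": ["p1", "p2", "p3", "p4"],
    "score": [0.9, 0.1, 0.5, 0.3],
})
best_two = top_ranked(df, n=2, score_col="score")     # absolute count
best_half = top_ranked(df, n=0.5, score_col="score")  # fraction of all poses
```

With ascending=True the lowest scores rank first, so both calls keep p2 and p4 in this sample.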

filter_poses_by_value(score_col, value, operator, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None, fail_on_empty=True)[source]

Filters poses based on a specified value in a score column, with options to generate plots.

Parameters:

score_col (str) – The column in the DataFrame containing the scores used for filtering.
value (float or int) – The value used as the threshold for filtering poses.
operator (str) – The comparison operator used for filtering (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’, ‘!=’).
prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.
plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).
plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.
overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).
storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).
fail_on_empty (bool, optional) – Default is True.

Returns:

The updated Poses instance with filtered poses.

Return type:

Poses

Raises:

ValueError – If all poses are removed based on the filtering criteria.

Further Details

This method filters the poses DataFrame based on a specified value in a score column, using the provided comparison operator. It supports optional plot generation for visualizing the filtering process and allows saving the filtered poses to a file with a specified prefix and storage format.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Filter poses by value
poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>', prefix='filtered_poses', plot=True)

Notes

The method creates a filtered poses file and an optional plot in the specified working directory.
Ensures that the DataFrame is properly filtered based on the provided criteria.
Logs the filtering process, including any errors or warnings related to the filtering criteria.
Raises a ValueError if the filtering criteria remove all poses, ensuring that the Poses instance retains valid data.
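
The operator strings listed above map naturally onto the functions in Python's operator module. The following sketch shows one way such a filter could look. The helper name filter_by_value and the lookup table are hypothetical, and tying the ValueError to fail_on_empty is an assumption about how that flag behaves.

```python
import operator as op

import pandas as pd

# Comparison strings accepted by the documented method, mapped to stdlib functions.
COMPARISONS = {
    ">": op.gt, ">=": op.ge, "<": op.lt, "<=": op.le, "=": op.eq, "!=": op.ne,
}


def filter_by_value(df, score_col, value, operator, fail_on_empty=True):
    """Hypothetical helper: keep rows whose score satisfies the comparison."""
    if operator not in COMPARISONS:
        raise KeyError(f"Unsupported operator: {operator!r}")
    mask = COMPARISONS[operator](df[score_col], value)
    filtered = df[mask]
    # Assumption: fail_on_empty decides whether an empty result is an error.
    if filtered.empty and fail_on_empty:
        raise ValueError("Filtering removed all poses.")
    return filtered


df = pd.DataFrame({
    "poses_description": ["p1", "p2", "p3"],
    "score": [0.2, 0.7, 0.9],
})
kept = filter_by_value(df, score_col="score", value=0.5, operator=">")
```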

get_pose(pose_description, all_models=False)[source]

Retrieves a pose structure based on its description.

Parameters:

pose_description (str) – The description of the pose to be retrieved.
all_models (bool, optional) – Whether to return all models in the input PDB (True) or only the first (False). If False, a Bio.PDB Model is returned; if True, a Bio.PDB Structure is returned. Default is False.

Returns:

The Bio.PDB Model or Structure object corresponding to the specified pose description.

Return type:

Bio.PDB.Model.Model or Bio.PDB.Structure.Structure

Raises:

KeyError – If the pose description is not found in the poses DataFrame.

Further Details

This method locates the pose file based on its description and loads it with Bio.PDB, returning the first Model by default or the full Structure if all_models is True. It is useful for accessing specific pose structures for further analysis or manipulation.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Retrieve a specific pose structure
pose_structure = poses_instance.get_pose('pose1')

Notes

The method uses the ‘poses_description’ column to locate the specified pose.
Ensures that the returned pose is loaded as a Bio.PDB object (Model or Structure) for further processing.

load_poses(poses_path)[source]

Loads poses from a specified file and updates the Poses instance.

Parameters:

poses_path (str) – The path to the file containing the poses to be loaded.

Returns:

The updated Poses instance with poses loaded from the specified file.

Return type:

Poses

Further Details

This method reads a file containing poses and updates the Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Load poses from a file
poses_instance.load_poses('path/to/poses.json')

Notes

The method supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.
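
Detecting the format from the file extension can be sketched as a lookup table of pandas readers. The mapping below is an assumption for illustration (in particular the exact extension strings), and the helper name load_table is hypothetical. The pandas reader functions themselves are real.

```python
import os
import tempfile

import pandas as pd

# Assumed extension-to-reader mapping for the formats named above.
READERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".pickle": pd.read_pickle,
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,
}


def load_table(path):
    """Hypothetical helper: pick a pandas reader based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise KeyError(f"Unsupported file extension: {ext!r}")
    return READERS[ext](path)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "poses.csv")
    pd.DataFrame(
        {"poses": ["path/to/pose1.pdb"], "poses_description": ["pose1"]}
    ).to_csv(csv_path, index=False)
    loaded = load_table(csv_path)
```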

parse_descriptions(poses=None)[source]

Parses descriptions from the provided pose file paths.

Parameters:

poses (list, optional) – A list of pose file paths from which descriptions are extracted.

Returns:

A list of descriptions parsed from the pose file paths.

Return type:

list

Further Details

This method extracts descriptions from the provided list of pose file paths. Descriptions are derived from the file names by stripping the directory path and file extension.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Parse descriptions from pose file paths
descriptions = poses_instance.parse_descriptions(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

Notes

This method is useful for generating a list of concise descriptions based on file names.
Ensures that descriptions are derived in a consistent format, suitable for use in data management and analysis.
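
Stripping the directory and the extension from a path is a two-step os.path operation. A minimal sketch of that derivation, with a hypothetical helper name:

```python
import os


def descriptions_from_paths(paths):
    """Hypothetical helper: file name without directory and extension."""
    return [os.path.splitext(os.path.basename(path))[0] for path in paths]


descriptions = descriptions_from_paths(["path/to/pose1.pdb", "path/to/pose2.pdb"])
```

Only the final extension is removed, so a name such as pose1_0001.pdb yields pose1_0001 with its index layer intact.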

parse_poses(poses=None, glob_suffix=None)[source]

Parses the input poses, which can be provided as a list or a directory with a glob suffix.

Parameters:

poses (Union[list, str], optional) – A list of file paths or a directory containing the protein data files. If not provided, an empty list is returned.
glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.

Returns:

A list of parsed pose file paths.

Return type:

list

Further Details

This method handles various input types for parsing poses. It can parse a list of file paths directly or glob files in a specified directory using a suffix. The method ensures that all specified files exist and raises appropriate errors if they do not.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Parse poses from a directory with a glob suffix
parsed_poses = poses_instance.parse_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')

Notes

Raises FileNotFoundError if any specified files do not exist.
Supports both single file and multiple file (via globbing) inputs.
Ensures that the returned list contains valid file paths.
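
The input handling described here (nothing, a list, a single file, or a directory plus glob suffix) can be sketched with the standard library. The helper name collect_pose_paths is hypothetical and the code only approximates the documented behaviour.

```python
import glob
import os
import tempfile


def collect_pose_paths(poses=None, glob_suffix=None):
    """Hypothetical helper: resolve `poses` into a list of existing file paths."""
    if poses is None:
        return []
    if isinstance(poses, str):
        # A string is a directory to glob in, or a single file if no suffix is given.
        if glob_suffix:
            paths = sorted(glob.glob(os.path.join(poses, glob_suffix)))
        else:
            paths = [poses]
    else:
        paths = list(poses)
    missing = [path for path in paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(f"Pose files not found: {missing}")
    return paths


with tempfile.TemporaryDirectory() as pose_dir:
    for filename in ("pose1.pdb", "pose2.pdb", "notes.txt"):
        open(os.path.join(pose_dir, filename), "w").close()
    matches = collect_pose_paths(pose_dir, glob_suffix="*.pdb")
    found = [os.path.basename(path) for path in matches]
```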

poses_list()[source]

Returns a list of pose file paths from the DataFrame.

Returns:

A list of pose file paths.

Return type:

list[str]

Further Details

This method extracts the pose file paths from the 'poses' column of the DataFrame and returns them as a list. It provides a convenient way to access the stored pose file paths.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Get the list of pose file paths
pose_paths = poses_instance.poses_list()

Notes

The method assumes that the ‘poses’ column exists in the DataFrame.
Provides a simple way to retrieve all pose file paths managed by the Poses instance.

reindex_poses(prefix, group_col=None, remove_layers=None, force_reindex=False, sep='_', overwrite=False)[source]

Removes index layers from poses. Saves reindexed poses to an output directory.

Parameters:

prefix (str) – The directory where the reindexed poses will be saved and the prefix for the DataFrame columns containing the original paths and descriptions.
group_col (str, optional) – The poses DataFrame column on which to group to create new descriptions. Must be a column in ‘poses_description’ or ‘poses’ format (e.g. from a previous state, before runners appended index layers).
remove_layers (int, optional) – The number of index layers to remove.
force_reindex (bool, optional) – Add a new index layer to all poses.
sep (str, optional) – The separator used to split the description column into layers.
overwrite (bool, optional) – Default is False.

Return type:

None

Further Details

This method removes index layers (_0001, _0002, etc.) from poses. If a group column is provided, the poses are assigned names according to the group. If remove_layers is above 0, the method subtracts the set number of layers from the description column and groups the poses accordingly. If force_reindex is True, it adds one index layer to all poses.

Notes

The method creates the output directory if it does not exist.
Raises a KeyError if both group_col and remove_layers are set.
Raises a RuntimeError if multiple poses with identical descriptions are found after index layer removal and force_reindex is False.
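
The interplay of remove_layers and force_reindex can be sketched on description strings alone. This is a simplified illustration, not ProtFlow's implementation: the helper name is hypothetical, file handling and group_col are left out, and the four-digit layer format is assumed from the _0001, _0002 pattern mentioned for this method.

```python
from collections import defaultdict


def reindex_descriptions(descriptions, remove_layers, sep="_", force_reindex=False):
    """Hypothetical helper: strip index layers, optionally adding a fresh one."""
    stripped = [sep.join(desc.split(sep)[:-remove_layers]) for desc in descriptions]
    has_duplicates = len(set(stripped)) != len(stripped)
    if has_duplicates and not force_reindex:
        raise RuntimeError("Identical descriptions after index layer removal.")
    if not force_reindex:
        return stripped
    # Number the poses within each stripped description: _0001, _0002, ...
    counters = defaultdict(int)
    reindexed = []
    for desc in stripped:
        counters[desc] += 1
        reindexed.append(f"{desc}{sep}{counters[desc]:04d}")
    return reindexed


unique = reindex_descriptions(["a_0001_0001", "a_0002_0001"], remove_layers=1)
forced = reindex_descriptions(
    ["a_0001_0003", "a_0001_0007"], remove_layers=1, force_reindex=True
)
```

In the second call both poses collapse to a_0001 once the last layer is removed, which would raise a RuntimeError without force_reindex. With it, they are renumbered consecutively.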

reset_poses(new_poses_col='input_poses', force_reset_df=False)[source]

Resets the poses DataFrame to the original input poses, with an option to force reset.

Parameters:

new_poses_col (str, optional) – The column in the DataFrame containing the new pose file paths (default is ‘input_poses’).
force_reset_df (bool, optional) – If True, forces a reset of the DataFrame even if the number of new poses does not match the original (default is False).

Further Details

This method resets the poses DataFrame to use the original input poses. It handles multiline FASTA inputs and ensures that the DataFrame structure is preserved or reset based on the force_reset_df parameter.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Reset the poses to the original input poses
poses_instance.reset_poses()

Notes

The method ensures that the new poses are unique and properly formatted.
Raises a RuntimeError if the number of new poses does not match the original and force_reset_df is False.
Logs warnings and information about the reset process, ensuring data integrity.

save_poses(out_path, poses_col='poses', overwrite=True)[source]

Saves the poses to a specified directory, with an option to overwrite existing files.

Parameters:

out_path (str) – The directory where the poses will be saved.
poses_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).
overwrite (bool, optional) – If True, existing files in the target directory will be overwritten (default is True).

Return type:

None

Further Details

This method saves the pose files to the specified directory. It copies the pose files from their current locations to the new directory, ensuring that the directory structure is maintained. The overwrite parameter controls whether existing files in the target directory are overwritten.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Save poses to a new directory
poses_instance.save_poses(out_path='path/to/new_poses_dir', overwrite=False)

Notes

The method ensures that the target directory exists, creating it if necessary.
If overwrite is set to False, the method skips saving poses that already exist in the target directory.
Logs the saving process, including any skipped files due to the overwrite setting.

save_scores(out_path=None, out_format=None)[source]

Saves the scores DataFrame to a specified file path in the desired format.

Parameters:

out_path (str, optional) – The file path where the scores will be saved. If not provided, the default scorefile path is used.
out_format (str, optional) – The format in which to save the scores. If not provided, the default storage format is used.

Return type:

None

Further Details

This method saves the scores DataFrame to the specified file path in the desired format. It ensures that the file name conforms to the specified format by appending the correct file extension if necessary.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Save scores to a specific path in CSV format
poses_instance.save_scores(out_path='path/to/scores.csv', out_format='csv')

Notes

Supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
The method automatically appends the correct file extension if it is not already present in the out_path.
Ensures that the scores are saved in a format suitable for further analysis and processing.
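
Appending the extension only when it is missing is a small string check. The sketch below assumes that the extension is simply the format name preceded by a dot. The helper name with_extension is hypothetical.

```python
def with_extension(out_path, out_format):
    """Hypothetical helper: make sure out_path ends with '.<out_format>'."""
    suffix = f".{out_format}"
    return out_path if out_path.endswith(suffix) else out_path + suffix


unchanged = with_extension("path/to/scores.csv", "csv")
extended = with_extension("path/to/scores", "json")
```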

set_jobstarter(jobstarter)[source]

Configures the job starter for managing job submissions.

Parameters: jobstarter (JobStarter) – An instance of the JobStarter class used to manage job submissions.
Return type: None

Further Details

This method sets the job starter for the Poses class, which is used to manage job submissions in distributed computing environments. It allows the user to specify a custom job starter for handling computational tasks.

Example

from poses import Poses
from protflow.jobstarters import LocalJobStarter

# Initialize the Poses class
poses_instance = Poses()

# Set a job starter (any JobStarter subclass can be used)
jobstarter = LocalJobStarter(max_cores=1)
poses_instance.set_jobstarter(jobstarter)

Notes

The job starter must be an instance of the JobStarter class or a subclass thereof.
This method enables customization of job management to suit specific computational workflows.

set_logger()[source]

Configures the logger for the Poses class.

Further Details

This method sets up the logging configuration for the Poses class. It creates a logger that outputs log messages to both the console and a log file in the working directory (if set). This aids in debugging and tracking the progress of data processing operations.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses(work_dir='path/to/work_dir')

# Set up the logger
poses_instance.set_logger()

Notes

The log file is named after the working directory and stored within it.
The logging level is set to INFO, and log messages include timestamps, logger names, log levels, and messages.

Return type: None

set_motif(motif_col)[source]

Sets a motif column in the poses DataFrame for further analysis.

Parameters:

motif_col (str) – The column in the DataFrame containing the motifs to be set.

Raises:

KeyError – If the specified motif column is not found in the poses DataFrame.
TypeError – If the objects in the specified motif column are not of type ResidueSelection.

Return type:

None

Further Details

This method sets a column in the poses DataFrame to be used as motifs for further analysis. The motifs must be instances of the ResidueSelection class.

Example

from poses import Poses
from protflow.residues import ResidueSelection

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Assume we have a column 'motifs' with ResidueSelection objects
poses_instance.set_motif('motifs')

Notes

The method ensures that the specified column exists and contains ResidueSelection objects.
Logs any errors encountered during the process for easier debugging and verification.

set_poses(poses=None, glob_suffix=None)[source]

Sets the poses for the Poses instance, parsing the input if necessary.

Parameters:

poses (Union[list, str, pd.DataFrame], optional) – A list of file paths, a directory containing the protein data files, or a DataFrame containing the poses. If not provided, an empty DataFrame is initialized.
glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.

Return type:

None

Further Details

This method initializes the poses for the Poses instance. It can accept various input types, including a list of file paths, a directory for globbing files, or a DataFrame. The method ensures that the poses are correctly parsed and set up for further processing.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set poses from a directory with a glob suffix
poses_instance.set_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')

# Set poses from a list of file paths
poses_instance.set_poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

Notes

If a DataFrame is provided, it is directly used as the poses DataFrame after integrity checks.
The method supports parsing multiline FASTA inputs and handles them appropriately.
Ensures that the poses DataFrame contains necessary columns for subsequent operations.

set_scorefile(work_dir)[source]

Sets the scorefile path for storing protein scores.

Parameters: work_dir (str) – The working directory where the scorefile will be stored. If the work directory is not set, the scorefile is stored in the current directory.
Return type: None

scorefile

The path to the scorefile where protein scores are stored.

Type: str

Notes

This method configures the path for the scorefile based on the provided working directory. If no working directory is specified, the scorefile is stored in the current directory.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the scorefile path
poses_instance.set_scorefile(work_dir='path/to/work_dir')

set_storage_format(storage_format)[source]

Sets the storage format for storing protein data.

Parameters: storage_format (str) – The format used for storing protein data. Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.
Raises: KeyError – If the provided storage format is not supported.
Return type: None

Notes

This method configures the storage format for protein data. It ensures that the format is one of the supported formats and raises an error if the format is invalid.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the storage format to 'csv'
poses_instance.set_storage_format('csv')

set_work_dir(work_dir, set_scorefile=True)[source]

Sets up and configures the working directory for storing data and results.

Parameters:

work_dir (str) – The working directory where data and results will be stored. If the directory does not exist, it will be created.
set_scorefile (bool, optional) – If True, also sets the path for the scorefile in the specified working directory (default is True).

Return type:

None

Further Details

This method creates the necessary subdirectories within the specified working directory to organize score files, filter results, and plots. It ensures that the required directory structure is in place for subsequent data management operations.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the working directory
poses_instance.set_work_dir('path/to/new_work_dir')

Notes

The method will log the creation of directories if they do not already exist.
If set_scorefile is set to True, the scorefile path will be configured within the working directory.

split_multiline_fasta(path, encoding='UTF-8')[source]

Splits a multiline FASTA file into individual FASTA files, each containing a single sequence.

Parameters:

path (str) – The path to the multiline FASTA file.
encoding (str, optional) – The encoding of the FASTA file (default is “UTF-8”).

Returns:

A list of file paths to the individual FASTA files.

Return type:

list[str]

Further Details

This method reads a multiline FASTA file and splits it into individual FASTA files, each containing a single sequence. The individual FASTA files are stored in a subdirectory named 'input_fastas_split' within the working directory.

Example

from poses import Poses

# Initialize the Poses class with a working directory
poses_instance = Poses(work_dir='path/to/work_dir')

# Split a multiline FASTA file
individual_fasta_paths = poses_instance.split_multiline_fasta('path/to/multiline.fasta')

Notes

The method creates a subdirectory named ‘input_fastas_split’ within the working directory to store the individual FASTA files.
The descriptions in the FASTA file are sanitized to replace special characters with underscores.
Raises an AttributeError if the working directory is not set.
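
The splitting step described above can be sketched as follows. This is a minimal illustration and not the ProtFlow implementation: the function name split_fasta_sketch, the '.fa' file extension, and the exact sanitizing rule (every character outside letters, digits, '_', '.', and '-' becomes an underscore) are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def split_fasta_sketch(path, out_dir, encoding="UTF-8"):
    """Write each record of a multi-sequence FASTA file to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = Path(path).read_text(encoding=encoding)
    out_paths = []
    # Every record starts with '>'; the chunk before the first '>' is dropped.
    for record in text.split(">")[1:]:
        lines = record.strip().splitlines()
        if not lines:
            continue
        # Assumed sanitizing rule: anything outside [A-Za-z0-9_.-] becomes '_'.
        description = re.sub(r"[^A-Za-z0-9_.-]", "_", lines[0].strip())
        # Join wrapped sequence lines into a single sequence string.
        sequence = "".join(line.strip() for line in lines[1:])
        out_path = out_dir / f"{description}.fa"
        out_path.write_text(f">{description}\n{sequence}\n", encoding=encoding)
        out_paths.append(str(out_path))
    return out_paths


# Usage with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "multi.fasta"
    fasta.write_text(">seq 1\nMKT\nAYI\n>seq|2\nGSH\n", encoding="UTF-8")
    split_paths = split_fasta_sketch(fasta, Path(tmp) / "input_fastas_split")
    split_names = [Path(p).name for p in split_paths]
    first_record = Path(split_paths[0]).read_text(encoding="UTF-8")
```

The sketch takes the output directory as an argument, whereas the documented method derives it from the working directory of the Poses instance.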

Parameters:

poses (list)
work_dir (str)
storage_format (str)
glob_suffix (str)
jobstarter (JobStarter)

protflow.poses.class_in_df(df, cls, out_col)[source]

Return a copy of df with a column listing, for each row, the names of columns whose values are instances of a given class (or classes).

If no cells in the DataFrame match cls, the function returns a copy of df without adding out_col. Empty DataFrames are returned unchanged. Elementwise checks use pandas.DataFrame.map() (pandas ≥ 2.2).

Parameters:

df (pandas.DataFrame) – Input DataFrame to inspect.
cls (type or tuple[type, ...]) – Class (or tuple of classes) to test against, as in isinstance(). Examples: dict or (dict, list).
out_col (str) – Name of the output column to add. Each entry will be a list[str] of column names whose values in that row are instances of cls. The column is only created if at least one match exists anywhere in df.

Returns:

A copy of df. If any matches are found, the copy contains an added column out_col with per-row lists of matching column names. If no matches are found (or df is empty), the copy is returned unchanged.

Return type:

pandas.DataFrame

Notes

This function does not mutate df; it returns a modified copy.
cls behaves exactly like the second argument to isinstance().
To convert the list results to a delimiter-separated string, you can post-process with: out[out_col] = out[out_col].apply('|'.join).

Examples

import pandas as pd
from poses import class_in_df

df = pd.DataFrame({
    'a': [1, {'x': 1}, 3],
    'b': [{'y': 2}, 5, [1, 2]],
    'c': ['hi', 'there', 'world'],
})

# Add a column listing, per row, the columns that hold dict instances
class_in_df(df, dict, 'resselector_cols')
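
To make the documented behavior concrete, the following is a small stand-alone sketch of the same idea. It is not the ProtFlow source: the name class_in_df_sketch is hypothetical, and it iterates over rows instead of using pandas.DataFrame.map() so that it does not depend on a particular pandas version.

```python
import pandas as pd


def class_in_df_sketch(df, cls, out_col):
    """Illustrative re-implementation; not the ProtFlow source."""
    out = df.copy()
    if out.empty:
        return out
    # For each row, collect the names of columns holding an instance of cls.
    matches = [
        [col for col in df.columns if isinstance(row[col], cls)]
        for _, row in df.iterrows()
    ]
    # Only add the column if at least one cell matched anywhere.
    if any(matches):
        out[out_col] = pd.Series(matches, index=df.index)
    return out


df = pd.DataFrame({
    "a": [1, {"x": 1}, 3],
    "b": [{"y": 2}, 5, [1, 2]],
    "c": ["hi", "there", "world"],
})
result = class_in_df_sketch(df, dict, "resselector_cols")
print(result["resselector_cols"].tolist())
# [['b'], ['a'], []]
```

The input DataFrame is left untouched, and a class with no matches (for example set) yields a copy without the extra column.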

protflow.poses.col_in_df(df, column)[source]

Checks if the specified column(s) exist in the DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame to be checked.
column (str or list[str]) – The column name or list of column names to check for existence in the DataFrame.

Raises:

KeyError – If any of the specified columns are not found in the DataFrame.

Return type:

None

Further Details

This function checks whether the specified column or list of columns exist in the given DataFrame. It is useful for ensuring that the DataFrame contains the necessary columns before performing further operations.

Example

import pandas as pd
from poses import col_in_df

# Create a sample DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6]
})

# Check if a column exists
col_in_df(df, 'col1')

# Check if multiple columns exist
col_in_df(df, ['col1', 'col2'])

Notes

The function raises a KeyError if any of the specified columns are not found in the DataFrame.
Ensures that the DataFrame contains the necessary columns for subsequent operations.
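
The check can be pictured with the short sketch below. The name col_in_df_sketch and the error message are assumptions for illustration; only the behavior (accept a string or a list, raise KeyError on a missing column, return None otherwise) is taken from the description above.

```python
import pandas as pd


def col_in_df_sketch(df, column):
    """Raise KeyError if any requested column is missing; return None otherwise."""
    columns = [column] if isinstance(column, str) else list(column)
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")


df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
col_in_df_sketch(df, ["col1", "col2"])  # passes silently

try:
    col_in_df_sketch(df, ["col1", "col3"])
    missing_detected = False
except KeyError:
    missing_detected = True
```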

protflow.poses.combine_dataframe_score_columns(df, scoreterms, weights, scale=False)[source]

Combines multiple score columns in a DataFrame into a single composite score, applying weights and normalization.

Parameters:

df (pd.DataFrame) – The DataFrame containing the score columns.
scoreterms (list[str]) – The list of score columns to be combined.
weights (list[float]) – The list of weights corresponding to each score column.
scale (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).

Returns:

The composite score as a pandas Series.

Return type:

pd.Series

Raises:

ValueError – If the number of scoreterms and weights do not match.
TypeError – If any score column contains non-numeric values.

Further Details

This function combines multiple score columns in a DataFrame into a single composite score. Each score column is normalized by subtracting the median and dividing by the standard deviation. The normalized scores are then weighted according to the specified weights and summed to create the composite score. Optionally, the composite score can be scaled to a range between 0 and 1.

Example

import pandas as pd
from poses import combine_dataframe_score_columns

# Create a sample DataFrame
data = {
    'score1': [10, 20, 30, 40, 50],
    'score2': [15, 25, 35, 45, 55]
}
df = pd.DataFrame(data)

# Combine score columns into a composite score
composite_score = combine_dataframe_score_columns(df, scoreterms=['score1', 'score2'], weights=[0.5, 0.5], scale=True)

Notes

Raises a ValueError if the number of scoreterms and weights do not match.
Normalization makes the scores comparable by removing scale differences.
The optional scaling step ensures that the composite score remains within a standardized range.
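
The normalize, weight, and sum procedure described under Further Details can be sketched as follows. This is an illustration under stated assumptions, not the ProtFlow implementation: the name combine_scores_sketch is hypothetical, the standard deviation is the pandas default (sample standard deviation), and constant columns are not handled.

```python
import pandas as pd


def combine_scores_sketch(df, scoreterms, weights, scale=False):
    """Weighted sum of median-centered, std-scaled columns (illustrative only)."""
    if len(scoreterms) != len(weights):
        raise ValueError("scoreterms and weights must have the same length")
    composite = pd.Series(0.0, index=df.index)
    for term, weight in zip(scoreterms, weights):
        col = df[term]
        # Normalize: subtract the median, divide by the standard deviation.
        # A constant column (std == 0) is not handled in this sketch.
        composite += weight * (col - col.median()) / col.std()
    if scale:
        # Min-max scale the composite score to the range [0, 1].
        composite = (composite - composite.min()) / (composite.max() - composite.min())
    return composite


df = pd.DataFrame({"score1": [10, 20, 30, 40, 50], "score2": [15, 25, 35, 45, 55]})
composite = combine_scores_sketch(df, ["score1", "score2"], [0.5, 0.5], scale=True)
```

Because both example columns are evenly spaced, the scaled composite score is evenly spaced between 0 and 1 as well.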

protflow.poses.description_from_path(path)[source]

Extracts “description” from a pose path.

Parameters:: path (str)
Return type:: str

protflow.poses.filter_dataframe_by_rank(df, col, n, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True)[source]

Filters the DataFrame to retain only the top-ranked rows based on a specified column.

Parameters:

df (pd.DataFrame) – The DataFrame to be filtered.
col (str) – The column in the DataFrame used for ranking.
n (Union[float, int]) – The number of top-ranked rows to retain. If n < 1, it represents a fraction of the total rows.
group_col (str, optional) – Group dataframe by this column, then filter individual groups.
remove_layers (int, optional) – The number of layers to remove from the column values before ranking. This helps in grouping similar rows.
layer_col (str, optional) – The column used for layer-based grouping of rows (default is “poses_description”).
sep (str, optional) – The separator used in the layer descriptions (default is “_”).
ascending (bool, optional) – If True, ranks rows in ascending order; otherwise, in descending order (default is True).

Returns:

The filtered DataFrame containing only the top-ranked rows.

Return type:

pd.DataFrame

Further Details

This function filters the DataFrame to retain only the top-ranked rows based on the values in a specified column. It supports fractional ranking, layer-based grouping, and sorting in ascending or descending order. The function also allows for removing layers from column values before ranking to handle grouped data.

Example

import pandas as pd
from poses import filter_dataframe_by_rank

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain the top 3 rows based on the score column
filtered_df = filter_dataframe_by_rank(df, col='score', n=3)

Notes

The function raises a KeyError if the specified column is not found in the DataFrame.
Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.
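
A simplified sketch of rank-based filtering is shown below. It covers only the n, group_col, and ascending parameters; layer removal is left out. The name filter_by_rank_sketch, the 'group' column in the sample data, and the choice to truncate fractional row counts are assumptions for the example, not documented ProtFlow behavior.

```python
import pandas as pd


def filter_by_rank_sketch(df, col, n, group_col=None, ascending=True):
    """Keep the n best rows by col, optionally within each group (illustrative)."""
    if col not in df.columns:
        raise KeyError(f"Column {col} not found in DataFrame")

    def top_rows(frame):
        # A value below 1 is read as a fraction of the rows in this frame.
        count = int(len(frame) * n) if n < 1 else int(n)
        return frame.sort_values(col, ascending=ascending).head(count)

    if group_col is None:
        return top_rows(df)
    # Rank inside each group, then stitch the groups back together.
    return pd.concat([top_rows(group) for _, group in df.groupby(group_col, sort=False)])


df = pd.DataFrame({
    "poses_description": ["pose1", "pose2", "pose3", "pose4", "pose5"],
    "group": ["a", "a", "b", "b", "b"],
    "score": [10, 20, 30, 40, 50],
})
top3 = filter_by_rank_sketch(df, col="score", n=3)
best_per_group = filter_by_rank_sketch(df, col="score", n=1, group_col="group", ascending=False)
```

With ascending=True the rows with the lowest values are kept, which suits energy-like scores; ascending=False keeps the highest values instead.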

protflow.poses.filter_dataframe_by_value(df, col, value, operator)[source]

Filters the DataFrame based on a specified value in a column using the provided comparison operator.

Parameters:

df (pd.DataFrame) – The DataFrame to be filtered.
col (str) – The column in the DataFrame used for filtering.
value (Union[float, int]) – The value used as the threshold for filtering rows.
operator (str) – The comparison operator used for filtering (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’, ‘!=’).

Returns:

The filtered DataFrame containing only the rows that meet the filtering criteria.

Return type:

pd.DataFrame

Further Details

This function filters the DataFrame based on a specified value in a column, using the provided comparison operator. It supports various comparison operators such as greater than, less than, equal to, and not equal to.

Example

import pandas as pd
from poses import filter_dataframe_by_value

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain rows where the score is greater than 30
filtered_df = filter_dataframe_by_value(df, col='score', value=30, operator='>')

Notes

The function raises a KeyError if the specified column is not found in the DataFrame.
Ensures that the DataFrame is properly filtered based on the provided criteria.
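
One common way to implement operator-string filtering is a lookup table from operator strings to functions in Python's operator module, sketched below. The names OPERATORS and filter_by_value_sketch are hypothetical, and raising KeyError for an unsupported operator is an assumption; the documentation above only specifies the KeyError for a missing column.

```python
import operator as op

import pandas as pd

# Map the operator strings accepted by the documented function to callables.
OPERATORS = {">": op.gt, ">=": op.ge, "<": op.lt, "<=": op.le, "=": op.eq, "!=": op.ne}


def filter_by_value_sketch(df, col, value, operator):
    """Keep rows where `df[col] <operator> value` holds (illustrative only)."""
    if col not in df.columns:
        raise KeyError(f"Column {col} not found in DataFrame")
    if operator not in OPERATORS:
        raise KeyError(f"Unsupported operator: {operator}")
    # Comparing a Series with a scalar yields a boolean mask for row selection.
    mask = OPERATORS[operator](df[col], value)
    return df[mask]


df = pd.DataFrame({
    "poses_description": ["pose1", "pose2", "pose3", "pose4", "pose5"],
    "score": [10, 20, 30, 40, 50],
})
filtered = filter_by_value_sketch(df, col="score", value=30, operator=">")
```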

protflow.poses.get_format(path)[source]

Returns the appropriate pandas function to load a file based on its extension.

Parameters:

path (str) – The path to the file whose format needs to be determined.

Returns:

function – The pandas function corresponding to the file format (e.g., pd.read_json, pd.read_csv).

Further Details

This function determines the appropriate pandas function to use for loading a file based on its extension. It supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.

Example

import pandas as pd
from poses import get_format

# Determine the format function for a JSON file
load_function = get_format('path/to/data.json')

# Use the function to load the data
df = load_function('path/to/data.json')

Notes

Raises a KeyError if the file format is not supported.
Ensures that the appropriate pandas function is returned based on the file extension.
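
The extension lookup can be sketched with a dictionary of pandas readers. The table below is an assumption built from the formats named above (the extensions ProtFlow actually recognizes may differ), and get_format_sketch is a hypothetical name.

```python
from pathlib import Path

import pandas as pd

# Assumed extension-to-loader table, based on the formats named above.
LOADERS = {
    "json": pd.read_json,
    "csv": pd.read_csv,
    "pickle": pd.read_pickle,
    "feather": pd.read_feather,
    "parquet": pd.read_parquet,
}


def get_format_sketch(path):
    """Return the pandas reader matching the file extension of path."""
    extension = Path(path).suffix.lstrip(".").lower()
    # An unknown extension surfaces as a KeyError from the dictionary lookup.
    return LOADERS[extension]


load_function = get_format_sketch("path/to/data.json")
```

Returning the reader itself, rather than loading the file, lets the caller pass format-specific keyword arguments when reading.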

protflow.poses.load_poses(poses_path)[source]

Loads poses from a specified file and returns a Poses instance.

Parameters:

poses_path (str) – The path to the file containing the poses to be loaded.

Returns:

A Poses instance with poses loaded from the specified file.

Return type:

Poses

Further Details

This function reads a file containing poses and returns a Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.

Example

from poses import Poses, load_poses

# Load poses from a file
poses_instance = load_poses('path/to/poses.json')

Notes

The function supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.
Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.

protflow.poses.normalize_series(ser, scale=False)[source]

Normalizes a pandas Series by subtracting the median and dividing by the standard deviation, with an option to scale the values.

Parameters:

ser (pd.Series) – The pandas Series to be normalized.
scale (bool, optional) – If True, scales the normalized values to a range between 0 and 1 (default is False).

Returns:

The normalized (and optionally scaled) Series.

Return type:

pd.Series

Further Details

This function normalizes a pandas Series by first subtracting the median and then dividing by the standard deviation. If the scale parameter is set to True, the normalized values are further scaled to a range between 0 and 1. This normalization process centers the data around zero and adjusts for variability, making the values comparable.

Example

import pandas as pd
from poses import normalize_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Normalize the Series
normalized_series = normalize_series(sample_series, scale=True)

Notes

If all values in the Series are the same, the function returns a Series of zeros.
The optional scaling step ensures that the values are adjusted to a standardized range.

protflow.poses.scale_series(ser)[source]

Scales a pandas Series to a range between 0 and 1.

Parameters:

ser (pd.Series) – The pandas Series to be scaled.

Returns:

The scaled Series with values between 0 and 1.

Return type:

pd.Series

Further Details

This function scales a pandas Series to a range between 0 and 1. It ensures that the minimum value in the Series becomes 0 and the maximum value becomes 1, with all other values adjusted proportionately.

Example

import pandas as pd
from poses import scale_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Scale the Series
scaled_series = scale_series(sample_series)

Notes

If all values in the Series are the same, the function returns a Series of zeros.
The scaling process adjusts the values to fit within a standardized range, making them comparable.
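
The arithmetic behind normalize_series and scale_series can be sketched as below. The function names are hypothetical and the code is an illustration rather than the ProtFlow source; in particular, the use of the pandas default (sample) standard deviation is an assumption.

```python
import pandas as pd


def scale_series_sketch(ser):
    """Min-max scale to [0, 1]; a constant Series maps to zeros."""
    span = ser.max() - ser.min()
    if span == 0:
        return pd.Series(0.0, index=ser.index)
    return (ser - ser.min()) / span


def normalize_series_sketch(ser, scale=False):
    """Subtract the median, divide by the standard deviation, optionally scale."""
    std = ser.std()
    if std == 0:
        # All values identical: return zeros instead of dividing by zero.
        return pd.Series(0.0, index=ser.index)
    normalized = (ser - ser.median()) / std
    return scale_series_sketch(normalized) if scale else normalized


sample = pd.Series([10, 20, 30, 40, 50])
scaled = scale_series_sketch(sample)
normalized = normalize_series_sketch(sample)
constant = normalize_series_sketch(pd.Series([7, 7, 7]))
print(scaled.tolist())
# [0.0, 0.25, 0.5, 0.75, 1.0]
```

The explicit guards implement the documented special case in which a Series of identical values yields zeros.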

protflow.residues module

residues

The residues module is a part of the protflow package and is designed to handle residue selection and related operations in protein structures. This module provides functionality to parse, manipulate, and convert residue selections in various formats, making it an essential tool for bioinformatics and computational biology workflows.

The module includes the ResidueSelection class for representing and manipulating selections of residues, as well as various functions for parsing and converting residue selections.

Classes

ResidueSelection
Represents a selection of residues with functionality for parsing, converting, and manipulating selections.
AtomSelection
Represents an ordered selection of atoms for atom-level operations.

Functions

fast_parse_selection
Fast parser for selections already in ResidueSelection format.
parse_selection
Parses a selection into the ResidueSelection format.
parse_residue
Parses a single residue identifier into a tuple (chain, residue_index).
residue_selection
Creates a ResidueSelection from a selection of residues.
from_dict
Creates a ResidueSelection object from a dictionary specifying a motif.
from_contig
Creates a ResidueSelection object from a contig string.
reduce_to_unique
Reduces an input array to its unique elements while preserving order.

Example Usage

Creating and manipulating ResidueSelection objects:

from residues import ResidueSelection, from_dict, from_contig

# Create a ResidueSelection from a list
selection = ResidueSelection(["A1", "A2", "B3"])

# Convert to string
selection_str = selection.to_string()
print(selection_str)
# Output: A1, A2, B3

# Convert to dictionary
selection_dict = selection.to_dict()
print(selection_dict)
# Output: {'A': [1, 2], 'B': [3]}

# Create a ResidueSelection from a dictionary
selection_from_dict = from_dict({"A": [1, 2], "B": [3]})
print(selection_from_dict.to_string())
# Output: A1, A2, B3

# Create a ResidueSelection from a contig string
selection_from_contig = from_contig("A1-A3, B5")
print(selection_from_contig.to_string())
# Output: A1, A2, A3, B5

This module simplifies the process of handling residue selections in bioinformatics workflows, providing a consistent interface for different types of input and output formats.

class protflow.residues.AtomSelection(atoms)[source]

Bases: object

Represent an ordered selection of atoms in a protein structure.

Atom IDs can be compact IDs (chain_id, res_id, atom_name) using model 0 implicitly, or full BioPython-style IDs with model and structure IDs. Atom ordering is preserved because RMSD calculation pairs atoms by position.

Parameters:

atoms (AtomSelection, dict, list, or tuple) –

Ordered atom selection to normalize. Supported atom ID forms are:

(chain_id, residue_id, atom_name)
(model_id, chain_id, residue_id, atom_name)
(structure_id, model_id, chain_id, residue_id, atom_name)
(structure_id, model_id, chain_id, residue_id, atom_name, altloc)

residue_id can be a compact integer-like value or a BioPython residue ID tuple (hetero_flag, residue_number, insertion_code). atom_name can be a string or a BioPython disordered atom ID tuple (atom_name, altloc). A scorefile-style dictionary with an "atoms" key is also accepted.

atoms

Tuple of normalized atom IDs. Nested lists are converted to tuples so selections can be compared and used in set-like operations.

Type:: tuple

Raises:

TypeError – If atoms is not an AtomSelection, scorefile dictionary, or ordered sequence of atom IDs.
ValueError – If any atom ID has an unsupported shape or invalid chain, residue, or atom-name component.

Parameters:

atoms (Any)

Notes

AtomSelection preserves order deliberately. Many atom-level operations, such as RMSD or geometry calculations, pair atoms by position rather than treating the selection as an unordered set.

Examples

Create a compact atom selection:

atoms = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])

Create the same selection from scorefile-compatible data:

atoms = AtomSelection({"atoms": [["A", 1, "N"], ["A", 1, "CA"]]})
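
The conversion of nested lists to tuples mentioned for the atoms attribute matters because tuples are hashable and lists are not. The helper below is a hypothetical stand-in that shows the idea; it is not the normalization code used by AtomSelection and performs none of its validation.

```python
def to_nested_tuple(value):
    """Recursively convert lists (and tuples) into tuples; leave other values as-is."""
    if isinstance(value, (list, tuple)):
        return tuple(to_nested_tuple(item) for item in value)
    return value


# Scorefile-style data arrives as nested lists, e.g. after a JSON round trip.
raw_atoms = [["A", 1, "N"], ["A", [" ", 1, " "], "CA"]]
atom_ids = to_nested_tuple(raw_atoms)

# Tuples can be compared and placed in sets, which lists cannot.
unique_ids = set(atom_ids)
```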

__add__(other)[source]

Combine two AtomSelections while preserving order and uniqueness.

Parameters:

other (AtomSelection) – Selection to append to self. Atoms already present in self are skipped, matching the behavior of ResidueSelection.__add__().

Returns:

AtomSelection – New selection containing all atoms from self followed by atoms from other that were not already present.
NotImplemented – Returned when other is not an AtomSelection, allowing Python’s binary operator fallback behavior.

Examples

a = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])
b = AtomSelection([("A", 1, "CA"), ("A", 1, "C")])
(a + b).to_tuple()
# (("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"))

__init__(atoms)[source]

Normalize and store an ordered atom selection.

Parameters:: atoms (Any)
Return type:: None

__iter__()[source]: Iterate over normalized atom IDs in selection order.

__len__()[source]

Return the number of atom IDs in the selection.

Return type:: int

__str__()[source]

Return a string representation of the tuple-backed selection.

Return type:: str

__sub__(other)[source]

Remove atoms in another AtomSelection from this selection.

Parameters:

other (AtomSelection) – Selection whose atoms should be removed from self.

Returns:

AtomSelection – New selection containing atoms from self whose normalized atom IDs are absent from other. Original order is preserved.
NotImplemented – Returned when other is not an AtomSelection.

Examples

a = AtomSelection([("A", 1, "N"), ("A", 1, "CA")])
b = AtomSelection([("A", 1, "CA")])
(a - b).to_tuple()
# (("A", 1, "N"),)
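
The order-preserving semantics of __add__ and __sub__ can be illustrated with plain tuples. The two helper functions below are hypothetical and only mirror the documented behavior; they are not the AtomSelection implementation.

```python
def ordered_union(first, second):
    """Atoms of first, followed by atoms of second that are not already present."""
    result = list(first)
    seen = set(first)
    for atom in second:
        if atom not in seen:
            seen.add(atom)
            result.append(atom)
    return tuple(result)


def ordered_difference(first, second):
    """Atoms of first that do not occur in second, in their original order."""
    removed = set(second)
    return tuple(atom for atom in first if atom not in removed)


a = (("A", 1, "N"), ("A", 1, "CA"))
b = (("A", 1, "CA"), ("A", 1, "C"))

print(ordered_union(a, b))
# (('A', 1, 'N'), ('A', 1, 'CA'), ('A', 1, 'C'))
print(ordered_difference(a, b))
# (('A', 1, 'N'),)
```

A set is used only for membership tests; the result is always built by walking the inputs in order, which is what keeps positional pairing (for example in RMSD calculations) intact.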

static from_dict(input_dict, pose=None, residue_id_format='auto')[source]

Create an AtomSelection from a scorefile dict, nested atom dict, or RFD3 dict.

This is the dictionary-oriented constructor for AtomSelection. It supports three dictionary dialects:

{"atoms": [...]} for ProtFlow scorefile-compatible atom selections.
{"A": {1: ["N", "CA"]}} for explicit chain/residue/atom-name mappings.
RFD3 InputSelection dictionaries such as {"A1-2": "BKBN", "LIG": "C1,O1"}.

Parameters:

input_dict (dict) – Dictionary describing an atom selection in one of the supported forms listed above.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to expand RFD3 aliases or residue-name selectors. A pose is required when values use ALL or TIP, when keys select ligands/residue names, or when exact atom names should be checked against the input structure.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls how residue IDs are written when atoms are read from pose. "auto" uses compact integer residue IDs for standard residues and BioPython residue IDs for hetero residues. "compact" always writes integer residue IDs. "biopython" always writes BioPython residue IDs.

Returns:

Normalized atom selection described by input_dict.

Return type:

AtomSelection

Raises:

TypeError – If input_dict is not a dictionary or if atom-name values have an unsupported type.
ValueError – If the dictionary uses structure-dependent syntax but no pose is provided, or if requested atoms/components cannot be resolved.

Examples

Parse scorefile-compatible data:

AtomSelection.from_dict({"atoms": [["A", 1, "N"], ["A", 1, "CA"]]})

Parse a nested chain/residue mapping:

AtomSelection.from_dict({"A": {1: ["N", "CA"], 2: "C,O"}})

Parse an RFD3 InputSelection dictionary against a PDB file:

AtomSelection.from_dict({"A1-2": "BKBN", "LIG": "C1,O1"}, pose="input.pdb")

static from_list(atoms)[source]

Create an AtomSelection from an ordered list or tuple of atom IDs.

Parameters:

atoms (list or tuple) – Ordered atom IDs in any format accepted by AtomSelection. Passing a single atom ID such as ("A", 1, "N") is also supported.

Returns:

Normalized atom selection preserving the order supplied in atoms.

Return type:

AtomSelection

Raises:

TypeError – If atoms is not sequence-like.
ValueError – If any atom ID is malformed.

Examples

AtomSelection.from_list([("A", 1, "N"), ("A", 1, "CA")])

static from_rfd3_contig(input_contig, pose=None, atom_names='ALL', model_id=0, residue_id_format='auto')[source]

Create an AtomSelection from indexed parts of an RFD3 contig string.

Generated-length components such as 10/10-20 and chain breaks like /0 are skipped. With pose provided, atom_names="ALL" expands to the atoms present in the structure and ligand/residue-name components can be resolved. Without a pose, atom_names must be an explicit atom list or an alias that does not require structure context such as BKBN.

Parameters:

input_contig (str) – RFD3 contig string. Indexed residue components such as "A1", "A1-5", and "A1-A5" are converted to atom IDs. Diffused length components and chain breaks are ignored because they do not refer to atoms in the input structure.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to expand ALL atoms, validate explicit atom names, and resolve ligand/residue-name components. If omitted, only indexed residue components with explicit atom-name values can be parsed.
atom_names (str, list, or tuple, optional) – Atom names to select from every indexed component. Supported RFD3 aliases are "ALL", "BKBN", and "TIP". Explicit names can be supplied as comma-separated strings such as "N,CA,C,O" or as lists/tuples of strings.
model_id (int or str, optional) – BioPython model identifier used when pose is a Structure object or a path to a multi-model file. Defaults to 0.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.

Returns:

Atom selection for the indexed input components in input_contig.

Return type:

AtomSelection

Raises:

TypeError – If input_contig is not a string.
ValueError – If a selected component or requested atom cannot be resolved, or if structure-dependent syntax is used without pose.

Examples

Select backbone atoms from indexed residues without loading a pose:

AtomSelection.from_rfd3_contig("10,A1-2,/0,B5", atom_names="BKBN")

Select all atoms present in an input structure:

AtomSelection.from_rfd3_contig("A1-2,/0,Z9", pose="input.pdb")
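
How indexed components are separated from generated-length components and chain breaks can be sketched with a regular expression. This is a simplified illustration with hypothetical names (INDEXED, indexed_atoms_sketch), not the ProtFlow parser: it handles only single-letter chains with non-negative residue numbers, takes explicit atom names only, and silently skips ligand or residue-name components, which would need a pose.

```python
import re

# Matches indexed components such as "A1", "A1-5", and "A1-A5".
INDEXED = re.compile(r"^([A-Za-z])(\d+)(?:-[A-Za-z]?(\d+))?$")


def indexed_atoms_sketch(contig, atom_names=("N", "CA", "C", "O")):
    """Expand indexed contig components into (chain, residue, atom_name) tuples."""
    atoms = []
    for component in contig.split(","):
        match = INDEXED.match(component.strip())
        if match is None:
            # Generated lengths ("10", "10-20") and chain breaks ("/0") are skipped.
            continue
        chain, start, end = match.group(1), int(match.group(2)), match.group(3)
        for res_id in range(start, int(end or start) + 1):
            atoms.extend((chain, res_id, name) for name in atom_names)
    return tuple(atoms)


ca_atoms = indexed_atoms_sketch("10,A1-2,/0,B5", atom_names=("CA",))
print(ca_atoms)
# (('A', 1, 'CA'), ('A', 2, 'CA'), ('B', 5, 'CA'))
```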

static from_rfd3_input_selection(input_selection, pose=None, model_id=0, residue_id_format='auto')[source]

Create an AtomSelection from an RFD3 InputSelection value.

Supported RFD3 forms are booleans, contig-style strings, and dictionaries whose keys are residue/ligand selections and whose values are atom names, ALL, BKBN, TIP, or explicit atom-name lists. A pose is required for booleans, ALL, TIP, and ligand/residue name selection because those cases need the actual atoms in the input structure.

Parameters:

input_selection (None, bool, str, dict, AtomSelection, list, or tuple) –
RFD3 InputSelection-like value to parse. Supported forms are:

None
Returns an empty AtomSelection.

True / False
Select all atoms in pose or no atoms, respectively.

str
Parses a contig-style selector such as "A1-10,B5" or a ligand/residue name such as "LIG". String selections imply ALL atoms for matching components.

dict
Parses RFD3 dictionary syntax where keys are components and values are atom selectors, e.g. {"A1": "BKBN"}.

AtomSelection or atom-ID list/tuple
Normalizes the existing atom selection directly.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used for syntax that depends on actual atoms or residue names.
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.

Returns:

Normalized atom selection represented by input_selection.

Return type:

AtomSelection

Raises:

TypeError – If input_selection has an unsupported type.
ValueError – If the selection requires a structure but pose is absent, or if selected residues/atoms cannot be found.

Notes

This parser mirrors the user-facing RFD3 InputSelection grammar without importing RFD3 or Foundry at runtime. It intentionally returns concrete atom IDs rather than RFD3 masks.

Examples

Parse explicit atoms without a pose:

AtomSelection.from_rfd3_input_selection({"A1-2": "BKBN"})

Parse ligand atoms and TIP atoms from a structure:

AtomSelection.from_rfd3_input_selection({"LIG": "ALL", "A20": "TIP"}, pose="input.pdb")

static from_rfd3_input_spec(input_spec, pose=None, fields=None, include_ligand=True, model_id=0, residue_id_format='auto')[source]

Parse RFD3 InputSelection fields from one InputSpecification.

Returns a dictionary mapping each parsed field name to an AtomSelection. If pose is not provided, input_spec["input"] is used when present. The RFD3 ligand field is included by default even though it is not typed as InputSelection in RFD3 itself.

Parameters:

input_spec (dict) – One RFD3 InputSpecification dictionary, for example one value from an RFD3Params object.
pose (str, os.PathLike, Bio.PDB entity, optional) – Input structure used to resolve InputSelection fields. When omitted, input_spec["input"] is used if present.
fields (list or tuple of str, optional) – InputSelection field names to parse. Defaults to all RFD3 InputSelection fields known to ProtFlow: contig, unindex, select_fixed_atoms, select_unfixed_sequence, select_buried, select_partially_buried, select_exposed, select_hbond_donor, select_hbond_acceptor, and select_hotspots.
include_ligand (bool, optional) – If True (default), parse the RFD3 ligand field into an AtomSelection under the key "ligand".
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.

Returns:

Mapping from each parsed input-specification field to the corresponding AtomSelection. Fields absent from input_spec or set to None are omitted.

Return type:

dict[str, AtomSelection]

Raises:

TypeError – If input_spec is not a dictionary.
ValueError – If any requested field cannot be resolved to atoms.

Examples

Parse all atom-level selections from an RFD3 spec:

spec = {
    "input": "input.pdb",
    "contig": "A1-20,/0,50-80",
    "select_fixed_atoms": {"A10": "BKBN", "LIG": "C1,O1"},
    "ligand": "LIG",
}
selections = AtomSelection.from_rfd3_input_spec(spec)
fixed_atoms = selections["select_fixed_atoms"]

static from_rfd3_ligand(ligand, pose, model_id=0, residue_id_format='auto')[source]

Create an AtomSelection from an RFD3 ligand specification.

Ligands can be selected by residue name ("LIG" or "LIG,ACT") or by indexed residue components such as "Z9".

Parameters:

ligand (str) – RFD3 ligand selector. Comma-separated residue names select all matching non-protein residues in the input structure. Indexed residue components such as "Z9" can also be used.
pose (str, os.PathLike, Bio.PDB entity) – Input structure containing the ligand atoms. This argument is required because ligand names must be resolved against the actual structure.
model_id (int or str, optional) – BioPython model identifier used for structure-backed parsing.
residue_id_format ({"auto", "compact", "biopython"}, optional) – Controls residue ID formatting for atoms loaded from pose.

Returns:

Selection containing all atoms selected by the ligand specification.

Return type:

AtomSelection

Raises:

ValueError – If pose is omitted or if the ligand selector does not match the input structure.

Examples

Select all atoms in ligands named LIG and ACT:

AtomSelection.from_rfd3_ligand("LIG,ACT", pose="input.pdb")

to_dict()[source]

Return a scorefile-friendly dictionary representation.

Return type:: dict[str, list[Any]]

to_list()[source]

Return the ordered atom selection in JSON-friendly list format.

Return type:: list[Any]

to_tuple()[source]

Return the ordered atom selection as tuples.

Return type:: tuple[tuple[Any, …], …]

class protflow.residues.ResidueSelection(selection=None, delim=',', fast=False, from_scorefile=False)[source]

Bases: object

Represent a selection of residues in a protein structure.

A selection of residues is represented as a tuple with the hierarchy ((chain, residue_idx), …).

Parameters:

selection (list, optional) – A list of residues in string format, e.g., [“A1”, “A2”, “B3”]. Default is None.
delim (str, optional) – The delimiter used to parse the selection string. Default is “,”.
fast (bool, optional) – If True, parses the selection without any type checking. Use when selection is already in ResidueSelection format. Default is False.
from_scorefile (bool)

residues

A tuple representing the parsed residues selection.

Type:: tuple

Examples

>>> from residues import ResidueSelection
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}

from_selection(selection)[source]

Constructs a ResidueSelection instance from the provided selection.

Parameters:: selection (list or str) – The selection of residues to be parsed.
Returns:: A new ResidueSelection instance.
Return type:: ResidueSelection

to_dict()[source]

Converts the ResidueSelection to a dictionary.

Note

Converting to a dictionary destroys the ordering of specific residues on the same chain in a motif.

Returns:: A dictionary representation of the ResidueSelection with chains as keys and lists of residue indices as values.
Return type:: dict

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}
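The ordering caveat in the note can be illustrated with a minimal sketch. This is not ProtFlow's implementation: group_by_chain is a hypothetical helper that only assumes the ((chain, residue_idx), …) tuple layout described for ResidueSelection, and it shows one way in which ordering information is lost.

```python
def group_by_chain(residues):
    """Group (chain, index) pairs into {chain: [indices]} (hypothetical helper)."""
    grouped = {}
    for chain, idx in residues:
        grouped.setdefault(chain, []).append(idx)
    return grouped


# Two differently ordered motifs collapse to the same dictionary, so the
# original interleaving of chains A and B cannot be recovered from it.
motif_1 = (("A", 1), ("B", 3), ("A", 2))
motif_2 = (("A", 1), ("A", 2), ("B", 3))
same = group_by_chain(motif_1) == group_by_chain(motif_2)
```

If the residue order of a motif matters downstream, the tuple or list representations are the safer choice.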

to_list(ordering=None)[source]

Converts the ResidueSelection to a list of strings.

Parameters:: ordering (str, optional) – Specifies the ordering of the residues in the output list. Options are “rosetta” or “pymol”. Default is None.
Returns:: The list representation of the ResidueSelection.
Return type:: list of str

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_list())
['A1', 'A2', 'B3']
>>> print(selection.to_list(ordering="rosetta"))
['1A', '2A', '3B']

to_rfdiffusion_contig()[source]

Parses ResidueSelection object to contig string for RFdiffusion.

Example

If self.residues = (("A", 1), ("A", 2), ("A", 3), ("C", 4), ("C", 6)), the output will be "A1-3,C4,C6".

Return type:: str
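The conversion in the example above can be pictured as collapsing runs of consecutive residues on the same chain into ranges. The following is a minimal sketch under that assumption; collapse_to_contig and format_range are hypothetical names, and the actual method may treat edge cases differently.

```python
def format_range(chain, first, last):
    """Format a single residue or a residue range on one chain."""
    return f"{chain}{first}" if first == last else f"{chain}{first}-{last}"


def collapse_to_contig(residues):
    """Collapse ((chain, idx), ...) into a contig-like string (illustrative only)."""
    if not residues:
        return ""
    parts = []
    chain, first = residues[0]
    last = first
    for next_chain, idx in residues[1:]:
        if next_chain == chain and idx == last + 1:
            last = idx  # extend the current run
            continue
        parts.append(format_range(chain, first, last))
        chain, first, last = next_chain, idx, idx  # start a new run
    parts.append(format_range(chain, first, last))
    return ",".join(parts)


contig = collapse_to_contig((("A", 1), ("A", 2), ("A", 3), ("C", 4), ("C", 6)))
```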

to_string(delim=',', ordering=None)[source]

Converts the ResidueSelection to a string.

Parameters:

delim (str, optional) – The delimiter to use in the resulting string. Default is “,”.
ordering (str, optional) – Specifies the ordering of the residues in the output string. Options are “rosetta” or “pymol”. Default is None.

Returns:

The ResidueSelection formatted as a string, with residues separated by delim.

Return type:

str

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_string(ordering="rosetta"))
1A, 2A, 3B

protflow.residues.fast_parse_selection(input_selection)[source]

Fast selection parser for pre-formatted selections.

This function is a fast parser for residue selections that are already in the ResidueSelection format. It bypasses any additional type checking or parsing to improve performance when the input is guaranteed to be correctly formatted.

Parameters:: input_selection (tuple of tuple of (str, int)) – A tuple of tuples where each inner tuple represents a residue with the format (chain, residue_index).
Returns:: The input selection, unchanged.
Return type:: tuple of tuple of (str, int)

Examples

>>> input_selection = (("A", 1), ("B", 2), ("C", 3))
>>> fast_parse_selection(input_selection)
(('A', 1), ('B', 2), ('C', 3))

protflow.residues.from_contig(input_contig)[source]

Creates a ResidueSelection object from a contig string.

This function constructs a ResidueSelection instance from a contig string. The contig string can specify ranges of residues using a hyphen (-) to denote the range, with residues separated by commas (,). For example, “A1-A3, B5” specifies residues A1, A2, A3, and B5.

Parameters:: input_contig (str) – A contig string specifying the residues. Ranges can be denoted using hyphens, and residues are separated by commas.
Returns:: An instance of the ResidueSelection class representing the parsed selection of residues.
Return type:: ResidueSelection
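The range expansion described above can be sketched as follows. This is an illustration rather than the library's parser: expand_contig is a hypothetical name, it returns plain tuples instead of a ResidueSelection, and it assumes single-letter chains and non-negative residue indices.

```python
def expand_contig(contig):
    """Expand a contig string such as "A1-A3, B5" into ((chain, idx), ...)."""
    residues = []
    for part in contig.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            chain, first = start[0], int(start[1:])
            # The range end may repeat the chain letter ("A3") or omit it ("3").
            last = int(end[1:]) if end[0].isalpha() else int(end)
            residues.extend((chain, idx) for idx in range(first, last + 1))
        else:
            residues.append((part[0], int(part[1:])))
    return tuple(residues)


expanded = expand_contig("A1-A3, B5")
```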

Examples

>>> from_contig("A1-A3, B5")
<ResidueSelection object representing ('A', 1), ('A', 2), ('A', 3), ('B', 5)>

>>> from_contig("C1, C3-C5, D2")
<ResidueSelection object representing ('C', 1), ('C', 3), ('C', 4), ('C', 5), ('D', 2)>

protflow.residues.from_dict(input_dict)[source]

Creates a ResidueSelection object from a dictionary.

This function constructs a ResidueSelection instance from a dictionary where the keys represent chain identifiers and the values are lists of residue indices. This format specifies a motif in the following way: {chain: [residues], …}.

Parameters:: input_dict (dict) – A dictionary specifying the motif. The keys are chain identifiers (str) and the values are lists of residue indices (int).
Returns:: An instance of the ResidueSelection class representing the parsed selection of residues.
Return type:: ResidueSelection

Examples

>>> input_dict = {"A": [1, 2], "B": [3, 4]}
>>> from_dict(input_dict)
<ResidueSelection object representing ('A', 1), ('A', 2), ('B', 3), ('B', 4)>

protflow.residues.parse_from_scorefile(input_selection)[source]

Helper to parse ResidueSelection object from ProtFlow scorefile format.

Parameters:: input_selection (dict)
Return type:: tuple[tuple[str, int]]

protflow.residues.parse_residue(residue_identifier)[source]

Parses a single residue identifier into a tuple (chain, residue_index).

This function takes a residue identifier string and parses it into a tuple containing the chain identifier and the residue index. It currently only supports single-letter chain identifiers.

Parameters:: residue_identifier (str) – A string representing the residue identifier. The format is expected to be either “chain+residue_index” or “residue_index+chain”, where “chain” is a single letter and “residue_index” is an integer.
Returns:: A tuple containing the chain identifier and the residue index.
Return type:: tuple of (str, int)

Examples

>>> parse_residue("A123")
('A', 123)

>>> parse_residue("123A")
('A', 123)

Notes

The function determines whether the chain identifier is at the beginning or the end of the string based on whether the first character is a digit.
Only single-letter chain identifiers are supported.
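The rule in the notes (inspect the first character to decide where the chain letter sits) can be sketched in a few lines. parse_residue_sketch is a hypothetical stand-in and performs no validation beyond what is shown.

```python
def parse_residue_sketch(residue_identifier):
    """Parse "A123" or "123A" into ("A", 123) (illustrative only)."""
    ident = residue_identifier.strip()
    if ident[0].isdigit():
        # Index first, single-letter chain last: "123A"
        return (ident[-1], int(ident[:-1]))
    # Single-letter chain first, index last: "A123"
    return (ident[0], int(ident[1:]))


chain_first = parse_residue_sketch("A123")
chain_last = parse_residue_sketch("123A")
```

Because only one character is taken as the chain, an identifier with a multi-letter chain would fail at the int() conversion in this sketch.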

protflow.residues.parse_selection(input_selection, delim=',', fast=False, from_scorefile=False)[source]

Parses a selection into ResidueSelection formatted selection.

This function takes a selection of residues in various formats and parses it into the ResidueSelection format, which is a tuple of tuples. Each inner tuple represents a residue with the format (chain, residue_index).

Parameters:

input_selection (str, list, or tuple) –
The selection of residues to be parsed. This can be:
- A string with residues separated by a delimiter.
- A list or tuple of residue strings.
- A list or tuple of lists/tuples, where each inner list/tuple represents a residue.
delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is “,”.
fast (bool, optional) – If True, uses fast_parse_selection to bypass type checking and parsing for performance reasons. Use when input_selection is already in the correct format. Default is False.
from_scorefile (bool, optional) – If True, parses a residue selection that was read in from a scorefile (in the form {'residues': [['A', 1], ['B', 3]]}). Default is False.

Returns:

A tuple of tuples where each inner tuple represents a residue in the format (chain, residue_index).

Return type:

tuple of tuple of (str, int)

Raises:

TypeError – If input_selection is not a supported type (str, list, or tuple).

Examples

>>> parse_selection("A1, B2, C3")
(('A', 1), ('B', 2), ('C', 3))

>>> parse_selection(["A1", "B2", "C3"])
(('A', 1), ('B', 2), ('C', 3))

>>> parse_selection([["A", 1], ["B", 2], ["C", 3]])
(('A', 1), ('B', 2), ('C', 3))

>>> parse_selection([("A", 1), ("B", 2), ("C", 3)], fast=True)
(('A', 1), ('B', 2), ('C', 3))
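The accepted input shapes can be summarised as a dispatch on the input type. The sketch below is an assumption about how such a normalisation could look, not the module's code; parse_selection_sketch is a hypothetical name and the fast and from_scorefile paths are left out.

```python
def parse_selection_sketch(input_selection, delim=","):
    """Normalise several input shapes to ((chain, index), ...) (illustrative only)."""

    def parse_one(item):
        if isinstance(item, str):
            item = item.strip()
            # The chain letter may lead ("A1") or trail ("1A").
            if item[0].isdigit():
                return (item[-1], int(item[:-1]))
            return (item[0], int(item[1:]))
        chain, idx = item  # already a (chain, index) pair
        return (str(chain), int(idx))

    if isinstance(input_selection, str):
        items = [part for part in input_selection.split(delim) if part.strip()]
    elif isinstance(input_selection, (list, tuple)):
        items = input_selection
    else:
        raise TypeError(f"Unsupported selection type: {type(input_selection).__name__}")
    return tuple(parse_one(item) for item in items)


from_string = parse_selection_sketch("A1, B2, C3")
from_pairs = parse_selection_sketch([["A", 1], ["B", 2], ["C", 3]])
```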

protflow.residues.reduce_to_unique(input_array)[source]

Reduces an input array to its unique elements while preserving order.

This function takes a list or tuple and returns a new list or tuple containing only the unique elements from the input, with their original order preserved. The type of the returned collection matches the type of the input.

Parameters:: input_array (list or tuple) – The input array from which to remove duplicate elements. The order of the elements is preserved.
Returns:: A new list or tuple containing only the unique elements from the input array, with the original order preserved.
Return type:: list or tuple

Examples

>>> reduce_to_unique([1, 2, 2, 3, 1])
[1, 2, 3]

>>> reduce_to_unique(("a", "b", "a", "c", "b"))
('a', 'b', 'c')

Notes

The function uses OrderedDict.fromkeys to remove duplicates while preserving order.
The returned collection is of the same type as the input (list or tuple).
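The OrderedDict.fromkeys approach mentioned in the notes can be sketched as below. reduce_to_unique_sketch is a hypothetical name; the sketch assumes hashable elements and a list or tuple input.

```python
from collections import OrderedDict


def reduce_to_unique_sketch(input_array):
    """Drop duplicates, keep first-seen order, and return the input's own type."""
    unique_keys = OrderedDict.fromkeys(input_array)
    return type(input_array)(unique_keys)


unique_list = reduce_to_unique_sketch([1, 2, 2, 3, 1])
unique_tuple = reduce_to_unique_sketch(("a", "b", "a", "c", "b"))
```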

protflow.residues.residue_selection(input_selection, delim=',')[source]

Creates a ResidueSelection from a selection of residues.

This function takes an input selection of residues in various formats and creates a ResidueSelection object. The selection can be provided as a string, list, or tuple.

Parameters:

input_selection (str, list, or tuple) –
The selection of residues to be parsed. This can be:
- A string with residues separated by a delimiter.
- A list or tuple of residue strings.
- A list or tuple of lists/tuples, where each inner list/tuple represents a residue.
delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is “,”.

Returns:

An instance of the ResidueSelection class representing the parsed selection of residues.

Return type:

ResidueSelection

Examples

>>> residue_selection("A1, B2, C3")
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>

>>> residue_selection(["A1", "B2", "C3"])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>

>>> residue_selection([["A", 1], ["B", 2], ["C", 3]])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>

protflow.runners module

runners module

This module provides functionality for handling the interaction between runners and poses in protein data processing workflows.

It includes classes and utility functions to:

Manage the output from runner processes.
Define abstract runner interfaces.
Parse and manage command-line options and flags for runner processes.

Dependencies:

Standard library: logging, os, re
pandas
protflow.poses: Poses, get_format, FORMAT_STORAGE_DICT
protflow.jobstarters: JobStarter

Overview:

The runners module is designed to facilitate the integration of various runner processes with protein pose data, ensuring consistent data formatting, error handling, and integration of results into the Poses class. Utility functions provided in this module support the parsing and handling of command-line options and flags, making it easier to configure and execute runner processes in a flexible manner.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.runners.Runner[source]

Bases: object

Abstract Runner base class

The Runner class provides an abstract base for defining runners that handle the interface between runner processes and the Poses class. It includes methods for running jobs, checking paths, verifying prefixes, preparing pose options, and managing job setup and score files.

Examples

To create a custom runner, subclass Runner and implement the abstract methods:

>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"
...
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass

Example usage:

>>> my_runner = MyRunner()
>>> poses = Poses()
>>> jobstarter = JobStarter()
>>> runner_output = my_runner.run(poses, "example_prefix", jobstarter)

exception CrashError[source]

Bases: RuntimeError

Re-raised error with job stderr context when collect_scores fails.

classmethod __init_subclass__(**kwargs)[source]: overwrites subclasses to check for exceptions

__str__()[source]

Abstract method to provide the name of the runner.

This method should be overridden in subclasses to return the name of the runner.

Raises:: NotImplementedError – If the method is not overridden in the subclass.

Examples

>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"

check_for_existing_scorefile(scorefile, overwrite=False)[source]

Checks if a scorefile exists and returns it as a DataFrame if overwrite is False.

Parameters:

scorefile (str) – The path to the scorefile.
overwrite (bool, optional) – Whether to overwrite the scorefile if it exists (default is False).

Returns:

The scorefile as a DataFrame if it exists and overwrite is False. None otherwise.

Return type:

pandas.DataFrame or None

Examples

>>> runner = MyRunner()
>>> scores_df = runner.check_for_existing_scorefile("/path/to/scorefile.csv")

check_for_prefix(prefix, poses)[source]

Checks if a column with the given prefix already exists in the Poses DataFrame.

Parameters:

prefix (str) – The prefix to be checked.
poses (Poses) – An instance of the Poses class whose DataFrame will be checked.

Raises:

KeyError – If a column with the given prefix already exists in the Poses DataFrame.

Return type:

None

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> runner.check_for_prefix("example_prefix", poses)

generic_run_setup(poses, prefix, jobstarters, make_work_dir=True)[source]

Sets up the runner’s working directory and jobstarter.

Checks if the prefix exists in poses.df, sets up a jobstarter, and creates the working directory if necessary.

Returns absolute path to working directory and the jobstarter that will be used for the runner.

Parameters:

poses (Poses) – An instance of the Poses class.
prefix (str) – The prefix to be used for the setup.
jobstarters (list[JobStarter]) – A list of JobStarter instances to choose from.
make_work_dir (bool, optional) – Whether to create the working directory if it does not exist (default is True).
Note: The order of jobstarters in the jobstarters parameter is [Runner.run(jobstarter), Runner.jobstarter, poses.default_jobstarter].

Returns:

A tuple containing the path to the working directory and the selected JobStarter instance.

Return type:

tuple[str, JobStarter]

Raises:

ValueError – If no valid JobStarter is set.

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> jobstarters = [JobStarter(), JobStarter(), JobStarter()]
>>> work_dir, jobstarter = runner.generic_run_setup(poses, "example_prefix", jobstarters)
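The selection among the candidate jobstarters can be pictured as "take the first usable entry in priority order". The sketch below assumes that an unset jobstarter is represented by None, which the documentation does not state explicitly; pick_first_set is a hypothetical helper, and strings stand in for JobStarter instances.

```python
def pick_first_set(jobstarters):
    """Return the first candidate that is not None, in priority order."""
    for candidate in jobstarters:
        if candidate is not None:
            return candidate
    raise ValueError("No valid JobStarter was set.")


# Priority order: run() argument, runner attribute, poses default.
chosen = pick_first_set([None, "runner_jobstarter", "poses_default_jobstarter"])
```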

prep_pose_options(poses, pose_options=None)[source]

Prepares pose options, ensuring they are of the same length as the poses.

Parameters:

poses (Poses) – An instance of the Poses class.
pose_options (list[str], optional) – A list of pose options to be prepared. If not provided, an empty list will be used.

Returns:

A list of prepared pose options.

Return type:

list

Raises:

ValueError – If the length of pose_options does not match the length of poses.

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> prepared_options = runner.prep_pose_options(poses, ["option1", "option2"])

run(poses, prefix, jobstarter)[source]

Abstract method to run jobs and send scores to Poses.

This method should be overridden in subclasses to define the job execution logic and integrate the results into the Poses class.

Parameters:

poses (Poses) – An instance of the Poses class to be processed.
prefix (str) – Prefix to be added to the results columns.
jobstarter (JobStarter) – An instance of the JobStarter class to handle job execution.

Returns:

An instance of the RunnerOutput class containing the processed results.

Return type:

RunnerOutput

Raises:

NotImplementedError – If the method is not overridden in the subclass.

Examples

>>> class MyRunner(Runner):
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass

save_runner_scorefile(scores, scorefile)[source]

Saves the runner’s scorefile based on the file extension format.

Parameters:

scores (pandas.DataFrame) – The DataFrame containing the scores to be saved.
scorefile (str) – The path to the scorefile to be saved.

Raises:

KeyError – If the file extension format is not recognized.

Return type:

None

Examples

>>> import pandas as pd
>>> runner = MyRunner()
>>> scores_df = pd.DataFrame({'score': [1, 2, 3]})
>>> runner.save_runner_scorefile(scores_df, "/path/to/scorefile.csv")
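Saving "based on the file extension" with a KeyError for unknown formats suggests a dictionary lookup keyed by extension. The sketch below illustrates that pattern with the standard library only. WRITERS, write_csv, write_json and save_scores_sketch are hypothetical names, rows are plain dictionaries instead of a DataFrame, and the set of formats ProtFlow supports is not shown in this section.

```python
import csv
import json
import os


def write_csv(rows, path):
    """Write a non-empty list of dict rows as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def write_json(rows, path):
    """Write a list of dict rows as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)


# Hypothetical extension-to-writer table.
WRITERS = {"csv": write_csv, "json": write_json}


def save_scores_sketch(rows, scorefile):
    """Pick a writer by file extension; unknown extensions raise KeyError."""
    extension = os.path.splitext(scorefile)[1].lstrip(".").lower()
    WRITERS[extension](rows, scorefile)
```

A lookup table keeps the supported formats in one place, and the KeyError arises from the failed lookup without any extra branching.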

search_path(input_path, path_name, is_dir=False)[source]

Checks if a given path exists and is valid.

Parameters:

input_path (str) – The path to be checked.
path_name (str) – The name associated with the path, used for error messages.
is_dir (bool)

Returns:

The validated path.

Return type:

str

Raises:

ValueError – If the path is not set or does not exist on the local filesystem.

Examples

>>> runner = MyRunner()
>>> valid_path = runner.search_path("/path/to/file", "example_path")

class protflow.runners.RunnerOutput(poses, results, prefix, index_layers=0, index_sep='_')[source]

Bases: object

RunnerOutput class

The RunnerOutput class handles how protein data is passed between Runner and Poses classes. It ensures the correct formatting of results and facilitates the integration of runner outputs into the Poses data structure.

Parameters:

poses (Poses) – An instance of the Poses class.
results (pandas.DataFrame) – A DataFrame containing the results to be checked and formatted. The DataFrame must contain ‘description’ and ‘location’ columns.
prefix (str) – A prefix to be added to the results columns.
index_layers (int, optional) – Number of index layers to remove from the ‘description’ column (default is 0).
index_sep (str, optional) – Separator used in the index (default is “_”).
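The effect of index_layers and index_sep can be illustrated with a small sketch. It assumes that index layers are trailing fields of the description, separated by index_sep, so that removing them recovers the description of the input pose; this reading is an assumption, and strip_index_layers is a hypothetical helper.

```python
def strip_index_layers(description, index_layers=0, index_sep="_"):
    """Remove the last index_layers separator-delimited fields from a description."""
    if index_layers == 0:
        return description
    return description.rsplit(index_sep, index_layers)[0]


parent = strip_index_layers("pose_0001_0003", index_layers=1)
```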

check_data_formatting(results)[source]

Checks if the input DataFrame has the correct format.

Parameters:: results (pandas.DataFrame) – The input DataFrame to be checked. It must contain ‘description’ and ‘location’ columns.
Returns:: The validated and formatted DataFrame.
Return type:: pandas.DataFrame
Raises:: ValueError – If the input DataFrame does not contain the required columns or if the ‘description’ column does not match the ‘location’ column.

return_poses()[source]

Integrates the output of a runner into a Poses class.

This method adds the output of a Runner class formatted in RunnerOutput into Poses.df and returns the updated Poses instance.

Returns:: The updated Poses instance with the integrated runner output.
Return type:: Poses
Raises:: ValueError – If merging DataFrames fails due to no overlap between Poses.df[‘poses_description’] and results[new_df_col] or if some rows in results[new_df_col] were not found in Poses.df[‘poses_description’].

class protflow.runners.SbatchArrayRunnerTimer(runner)[source]

Bases: Runner

SbatchArrayRunnerTimer Class

Instrumentation wrapper that profiles any ProtFlow Runner on SLURM.

SbatchArrayRunnerTimer wraps an arbitrary Runner instance and, after each call to run(), queries SLURM’s accounting database via get_SLURM_stats() to collect per-job resource statistics. All timing and statistics records are accumulated in history and can be exported at any time via report().

The class inherits from Runner and uses __getattr__() to transparently proxy every attribute lookup to the wrapped runner, so it can serve as a drop-in replacement in any ProtFlow pipeline without modifying the surrounding code.
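The attribute proxying described above relies on the fact that Python calls __getattr__ only when normal attribute lookup fails. The sketch below shows the general pattern with hypothetical stand-in classes (FakeRunner, ProxyWrapper); it is not the SbatchArrayRunnerTimer source.

```python
class FakeRunner:
    """Stand-in for a wrapped runner (hypothetical)."""

    def __init__(self):
        self.script_path = "/path/to/tool"

    def describe(self):
        return "fake runner"


class ProxyWrapper:
    """Forward unknown attribute lookups to the wrapped object."""

    def __init__(self, runner):
        self.runner = runner

    def __getattr__(self, name):
        # Reached only when normal lookup on the wrapper itself fails.
        if name == "runner":
            # Avoid infinite recursion if 'runner' has not been set yet.
            raise AttributeError(name)
        return getattr(self.runner, name)


wrapped = ProxyWrapper(FakeRunner())
path_via_proxy = wrapped.script_path
```

Attributes defined on the wrapper itself are found first, which is what allows a wrapper to override run() while forwarding everything else.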

Warning

Profiling relies on get_SLURM_stats(), which calls sacct and therefore requires the process to be running on the cluster login node. See get_SLURM_stats() for details.

Parameters:: runner (Runner) – Any instantiated ProtFlow Runner (e.g. CalibySequenceDesign, LigandMPNN, etc.) whose run() calls should be timed and profiled.

runner

The wrapped runner instance.

Type:: Runner

history

Accumulated statistics records. Each entry corresponds to one successfully profiled run() call and contains all keys returned by get_SLURM_stats() plus the four keys added by run() (runner_class, prefix, total_python_wall_sec, overhead_plus_queue_sec). Empty until the first successful profiled run completes.

Type:: list of dict

job_ids

SLURM job names recorded for each run() call, in call order. An entry of None indicates that last_job_name could not be retrieved (e.g. because a non-SLURM jobstarter was used and the guard did not fire before the append).

Type:: list of str or None

session_start

ISO-8601 timestamp (YYYY-MM-DDTHH:MM:SS) set at construction time to one minute before instantiation. Passed as start_time to every get_SLURM_stats() call so that only jobs from the current session are returned by sacct, preventing false matches against stale jobs with the same name from earlier sessions.

Type:: str

Notes

__init__ calls super().__init__() to satisfy the Runner base-class contract, making all base-class utilities (e.g. scorefile helpers) available on self in addition to the wrapped self.runner.
session_start is backdated by one minute to guard against off-by-one errors on clusters with coarse sacct timestamp resolution.
history grows unboundedly across run() calls within the same Python session. For very long pipelines, consider calling report() periodically and resetting self.history = [] if memory usage is a concern.
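The backdated session_start timestamp can be reproduced with the standard library. This is a minimal sketch of the idea; backdated_timestamp is a hypothetical helper and the class may build its timestamp differently.

```python
from datetime import datetime, timedelta


def backdated_timestamp(now=None, minutes=1):
    """Return an ISO-8601 timestamp (YYYY-MM-DDTHH:MM:SS) set `minutes` before `now`."""
    now = now or datetime.now()
    return (now - timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%S")


# A fixed reference time makes the one-minute shift easy to see.
stamp = backdated_timestamp(datetime(2024, 1, 1, 0, 0, 30))
```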

Examples

Wrap a LigandMPNN runner and time three sequential design rounds:

from protflow.runners.ligandmpnn import LigandMPNN
from protflow.runners.sbatch_array_runner_timer import SbatchArrayRunnerTimer

timed_runner = SbatchArrayRunnerTimer(LigandMPNN())

for prefix in ["round1", "round2", "round3"]:
    poses = timed_runner.run(poses, prefix=prefix, nseq=20)

summary = timed_runner.report(prefix="full_pipeline")
print(summary[["prefix", "total_python_wall_sec", "avg_task_runtime_sec"]])

report(prefix=None)[source]

Export accumulated timing and SLURM statistics to disk and return as a DataFrame.

Converts history to a DataFrame and, when prefix is provided, writes two files to the current working directory:

<prefix>_stats.csv — the full statistics table, one row per profiled run() call, written with to_csv() (index column included).
<prefix>_job_ids.txt — a newline-delimited list of all SLURM job names from job_ids, in the order the runs were performed.

Parameters:

prefix (str, optional) – Filename stem for the output files. When None, no files are written and only the in-memory DataFrame is returned. When provided, both output files are created or overwritten in the current working directory.

Returns:

DataFrame built from history, with one row per profiled run() call. Columns are the union of all keys present in history entries. Guaranteed columns (when at least one profiled run has completed) include:

runner_class (str) – Class name of the wrapped runner for that run.
prefix (str) – The prefix used in that run() call.
total_python_wall_sec (float) – Total Python wall-clock time for that run (seconds).
overhead_plus_queue_sec (float) – Estimated overhead + queue-wait time (seconds).
job_name (str) – SLURM job name queried by get_SLURM_stats().
total_cpu_sec (int) – Total CPU-core-seconds reserved across all tasks.
avg_task_runtime_sec (float) – Mean per-task wall-clock elapsed time (seconds).
max_task_runtime_sec (int) – Longest per-task wall-clock elapsed time (seconds).
min_task_runtime_sec (int) – Shortest per-task wall-clock elapsed time (seconds).
num_tasks (int) – Number of SLURM array tasks.
total_cpus_reserved (int) – Total CPU cores allocated across all tasks.
state (str) – Aggregated job-array completion state.
queried_after (str or None) – The sacct start-time filter used for that query.

Returns an empty DataFrame when history is empty (i.e. before any profiled run has completed, or when all runs used a non-SLURM jobstarter).

Return type:

pandas.DataFrame

Notes

report() is called automatically at the end of every successful profiled run() call using that run’s prefix, so the CSV and job-ID files are always up to date after each run. Manual calls to report() are useful for retrieving an in-memory summary or writing a consolidated report under a different prefix after multiple runs.
The job-ID file is written from job_ids (not from the job_name column of history), which means it includes entries from runs where last_job_name was None or where the non-SLURM guard fired before the append. None values will appear as the literal string "None" in the file.
Output files are written with UTF-8 encoding and will overwrite existing files of the same name without prompting.
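The note about None entries follows from how Python converts values to text. The sketch below assumes the job-ID file is produced by joining str() of each entry with newlines; write_job_ids is a hypothetical helper, not the method's actual code.

```python
def write_job_ids(job_ids, path):
    """Write one job name per line; None becomes the literal string "None"."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(str(job_id) for job_id in job_ids))
```

Filtering out None before writing would hide unprofiled runs, so keeping the literal "None" preserves the one-entry-per-run correspondence.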

Examples

Inspect stats after two runs and write a combined report:

timed = SbatchArrayRunnerTimer(CalibySequenceDesign())
poses = timed.run(poses, prefix="round1", nseq=5)
poses = timed.run(poses, prefix="round2", nseq=10)

df = timed.report(prefix="pipeline_summary")
# Writes:
#   pipeline_summary_stats.csv
#   pipeline_summary_job_ids.txt
print(df[["prefix", "total_python_wall_sec", "avg_task_runtime_sec"]])
#      prefix  total_python_wall_sec  avg_task_runtime_sec
# 0    round1                 245.12                228.40
# 1    round2                 510.87                491.33

In-memory summary without writing files:

df = timed.report()   # prefix=None — no files written
print(df[["state", "num_tasks", "total_cpu_sec"]].to_string())

run(poses, prefix, jobstarter=None, **kwargs)[source]

Execute the wrapped runner and collect timing and SLURM statistics.

Delegates the actual computation to runner via self.runner.run(poses, prefix, jobstarter, **kwargs) and then, if a SbatchArrayJobstarter was used, queries SLURM’s accounting database for per-job resource statistics using get_SLURM_stats(). The combined timing and cluster stats record is appended to history and report() is called automatically to persist an up-to-date CSV and job-ID file.

The method measures time across three consecutive phases:

Phase 1 — wrapper start: time.perf_counter() is captured immediately before delegating to the wrapped runner.
Phase 2 — runner execution: the full body of self.runner.run(), which internally performs ProtFlow setup, submits the SLURM array job, blocks until all tasks complete (wait=True), and post-processes the results.
Phase 3 — wrapper end: time.perf_counter() is captured immediately after the wrapped runner returns.

Parameters:

poses (Poses) – Input pose collection, forwarded verbatim to self.runner.run.
prefix (str) – Column prefix and working-directory identifier forwarded to self.runner.run and used to name the output CSV and job-ID files written by report().
jobstarter (JobStarter, optional) – Job submission backend. When provided, this value is passed to the wrapped runner and is also used to determine whether SLURM accounting can be queried. When omitted, the jobstarter is resolved from self.runner.jobstarter and then from poses.default_jobstarter for the purpose of stat collection.
**kwargs – All additional keyword arguments are forwarded unchanged to self.runner.run, making the timer fully compatible with any runner regardless of its specific signature.

Returns:

Poses – The Poses object returned by the wrapped runner, unchanged. Timing and statistics are stored in history and written to disk by report(); they do not alter the returned poses.

Side Effects

When profiling succeeds (SLURM jobstarter detected and last_job_name is set), the following side effects occur:

A 5-second sleep is inserted via time.sleep(5) to allow the SLURM accounting database to synchronise before sacct is queried.
A statistics dictionary is appended to history. The dictionary contains all keys from get_SLURM_stats() (see its return-value documentation) plus the following four keys added by this method:

runner_class (str) – __class__.__name__ of the wrapped runner (e.g. "CalibySequenceDesign").
prefix (str) – The prefix argument passed to this call.
total_python_wall_sec (float) – Total elapsed wall-clock time in seconds from Python’s perspective (Phase 1 → Phase 3), rounded to 2 decimal places. Encompasses ProtFlow setup, SLURM queue wait, cluster execution, and result post-processing.
overhead_plus_queue_sec (float) – total_python_wall_sec minus runtime_sec from SLURM, rounded to 2 decimal places. Approximates the combined cost of ProtFlow overhead and scheduler queue wait. May be negative in rare cases due to clock skew between the login node and compute nodes, or rounding in sacct.

The SLURM job name is appended to job_ids.
<prefix>_stats.csv and <prefix>_job_ids.txt are written (or overwritten) in the current working directory via report().

Warns:

logging.WARNING – Emitted when the resolved jobstarter is not an instance of SbatchArrayJobstarter. Message format: "Stats skipped: <type> does not support SLURM accounting.". Profiling is skipped entirely and the unmodified poses are returned immediately.

Return type:

Poses

Notes

The jobstarter resolution priority is: argument → self.runner.jobstarter → poses.default_jobstarter. This mirrors the fallback chain used by most ProtFlow runners and ensures that the correct jobstarter is identified for stat collection even when it was set on the runner at construction time.
total_python_wall_sec includes SLURM queue wait time because the wrapped runner calls start() with wait=True, blocking until all array tasks complete before returning.
If last_job_name is None (e.g. the jobstarter was never used to submit a job), the stats-collection block is skipped entirely and history is not updated, even though the jobstarter type check passes.
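The wall-clock measurement and the derived overhead figure can be sketched with time.perf_counter(). The helpers timed_call and overhead_plus_queue are hypothetical, and the runtime value passed to the second one stands in for the runtime_sec reported by SLURM.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds rounded to 2 places)."""
    start = time.perf_counter()        # Phase 1: wrapper start
    result = func(*args, **kwargs)     # Phase 2: wrapped call blocks until done
    end = time.perf_counter()          # Phase 3: wrapper end
    return result, round(end - start, 2)


def overhead_plus_queue(total_python_wall_sec, runtime_sec):
    """Wall-clock time not accounted for by the cluster runtime."""
    return round(total_python_wall_sec - runtime_sec, 2)


result, wall_sec = timed_call(sum, [1, 2, 3])
```

time.perf_counter() is a monotonic clock, so the measured interval cannot go negative. The subtraction can still be negative because the two values come from different clocks.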

Examples

Basic timed run:

timed = SbatchArrayRunnerTimer(CalibySequenceDesign())
poses = timed.run(
    poses,
    prefix="sd_round1",
    nseq=10,
    jobstarter=SbatchArrayJobstarter(max_cores=50),
)
print(timed.history[-1]["total_python_wall_sec"])    # e.g. 312.45
print(timed.history[-1]["overhead_plus_queue_sec"])  # e.g.  18.72
print(timed.history[-1]["runner_class"])             # "CalibySequenceDesign"
print(timed.history[-1]["state"])                    # "COMPLETED"

Passing runner-specific kwargs transparently:

timed = SbatchArrayRunnerTimer(LigandMPNN())
poses = timed.run(
    poses,
    prefix="mpnn",
    nseq=20,
    model_type="ligand_mpnn",
    fixed_residues_col="binding_site",
)

Non-SLURM jobstarter (profiling skipped, poses still returned):

from protflow.jobstarters import LocalJobStarter
poses = timed.run(poses, prefix="local_test", jobstarter=LocalJobStarter())
# Logs: WARNING - Stats skipped: <class 'LocalJobStarter'>
#                 does not support SLURM accounting.
# timed.history is unchanged.

protflow.runners.col_in_df(df, column)[source]

Checks if a column exists in a DataFrame.

This function verifies whether a specified column is present in the given DataFrame. If the column is not found, it raises a KeyError.

Parameters:

df (pandas.DataFrame) – The DataFrame to be checked.
column (str) – The name of the column to be verified.

Raises:

KeyError – If the specified column is not found in the DataFrame.

Return type:

None

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_in_df(df, 'A')  # No error raised
>>> col_in_df(df, 'C')  # Raises KeyError
Traceback (most recent call last):
    ...
KeyError: 'Could not find C in poses dataframe! Are you sure you provided the right column name?'

protflow.runners.expand_options_flags(options_str, sep='--')[source]

Simple parsing function to parse options and flags from an input string.

Splits an input string into options and flags only based on a specified separator! If your command has more complex patterns in its options, then switch to “regex_expand_options_flags”. Options are key-value pairs, while flags are standalone keys without values.

Parameters:

options_str (str) – The input string containing options and flags to be parsed.
sep (str, optional) – The separator used to distinguish different options and flags (default is "--").

Returns:

A tuple containing a dictionary of options and a set of flags.

Return type:

tuple[dict, set]

Examples

>>> options_str = "--width 800 --height 600 --verbose"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}

>>> options_str = "--color=blue --debug --timeout=30"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'color': 'blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
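Separator-based parsing of this kind can be sketched as follows. split_options_flags is a hypothetical re-implementation written only to reproduce the two examples above; the real function may differ in details such as whitespace handling.

```python
import re


def split_options_flags(options_str, sep="--"):
    """Split a command-line style string into (options dict, flags set)."""
    options, flags = {}, set()
    for chunk in options_str.split(sep):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The key ends at the first whitespace or "="; the rest is the value.
        parts = re.split(r"[\s=]+", chunk, maxsplit=1)
        if len(parts) == 2 and parts[1]:
            options[parts[0]] = parts[1]
        else:
            flags.add(parts[0])
    return options, flags


opts, flags = split_options_flags("--width 800 --height 600 --verbose")
```

Because the string is split on every occurrence of the separator, a value that itself contains "--" would be cut apart, which is the kind of case the regex-based variant mentioned above is meant for.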

protflow.runners.options_flags_to_string(options, flags, sep='--', no_quotes=False)[source]

Converts options dictionary and flags list into a single string.

This function combines a dictionary of options and a list of flags into a single command-line style string.

Parameters:

options (dict) – A dictionary of options, where keys are option names and values are option values.
flags (list) – A list of flags (standalone options without values).
sep (str, optional) – The separator used to distinguish different options and flags (default is "--").
no_quotes (bool, optional) – If True, disables the quoting of command-line argument values that contain whitespace (default is False). For example, a list-valued option should normally be quoted, as in "--my_list='1 4 6 14'". Setting no_quotes=True would instead produce "--my_list=1 4 6 14", which can cause errors.

Returns:

A string representation of the combined options and flags.

Return type:

str

Examples

>>> options = {'width': '800', 'height': '600'}
>>> flags = ['verbose', 'debug']
>>> options_flags_to_string(options, flags)
" --width=800 --height=600 --verbose --debug"

>>> options = {'color': 'dark blue', 'timeout': '30'}
>>> flags = ['force']
>>> options_flags_to_string(options, flags)
" --color='dark blue' --timeout=30 --force"

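The quoting rule controlled by no_quotes can be sketched as follows. This is an illustration under assumptions, not ProtFlow's implementation: join_options_flags is a hypothetical name, the "key=value" layout and leading space are taken from the documented example output, and returning an empty string for empty input is an assumption.

```python
def join_options_flags(options: dict, flags: list, sep: str = "--", no_quotes: bool = False) -> str:
    """Hypothetical re-implementation of the documented behaviour (illustration only)."""
    parts = []
    for key, value in options.items():
        value = str(value)
        # Quote values containing whitespace unless quoting is disabled.
        if not no_quotes and any(ch.isspace() for ch in value):
            value = f"'{value}'"
        parts.append(f"{sep}{key}={value}")
    parts.extend(f"{sep}{flag}" for flag in flags)
    # Leading space mirrors the documented example output (assumed here).
    return (" " + " ".join(parts)) if parts else ""


options = {"color": "dark blue", "timeout": "30"}
quoted = join_options_flags(options, ["force"])
unquoted = join_options_flags(options, ["force"], no_quotes=True)
print(quoted)    #  --color='dark blue' --timeout=30 --force
print(unquoted)  #  --color=dark blue --timeout=30 --force
```

In the unquoted variant a shell would treat "blue" as a separate argument, which is the kind of error the no_quotes description warns about.
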
protflow.runners.parse_generic_options(options, pose_options, sep='--')[source]

Parses generic options and pose-specific options from two input strings, combining them into a single dictionary of options and a list of flags. Pose-specific options overwrite generic options in case of conflicts. Options are expected to be separated by a specified separator within each input string, with options and their values separated by spaces.

Parameters:

options (str) – A string of generic options, where different options are separated by the specified separator and each option’s value (if any) is separated from its name by a space.
pose_options (str) – A string of pose-specific options, formatted like the options parameter. These options take precedence over generic options.
sep (str, optional) – The separator used to distinguish between different options in both input strings (default is "--").

Returns:

tuple: A 2-element tuple where the first element is a dictionary of merged options (key-value pairs) and the second element is a list of unique flags (options without values) from both input strings.

Examples:

>>> parse_generic_options("--width 800 --height 600", "--color blue --verbose")
({'width': '800', 'height': '600', 'color': 'blue'}, ['verbose'])

This function internally utilizes a helper function expand_options_flags to process each input string separately before merging the results, ensuring that pose-specific options and flags are appropriately prioritized and duplicates are removed.

Parameters:

options (str)
pose_options (str)

Return type:

tuple[dict, list]
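
The precedence rule (pose-specific options overwrite generic ones, flags are de-duplicated) can be sketched on already-parsed inputs. This is a minimal illustration, not ProtFlow's code: merge_options is a hypothetical helper, and sorting the flags is an assumption made here only to get a deterministic order.

```python
def merge_options(generic: tuple[dict, set], pose: tuple[dict, set]) -> tuple[dict, list]:
    """Hypothetical merge step operating on (options, flags) pairs (illustration only)."""
    generic_opts, generic_flags = generic
    pose_opts, pose_flags = pose
    # Later entries win in a dict merge, so pose-specific values overwrite generic ones.
    merged_opts = {**generic_opts, **pose_opts}
    # The set union removes duplicate flags; sorted() is an assumption for stable output.
    merged_flags = sorted(generic_flags | pose_flags)
    return merged_opts, merged_flags


generic = ({"width": "800", "height": "600"}, {"verbose"})
pose = ({"width": "1024", "color": "blue"}, {"verbose", "debug"})
merged_opts, merged_flags = merge_options(generic, pose)
print(merged_opts)   # {'width': '1024', 'height': '600', 'color': 'blue'}
print(merged_flags)  # ['debug', 'verbose']
```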

protflow.runners.prepend_cmd(cmds, pre_cmd)[source]

Prepends a single command to all commands in a list.

Parameters:

cmds (list[str]) – A list of commands, where all elements are strings.
pre_cmd (str) – A string containing a command, which should be prepended to all commands in the commands list.

Returns:

A list of all commands with the additional command prepended to each.

Return type:

list[str]

Examples

>>> cmds = ["run_inference.sh pose_0001.pdb", "run_inference.sh pose_0002.pdb"]
>>> pre_cmd = "conda init"
>>> prepend_cmd(cmds, pre_cmd)
['conda init; run_inference.sh pose_0001.pdb', 'conda init; run_inference.sh pose_0002.pdb']

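A minimal sketch of the prepending step, assuming the "; " joiner seen in the documented example output. prepend_to_all is a hypothetical name used for illustration and is not ProtFlow's implementation.

```python
def prepend_to_all(cmds: list[str], pre_cmd: str) -> list[str]:
    """Hypothetical equivalent of the documented behaviour (illustration only)."""
    # Build a new list; the input list is left unchanged.
    return [f"{pre_cmd}; {cmd}" for cmd in cmds]


cmds = ["run_inference.sh pose_0001.pdb", "run_inference.sh pose_0002.pdb"]
prefixed = prepend_to_all(cmds, "conda init")
print(prefixed[0])  # conda init; run_inference.sh pose_0001.pdb
```

Because the two commands are joined with ";", a shell would run the second one even if the first fails.
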
protflow.runners.regex_expand_options_flags(options_str, sep='--')[source]

Parses options and flags from an input string using regular expressions.

This function uses regular expressions to split an input string into options and flags. It ensures that separators within quotes are not split.

Parameters:

options_str (str) – The input string containing options and flags to be parsed.
sep (str, optional) – The separator used to distinguish different options and flags (default is "--").

Returns:

A tuple containing a dictionary of options and a set of flags.

Return type:

tuple[dict, set]

Examples

>>> options_str = '--width 800 --height 600 --verbose'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}

>>> options_str = '--color="dark blue" --debug --timeout=30'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'color': 'dark blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
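
One way to keep separators inside quotes from being split is a lookahead that only accepts a separator followed by an even number of quote characters. The sketch below illustrates that technique; it is not ProtFlow's actual regular expression. regex_split_options_flags is a hypothetical name, and the sketch assumes quotes are balanced and treats single and double quotes interchangeably.

```python
import re


def regex_split_options_flags(options_str: str, sep: str = "--") -> tuple[dict, set]:
    """Hypothetical quote-aware parser (illustration only, not ProtFlow's code)."""
    # Accept the separator only if an even number of quote characters follows it,
    # which means the separator itself is not inside a quoted value.
    pattern = re.escape(sep) + r"""(?=(?:[^"']*["'][^"']*["'])*[^"']*$)"""
    opts, flags = {}, set()
    for chunk in re.split(pattern, options_str):
        chunk = chunk.strip()
        if not chunk:
            continue  # empty piece in front of the first separator
        # Split the key from its value at the first "=" or run of whitespace.
        parts = re.split(r"[=\s]+", chunk, maxsplit=1)
        if len(parts) == 1:
            flags.add(parts[0])
        else:
            key, value = parts
            opts[key] = value.strip().strip("\"'")  # drop surrounding quotes
    return opts, flags


opts, flags = regex_split_options_flags('--name "a--b" --verbose')
print(opts)   # {'name': 'a--b'}
print(flags)  # {'verbose'}
```

The quoted value "a--b" survives intact here, whereas splitting on the bare separator would cut it in two.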

Module contents

Package initialization

protflow.get_config()[source]

Return type:

object

protflow.require_config()[source]

Default function to be called in runners to require a set-up config.py file. This function imports and returns protflow.config.

Return type:

object