ProtFlow documentation

Subpackages

protflow.config_template module

This module contains the paths to all tools integrated in ProtFlow. PRE_CMD entries are commands that are run before the runner is executed (e.g., if a specific module must be imported for the environment to work).

protflow.jobstarters module

jobstarters

This module, jobstarters, provides a set of classes and methods to facilitate the submission and management of computing jobs on various job scheduling systems. JobStarters are passed to Runner objects in their .run() methods to facilitate a standardized execution of commands generated by the Runner. JobStarters can also be executed outside of Runner classes as is shown in the examples.

The JobStarter class defines a base JobStarter class with methods that need to be implemented by subclasses to start jobs and wait for their completion.

Overview

The module includes the following classes and methods:

Classes

  • JobStarter: An abstract base class that defines the interface for all jobstarters.

  • SbatchArrayJobstarter: A concrete implementation of JobStarter for managing SLURM job arrays.

  • LocalJobStarter: A concrete implementation of JobStarter for managing local jobs.

Usage

To use a jobstarter, instantiate an appropriate subclass (e.g., SbatchArrayJobstarter) and call its start method with the desired commands and options. Use the wait_for_job method if you need to wait for job completion.

Example

>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

Note

This module is designed to be extended with additional jobstarters for different scheduling systems as needed. If you want to implement your own JobStarter and need help, please contact any of the ProtFlow authors. We are happy about every contribution!

class protflow.jobstarters.JobStarter(max_cores=None)[source]

Bases: object

Abstract base class for job starters.

This class defines the interface for all job starters. Subclasses should implement methods to start jobs and wait for their completion. It also includes a method to set the maximum number of cores available for the jobs.

Examples

This class is designed to be extended by other classes that implement specific job scheduling systems.

Example subclass implementation:

class CustomJobStarter(JobStarter):
    def start(self, cmds, jobname, wait, output_path):
        # Implementation for starting jobs
        pass

    def wait_for_job(self, jobname, interval):
        # Implementation for waiting for job completion
        pass

Parameters:

max_cores (int)

__init__(max_cores=None)[source]

Initializes the JobStarter with an optional maximum number of cores.

Parameters:

max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is None.

set_last_error_message(error_path, read_bytes=16384)[source]

Saves content of an error logfile.

Parameters:
  • error_path (str) – The path to the error logfile.

  • read_bytes (int, optional) – Defines how many bytes of the log file should be read (starting from the back). Default is 16384.
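Reading a fixed number of bytes from the end of a log file can be sketched as follows. This is a minimal illustration, not ProtFlow's implementation: the function name read_log_tail and the UTF-8 decoding with replacement of undecodable bytes are assumptions.

```python
import os


def read_log_tail(error_path: str, read_bytes: int = 16384) -> str:
    """Return up to the last `read_bytes` bytes of a log file as text.

    Hypothetical helper for illustration; not part of ProtFlow.
    """
    with open(error_path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Jump back by read_bytes, but never before the start of the file.
        handle.seek(max(size - read_bytes, 0))
        # Assumption: logs are UTF-8; undecodable bytes are replaced.
        return handle.read().decode("utf-8", errors="replace")
```

Reading from the back keeps memory use bounded for very large logs while still capturing the most recent error output.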

set_max_cores(cores)[source]

Sets the maximum number of cores available for the jobs.

Parameters:

cores (int) – The maximum number of cores to set.

Return type:

None

start(cmds, jobname, wait, output_path)[source]

Submits a list of commands as jobs to the scheduling system.

Parameters:
  • cmds (list) – A list of commands to be submitted as jobs.

  • jobname (str) – The name of the job.

  • wait (bool) – Whether to wait for the job to complete before proceeding.

  • output_path (str) – The path where output files should be stored.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

Return type:

None

wait_for_job(jobname, interval)[source]

Waits for a job to complete before proceeding.

Parameters:
  • jobname (str) – The name of the job to wait for.

  • interval (float) – The interval (in seconds) at which to check the job status.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

Return type:

None
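The subclassing pattern described for JobStarter can be sketched without a ProtFlow installation. In the sketch below, JobStarterBase is a stand-in that mirrors the documented interface, and DryRunJobStarter is a hypothetical subclass that only records commands; neither class is part of ProtFlow.

```python
class JobStarterBase:
    """Stand-in mirroring the documented interface (hypothetical, for illustration)."""

    def __init__(self, max_cores=None):
        self.max_cores = max_cores

    def set_max_cores(self, cores):
        self.max_cores = cores

    def start(self, cmds, jobname, wait, output_path):
        raise NotImplementedError("Subclasses must implement start().")

    def wait_for_job(self, jobname, interval):
        raise NotImplementedError("Subclasses must implement wait_for_job().")


class DryRunJobStarter(JobStarterBase):
    """Hypothetical subclass that records commands instead of running them."""

    def __init__(self, max_cores=None):
        super().__init__(max_cores=max_cores)
        self.submitted = []

    def start(self, cmds, jobname, wait, output_path):
        # Record each command together with its job name.
        self.submitted.extend((jobname, cmd) for cmd in cmds)
        if wait:
            self.wait_for_job(jobname, interval=0)

    def wait_for_job(self, jobname, interval):
        # Nothing runs asynchronously here, so there is nothing to wait for.
        return None
```

A recording subclass of this kind can be handy for checking which commands a pipeline would submit before running it on a cluster.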

class protflow.jobstarters.LocalJobStarter(max_cores=1)[source]

Bases: JobStarter

Jobstarter that runs jobs locally using subprocess.run().

This class extends the JobStarter base class to provide functionality for running jobs locally on the machine. It handles the execution of commands using subprocesses, manages the maximum number of concurrent processes, and captures the output and error logs for each command.

Parameters:

max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.

Raises:

ProcessError – If a subprocess crashes during execution.
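The general technique of running commands locally with a cap on concurrent processes and per-command log files can be sketched as below. This is not ProtFlow's code: the function name run_local, the log-file naming scheme, the use of a thread pool, and RuntimeError as a stand-in for ProcessError are all assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_local(cmds, jobname, output_path="./", max_cores=1):
    """Run shell commands with at most `max_cores` running at once (hypothetical helper)."""
    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)

    def run_one(indexed_cmd):
        index, cmd = indexed_cmd
        # One stdout and one stderr log per command (assumed naming scheme).
        out_log = out_dir / f"{jobname}_{index}.out"
        err_log = out_dir / f"{jobname}_{index}.err"
        with open(out_log, "w") as out, open(err_log, "w") as err:
            result = subprocess.run(cmd, shell=True, stdout=out, stderr=err, check=False)
        return result.returncode

    # The pool size bounds how many subprocesses run concurrently.
    with ThreadPoolExecutor(max_workers=max_cores) as pool:
        returncodes = list(pool.map(run_one, enumerate(cmds)))

    failed = [cmd for cmd, code in zip(cmds, returncodes) if code != 0]
    if failed:
        # RuntimeError stands in for the ProcessError named in the documentation.
        raise RuntimeError(f"{len(failed)} command(s) failed: {failed}")
    return returncodes
```

Threads are sufficient here because each worker spends its time waiting on a subprocess rather than executing Python code.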

Examples

Example usage:

>>> from protflow.jobstarters import LocalJobStarter
>>> job_starter = LocalJobStarter(max_cores=2)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

__init__(max_cores=1)[source]

Initializes the LocalJobStarter with an optional parameter for maximum cores.

Parameters:

max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 1.

start(cmds, jobname, wait=True, output_path='./')[source]

Submits a list of commands to be run locally, managing the execution and logging of each command.

Parameters:
  • cmds (list) – List of commands to be executed locally.

  • jobname (str) – Name of the job.

  • wait (bool, optional) – Whether to wait for all commands to complete before returning. Default is True.

  • output_path (str, optional) – Path where output files should be stored. Default is “./”.

Raises:

ProcessError – If a subprocess crashes during execution.

Return type:

None

wait_for_job(jobname, interval)[source]

(No-op) Method for waiting for started jobs.

Parameters:
  • jobname (str) – Name of the job to wait for.

  • interval (float) – Interval (in seconds) at which to check the job status.

Return type:

None

class protflow.jobstarters.SbatchArrayJobstarter(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]

Bases: JobStarter

Jobstarter that manages the submission of job arrays to SLURM clusters.

This class extends the JobStarter base class to provide functionality specific to SLURM job arrays. It handles tasks such as generating command files, submitting jobs using sbatch, and waiting for job completion. It also supports options for GPU usage and automatic cleanup of command files after job completion.

Parameters:
  • max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.

  • remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.

  • options (str, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.

  • gpus (bool, optional) – Whether to use GPUs for the job. Default is False.

  • batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Raises:

TypeError – If the options parameter is not a string or list.

Examples

Example usage:

>>> from protflow.jobstarters import SbatchArrayJobstarter
>>> job_starter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True, options="--time=10:00", gpus=True)
>>> job_starter.start(cmds=["echo 'Hello World!'"], jobname="test_job", wait=True, output_path="/path/to/output")

__init__(max_cores=100, remove_cmdfile=False, options=None, gpus=False, batch_cmds=None)[source]

Initializes the SbatchArrayJobstarter with optional parameters.

Parameters:
  • max_cores (int, optional) – The maximum number of cores that can be used for the jobs. Default is 100.

  • remove_cmdfile (bool, optional) – Whether to remove the command file after job completion. Default is False.

  • options (str, optional) – Additional SBATCH options to be used when submitting jobs. Default is None.

  • gpus (bool, optional) – Whether to use GPUs for the job. Default is False.

  • batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Note

The options parameter must be set when the Jobstarter is created, not when the .start function is executed.

parse_options(options)[source]

Parses the SBATCH options.

Parameters:

options (object) – SBATCH options in string or list format.

Returns:

Parsed SBATCH options.

Return type:

str

Raises:

TypeError – If the options parameter is not a string or list.
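The documented contract of parse_options (string or list in, string out, TypeError otherwise) can be sketched as follows. The function name parse_sbatch_options and the choice to join list entries with single spaces are assumptions; the actual ProtFlow method may format options differently.

```python
def parse_sbatch_options(options):
    """Normalise SBATCH options given as a string or a list into one string.

    Hypothetical helper for illustration; not ProtFlow's implementation.
    """
    if isinstance(options, str):
        return options.strip()
    if isinstance(options, list):
        # Assumption: list entries are joined with single spaces.
        return " ".join(str(option).strip() for option in options)
    raise TypeError(f"options must be a str or list, got {type(options).__name__}")
```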

set_options(options, gpus)[source]

Sets the SBATCH options.

Parameters:
  • options (object) – SBATCH options in string or list format.

  • gpus (int) – Number of GPUs to be used per node.

Return type:

None

start(cmds, jobname, wait=True, output_path='./', batch_cmds=None)[source]

Writes commands into a command file and starts an SBATCH job running the command file.

Parameters:
  • cmds (list) – List of commands to be executed as part of the job array.

  • jobname (str) – Name of the job.

  • wait (bool, optional) – Whether to wait for the job to complete before returning. Default is True.

  • output_path (str, optional) – Path where output files should be stored. Default is “./”.

  • batch_cmds (int, optional) – If set, the input cmds are batched to the specified number. Default is None.

Raises:

RuntimeError – If the SLURM submission fails.

Return type:

None

wait_for_job(jobname, interval=5)[source]

Waits for SLURM jobs to be finished.

Parameters:
  • jobname (str) – Name of the job to wait for.

  • interval (float, optional) – Interval (in seconds) at which to check the job status. Default is 5.

Return type:

None

protflow.jobstarters.add_timestamp(x)[source]

Adds a unique timestamp to a string using the time library.

This function appends a unique timestamp to the given string. The timestamp is generated using the current time, which ensures that the resulting string is unique in most cases. The timestamp is added as a suffix, separated by an underscore.

Parameters:

x (str) – The input string to which the timestamp will be added.

Returns:

The input string with a unique timestamp appended.

Return type:

str

Examples

>>> add_timestamp("jobname")
'jobname_1632417284'

Notes

The timestamp is derived from the current time in seconds since the epoch, with the fractional part of the seconds included to ensure higher precision and uniqueness.

protflow.jobstarters.split_list(input_list, element_length=None, n_sublists=None)[source]

Splits a list into nested sublists with specified lengths or number of sublists.

This function divides the input list into a nested list of sublists. The division can be based on the maximum length of each sublist or the desired number of sublists. Only one of the parameters, element_length or n_sublists, should be specified at a time.

Parameters:
  • input_list (list) – The list to be split into sublists.

  • element_length (int, optional) – The maximum length of each sublist. If specified, the input list will be split into sublists each having up to element_length elements.

  • n_sublists (int, optional) – The desired number of sublists. If specified, the input list will be divided into n_sublists sublists.

Returns:

A nested list containing the sublists.

Return type:

list

Raises:

ValueError – If both element_length and n_sublists are specified or if neither is specified.

Examples

Splitting a list into sublists of a specified maximum length:

>>> split_list([1, 2, 3, 4, 5, 6], element_length=2)
[[1, 2], [3, 4], [5, 6]]

Splitting a list into a specified number of sublists:

>>> split_list([1, 2, 3, 4, 5, 6], n_sublists=3)
[[1, 2], [3, 4], [5, 6]]

Notes

  • If n_sublists is specified and is greater than the length of the input list, the number of sublists will be equal to the length of the input list.

  • If neither element_length nor n_sublists is provided, or if both are provided, a ValueError will be raised.
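The splitting behaviour described for split_list can be sketched in a few lines. This is an illustration consistent with the documented examples, not the ProtFlow source: the name split_into_sublists is hypothetical, and the choice to form contiguous, near-equal sublists when n_sublists is given is an assumption.

```python
def split_into_sublists(input_list, element_length=None, n_sublists=None):
    """Split a list by maximum sublist length or by number of sublists (hypothetical sketch)."""
    # Exactly one of the two parameters must be given.
    if (element_length is None) == (n_sublists is None):
        raise ValueError("Specify exactly one of element_length or n_sublists.")

    if element_length is not None:
        return [
            input_list[i:i + element_length]
            for i in range(0, len(input_list), element_length)
        ]

    # Never create more sublists than there are elements.
    n = min(n_sublists, len(input_list))
    if n == 0:
        return []
    base, extra = divmod(len(input_list), n)
    sublists, start = [], 0
    for index in range(n):
        # The first `extra` sublists receive one additional element.
        end = start + base + (1 if index < extra else 0)
        sublists.append(input_list[start:end])
        start = end
    return sublists
```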

protflow.poses module

poses Module

This module provides functionalities for handling and manipulating protein data within the ProtFlow framework. It focuses on managing protein data represented as Pandas DataFrames, allowing for efficient parsing, storage, and manipulation of protein data across various file formats. The module facilitates complex protein study workflows and integrates seamlessly with other components of the ProtFlow package.

Detailed Description

The poses module offers a robust class, Poses, designed to encapsulate the functionality necessary to manage protein data. It supports various operations such as setting up work directories, parsing protein data, and integrating outputs from different computational processes. The module ensures that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

Key Features

  • Parsing Protein Data: Supports reading protein data from various file formats like JSON, CSV, Pickle, Feather, and Parquet.

  • Data Storage and Retrieval: Allows storing and retrieving protein data in multiple formats, facilitating easy data management.

  • Integration with ProtFlow: Seamlessly integrates with ProtFlow’s job management components, enhancing its utility in distributed computing environments.

  • Advanced Data Manipulation: Provides functionalities to merge and prefix data from various sources, making it easier to handle complex datasets.

  • Flexible and Customizable: Users can customize the data handling processes through various parameters, enabling tailored data management solutions.

Usage

To use this module, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Further operations using poses_instance
poses_instance.save_scores('path/to/save/scores')
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')

Examples

Here is an example of how to initialize and use the Poses class for managing protein data:

from protflow.poses import Poses

# Create an instance of the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Perform various operations using the instance
poses_instance.set_work_dir('new/work/dir')
poses_instance.save_scores('path/to/save/scores', out_format='csv')
filtered_poses = poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>')

Further Details

  • Edge Cases: The module handles various edge cases such as empty pose lists and the need to overwrite previous results. It includes robust error handling and logging for easier debugging and verification.

  • Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.

  • Integration: The module integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to manage protein data within their computational workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.poses.Poses(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]

Bases: object

Poses Class

The Poses class within the ProtFlow package is designed for handling protein data, enabling the parsing, storage, and manipulation of protein data represented as Pandas DataFrames. This class facilitates the management of complex protein study workflows and integrates seamlessly with other components of the ProtFlow framework.

Detailed Description

The Poses class encapsulates the functionality necessary for comprehensive management of protein data. It supports various operations, including setting up work directories, parsing protein data from different sources, integrating outputs from different runners, and handling protein data in multiple file formats. This class is essential for users looking to streamline their protein data management within computational workflows.

Key Features

  • Work Directory Setup: Easily sets up and manages work directories for storing intermediate and final results.

  • Data Parsing: Parses protein data from various sources and formats, including JSON, CSV, Pickle, Feather, and Parquet.

  • Data Storage and Retrieval: Stores and retrieves protein data in multiple file formats, ensuring flexibility in data management.

  • Job Management Integration: Integrates with ProtFlow’s job management components, facilitating the handling of protein data in distributed computing environments.

  • Advanced Data Manipulation: Supports operations like merging, prefixing, and duplicating data, providing robust data manipulation capabilities.

  • Filtering and Scoring: Offers methods to filter protein data based on various criteria and calculate composite scores for better data analysis.

  • Pose Handling: Manages protein poses, including loading, saving, and converting between different formats (e.g., PDB to FASTA).

Usage

To use this class, create an instance of the Poses class and utilize its methods to manage protein data. Here is an example demonstrating its usage within a ProtFlow pipeline:

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Set up the work directory
poses_instance.set_work_dir('path/to/new_work_dir')

# Parse and manipulate poses
poses_instance.set_poses(poses=my_protein_data)
poses_instance.save_scores('path/to/save/scores', out_format='csv')

# Filter poses
filtered_poses = poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='filtered_poses')

# Calculate a composite score
poses_instance.calculate_composite_score(name='composite_score', scoreterms=['score1', 'score2'], weights=[0.5, 0.5], plot=True)

Further Details

  • Edge Cases: The class handles various edge cases, such as empty pose lists, the need to overwrite previous results, and handling multiline FASTA inputs.

  • Customizability: Users can customize the data handling process through multiple parameters, including storage formats, pose-specific parameters, and job management settings.

  • Integration: The class integrates seamlessly with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

  • Error Handling: Includes robust error handling and logging for easier debugging and verification of data processing steps.

df

A DataFrame to store protein data.

Type:

pd.DataFrame

work_dir

The working directory for storing data and results.

Type:

str

storage_format

The format for storing protein data (e.g., ‘json’, ‘csv’).

Type:

str

default_jobstarter

The default job starter for managing jobs.

Type:

JobStarter

Notes

This class is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Authors

Markus Braun, Adrian Tripp

Version

0.1.0

__init__(poses=None, work_dir=None, storage_format='json', glob_suffix=None, jobstarter=<protflow.jobstarters.SbatchArrayJobstarter object>)[source]

Initializes the Poses class with optional parameters for poses, working directory, storage format, glob suffix, and job starter.

Parameters:
  • poses (list, optional) – A list of paths to the protein data files to be managed. If not provided, an empty DataFrame is initialized.

  • work_dir (str, optional) – The working directory where intermediate and final results will be stored. If not provided, the current directory is used.

  • storage_format (str, optional) – The format used for storing protein data (default is ‘json’). Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.

  • glob_suffix (str, optional) – A suffix used for globbing multiple files. This allows for batch processing of files matching the given pattern.

  • jobstarter (JobStarter, optional) – An instance of the JobStarter class used to manage job submissions. The default is an instance of SbatchArrayJobstarter from the jobstarters module.

df

A DataFrame to store protein data.

Type:

pd.DataFrame

work_dir

The working directory for storing data and results.

Type:

str

storage_format

The format for storing protein data.

Type:

str

default_jobstarter

The default job starter for managing jobs.

Type:

JobStarter

Notes

This method initializes the Poses class and sets up various attributes required for managing protein data. It prepares the environment for subsequent data manipulation and analysis operations.

Example

from protflow.poses import Poses

# Initialize the Poses class with protein data and a working directory
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

calculate_composite_score(name, scoreterms, weights, plot=False, scale_output=False)[source]

Calculates a composite score from specified score columns, applying weights and normalization, and optionally generates a plot.

Parameters:
  • name (str) – The name of the new composite score column to be created.

  • scoreterms (list[str]) – The list of score columns to be included in the composite score.

  • weights (list[float]) – The list of weights corresponding to each score column.

  • plot (bool, optional) – If True, generates a plot of the composite score and the individual score terms (default is False).

  • scale_output (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).

Returns:

The updated Poses instance with the new composite score column.

Return type:

Poses

Raises:
  • ValueError – If the number of scoreterms and weights do not match.

  • TypeError – If any score column contains non-numeric values.

Further Details

This method calculates a composite score from multiple score columns by applying the specified weights and normalizing the columns. The normalization process involves subtracting the median and dividing by the standard deviation for each score column. Optionally, the composite score can be scaled to a range between 0 and 1.

The method ensures that each score column contains numeric values and applies the normalization process as follows:

  1. Calculate the median and standard deviation of each score column.

  2. Normalize the column by subtracting the median and dividing by the standard deviation.

  3. Optionally scale the normalized values to a range between 0 and 1.
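The normalization and weighting described for the composite score can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ProtFlow implementation: the function name composite_score is hypothetical, columns are plain lists rather than DataFrame columns, the sample standard deviation is used, the weighted normalized terms are summed without dividing by the total weight, and the optional scaling is min-max scaling. Constant columns (zero standard deviation) are not handled.

```python
import statistics


def composite_score(columns, scoreterms, weights, scale_output=False):
    """Weighted sum of median-centred, std-scaled score columns (hypothetical sketch)."""
    if len(scoreterms) != len(weights):
        raise ValueError("scoreterms and weights must have the same length.")

    n_rows = len(columns[scoreterms[0]])
    composite = [0.0] * n_rows
    for term, weight in zip(scoreterms, weights):
        values = columns[term]
        median = statistics.median(values)
        stdev = statistics.stdev(values)  # sample standard deviation (assumption)
        for row, value in enumerate(values):
            # Normalise the value, then add its weighted contribution.
            composite[row] += weight * (value - median) / stdev

    if scale_output:
        # Assumption: min-max scaling maps the composite score onto [0, 1].
        low, high = min(composite), max(composite)
        composite = [(value - low) / (high - low) for value in composite]
    return composite
```

Because each column is centred and scaled before weighting, score terms with very different numeric ranges contribute comparably to the composite score.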

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate a composite score
poses_instance.calculate_composite_score(
    name='composite_score',
    scoreterms=['score1', 'score2'],
    weights=[0.5, 0.5],
    plot=True,
    scale_output=True
)

Notes

  • The method ensures that the number of scoreterms and weights match.

  • Normalization helps in making the scores comparable by removing scale differences.

  • Generates a violin plot if the plot parameter is set to True, showing the distribution of the composite score and individual score terms.

calculate_max_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the maximum value of the selected score column. If remove_layers is set, calculates the maximum value over poses grouped by the description column with the set number of index layers removed.

Parameters:
  • name (str) – The name of the new column where the maximum values will be stored.

  • score_col (str) – The name of the column from which to calculate the maximum value.

  • skipna (bool, optional) – Whether to skip NA/null values. Default is False.

  • remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.

  • sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the maximum values added to the DataFrame.

Return type:

Poses

Raises:
  • TypeError – If remove_layers is not an integer.

  • ValueError – If score_col does not exist in the DataFrame.
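The grouping controlled by remove_layers can be sketched as follows. This illustration assumes that index layers are trailing, separator-delimited suffixes of the description (for example 'design_0001_0002') and that every pose receives the value of its group; the function name grouped_max and the use of plain lists instead of a DataFrame are hypothetical simplifications.

```python
from collections import defaultdict


def grouped_max(descriptions, scores, remove_layers=None, sep="_"):
    """Return, for each pose, the maximum score within its group (hypothetical sketch)."""
    if remove_layers is None:
        # No grouping: every pose receives the overall maximum.
        overall = max(scores)
        return [overall] * len(scores)
    if not isinstance(remove_layers, int):
        raise TypeError("remove_layers must be an integer.")

    # Strip trailing layers: 'design_0001_0002' -> 'design_0001' for remove_layers=1.
    keys = [desc.rsplit(sep, remove_layers)[0] for desc in descriptions]
    best = defaultdict(lambda: float("-inf"))
    for key, score in zip(keys, scores):
        best[key] = max(best[key], score)
    return [best[key] for key in keys]
```

The same grouping key would apply to the mean, median, minimum, and standard deviation variants; only the aggregation step changes.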

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the maximum values
poses_instance.calculate_max_score(
    name='max_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_mean_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the mean score of the selected score column. If remove_layers is set, calculates mean scores over poses grouped by the description column with the set number of index layers removed.

Parameters:
  • name (str) – The name of the new column where the mean scores will be stored.

  • score_col (str) – The name of the column from which to calculate the mean scores.

  • skipna (bool, optional) – Whether to skip NA/null values. Default is False.

  • remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.

  • sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the mean scores added to the DataFrame.

Return type:

Poses

Raises:
  • TypeError – If remove_layers is not an integer.

  • ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the mean score
poses_instance.calculate_mean_score(
    name='mean_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_median_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the median score of the selected score column. If remove_layers is set, calculates median scores over poses grouped by the description column with the set number of index layers removed.

Parameters:
  • name (str) – The name of the new column where the median scores will be stored.

  • score_col (str) – The name of the column from which to calculate the median scores.

  • skipna (bool, optional) – Whether to skip NA/null values. Default is False.

  • remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.

  • sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the median scores added to the DataFrame.

Return type:

Poses

Raises:
  • TypeError – If remove_layers is not an integer.

  • ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the median score
poses_instance.calculate_median_score(
    name='median_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_min_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the minimum value of the selected score column. If remove_layers is set, calculates the minimum value over poses grouped by the description column with the set number of index layers removed.

Parameters:
  • name (str) – The name of the new column where the minimum values will be stored.

  • score_col (str) – The name of the column from which to calculate the minimum value.

  • skipna (bool, optional) – Whether to skip NA/null values. Default is False.

  • remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.

  • sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the minimum values added to the DataFrame.

Return type:

Poses

Raises:
  • TypeError – If remove_layers is not an integer.

  • ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the minimum values
poses_instance.calculate_min_score(
    name='min_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

calculate_std_score(name, score_col, skipna=False, remove_layers=None, sep='_')[source]

Calculate the standard deviation of the selected score column. If remove_layers is set, calculates standard deviations over poses grouped by the description column with the set number of index layers removed.

Parameters:
  • name (str) – The name of the new column where the standard deviations will be stored.

  • score_col (str) – The name of the column from which to calculate the standard deviation.

  • skipna (bool, optional) – Whether to skip NA/null values. Default is False.

  • remove_layers (int, optional) – The number of layers to remove from the index for grouping. If None, no layers are removed. Default is None.

  • sep (str, optional) – The separator used in the ‘poses_description’ column for splitting and joining layers. Default is “_”.

Returns:

The instance of the class with the standard deviations added to the DataFrame.

Return type:

Poses

Raises:
  • TypeError – If remove_layers is not an integer.

  • ValueError – If score_col does not exist in the DataFrame.

Example

from protflow.poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Calculate the standard deviation
poses_instance.calculate_std_score(
    name='std_score1',
    score_col='score1',
    skipna=True,
    remove_layers=1,
)

change_poses_dir(poses_dir, copy=False, overwrite=False)[source]

Changes the directory of the stored poses, with options to copy or overwrite existing poses.

Parameters:
  • poses_dir (str) – The new directory where the poses will be located.

  • copy (bool, optional) – If True, the poses will be copied to the new directory (default is False).

  • overwrite (bool, optional) – If True, existing files in the new directory will be overwritten (default is False).

Returns:

The updated Poses instance with poses located in the new directory.

Return type:

Poses

Further Details

This method updates the paths of the stored poses to a new directory. If the copy parameter is set to True, the poses are copied to the new directory. The overwrite parameter controls whether existing files in the new directory are overwritten.

Example

from protflow.poses import Poses

# Initialize the Poses class
poses_instance = Poses(poses=my_protein_data, work_dir='path/to/work_dir')

# Change the directory of the poses
poses_instance.change_poses_dir('path/to/new_poses_dir', copy=True, overwrite=True)

Notes

  • If copy is set to False, the method only updates the paths in the DataFrame without moving the files.

  • Raises a ValueError if the new directory does not exist or if the poses do not exist in the specified directory (when copy is False).

  • Ensures the integrity of the poses by verifying their existence in the new directory.
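The path-update logic described in these notes can be sketched on a plain list of file paths. This is a hypothetical illustration, not ProtFlow's code: the function name relocate_paths is made up, and creating the target directory when copy is True is an assumption (the documented ValueError for a missing directory is applied only when copy is False).

```python
import os
import shutil


def relocate_paths(paths, new_dir, copy=False, overwrite=False):
    """Point file paths at `new_dir`, optionally copying the files (hypothetical sketch)."""
    if copy:
        # Assumption: the target directory is created when files are copied.
        os.makedirs(new_dir, exist_ok=True)
    elif not os.path.isdir(new_dir):
        raise ValueError(f"Directory does not exist: {new_dir}")

    new_paths = [os.path.join(new_dir, os.path.basename(path)) for path in paths]
    for old, new in zip(paths, new_paths):
        if copy:
            # Copy unless the target already exists and overwriting is disabled.
            if overwrite or not os.path.isfile(new):
                shutil.copy(old, new)
        elif not os.path.isfile(new):
            # Without copying, the files must already be present in new_dir.
            raise ValueError(f"Pose not found in new directory: {new}")
    return new_paths
```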

check_poses_df_integrity(df)[source]

Checks the integrity of the poses DataFrame, ensuring it contains necessary columns.

Parameters:

df (pd.DataFrame) – The DataFrame to be checked for integrity.

Returns:

The validated poses DataFrame.

Return type:

pd.DataFrame

Raises:

KeyError – If the DataFrame does not contain the mandatory columns ‘input_poses’, ‘poses’, and ‘poses_description’.

Further Details

This method verifies that the poses DataFrame contains the necessary columns required for proper functioning. It ensures that the DataFrame has ‘input_poses’, ‘poses’, and ‘poses_description’ columns, which are essential for various operations.

Example

from poses import Poses
import pandas as pd

# Initialize the Poses class
poses_instance = Poses()

# Create a sample DataFrame
sample_df = pd.DataFrame({
    'input_poses': ['path/to/pose1.pdb'],
    'poses': ['path/to/pose1.pdb'],
    'poses_description': ['pose1']
})

# Check the integrity of the DataFrame
validated_df = poses_instance.check_poses_df_integrity(sample_df)

Notes

  • The method raises a KeyError if any of the mandatory columns are missing.

  • Ensures that the DataFrame is properly structured for further data manipulation and analysis.
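
The column check itself is small. The sketch below works on any iterable of column names (for a DataFrame, df.columns); the helper names are hypothetical and only illustrate the rule stated above.

```python
MANDATORY_COLS = ("input_poses", "poses", "poses_description")


def missing_mandatory_columns(columns):
    """Return the mandatory column names that are absent from columns."""
    present = set(columns)
    return [col for col in MANDATORY_COLS if col not in present]


def check_columns(columns):
    """Raise KeyError if a mandatory column is missing, else return the columns."""
    missing = missing_mandatory_columns(columns)
    if missing:
        raise KeyError(f"Missing mandatory columns: {missing}")
    return list(columns)
```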

check_prefix(prefix)[source]

Checks if the given prefix is already used in the poses DataFrame.

Parameters:

prefix (str) – The prefix to be checked in the poses DataFrame.

Raises:

KeyError – If the prefix is already used in the poses DataFrame.

Return type:

None

Further Details

This method verifies whether the specified prefix is already in use within the poses DataFrame. It is useful for ensuring that new prefixes do not conflict with existing ones, maintaining data integrity.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Check if a prefix is already used
poses_instance.check_prefix('new_prefix')

Notes

  • The method raises a KeyError if the prefix is found in the DataFrame, indicating a conflict.

  • Ensures that new prefixes are unique and can be safely used for new columns or attributes.

convert_pdb_to_fasta(prefix, update_poses=False, chain_sep=':')[source]

Converts PDB pose files to FASTA format and optionally updates the poses. Paths to the FASTA files are saved in the poses DataFrame under the column <prefix>_fasta_location.

Parameters:
  • prefix (str) – The prefix used for naming the output FASTA files.

  • update_poses (bool, optional) – If True, updates the poses DataFrame to use the new FASTA files (default is False).

  • chain_sep (str, optional) – The separator used for chain identifiers in the FASTA file (default is “:”).

Raises:

RuntimeError – If the poses are not of type PDB.

Return type:

None

Further Details

This method converts PDB pose files to FASTA format and stores them in a directory named with the given prefix. It can also update the poses DataFrame to use the new FASTA files if specified.

Example

from poses import Poses

# Initialize the Poses class with some PDB poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Convert the PDB files to FASTA format
poses_instance.convert_pdb_to_fasta(prefix='converted', update_poses=True)

Notes

  • The method checks that the poses are of type PDB before conversion.

  • Creates a new directory within the working directory to store the FASTA files.

  • Logs the conversion process and verifies the creation of FASTA files.
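
To illustrate the role of chain_sep, the following sketch joins per-chain sequences into a single FASTA record. format_fasta_record is a hypothetical helper, and the exact header layout ProtFlow writes is an assumption.

```python
def format_fasta_record(description, chain_sequences, chain_sep=":"):
    """Build one FASTA record with all chains on a single sequence line."""
    header = f">{description}"
    # Chains are concatenated with the separator, e.g. "MKT:GSH".
    sequence = chain_sep.join(chain_sequences)
    return f"{header}\n{sequence}\n"


record = format_fasta_record("pose1", ["MKT", "GSH"])
```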

convert_resselection_cols(resselection_col='import_resselection_cols')[source]

Converts per-row residue selection descriptors into ResidueSelection objects for the columns listed in a list-like selector column, mutating the DataFrame in place.

Parameters:

resselection_col (str, optional) – Name of the column that, for each row, contains a list/tuple of target column names to convert (default is import_resselection_cols). When reading from CSV, this field may be a stringified list (e.g., ['a','b']), which will be parsed automatically.

Returns:

This method modifies self.df in place and returns None. If resselection_col is not present in self.df, the method exits early.

Return type:

None

Raises:
  • KeyError – If a row’s value in resselection_col exists but is not a list or tuple (after optional string-to-list parsing).

  • ValueError – If parsing a stringified list with ast.literal_eval fails due to an invalid literal.

  • SyntaxError – If parsing a malformed stringified list triggers a syntax error.

  • TypeError – If constructing a ResidueSelection from a cell value raises a type error.

Further Details

For each row, the method reads the list of target column names from resselection_col and attempts to convert the corresponding cells:

  • If a target column listed for a row does not exist in self.df, a warning is logged and that column is skipped for the row.

  • If the target cell is already a ResidueSelection instance, it is left unchanged.

  • If the target cell is a str, it is converted via ResidueSelection(value) (useful for CSV imports).

  • If the target cell is a dict, it is converted via ResidueSelection(value, from_scorefile=True) (useful for JSON imports).

  • Empty selector lists are allowed and simply result in no action for that row.

  • Cells that are falsy (e.g., None, empty string, empty dict) are skipped.

Example

import pandas as pd
from protflow.poses import Poses

# Sample DataFrame where each row specifies which columns to convert
df = pd.DataFrame({
    "import_resselection_cols": [
        ["fixed_residues", "motif_residues"],  # row 0: convert two columns
        "['motif_residues']",                  # row 1: stringified list (from CSV)
        []                                     # row 2: nothing to convert
    ],
    "fixed_residues": [
        "A12,A34,A56",        # str -> ResidueSelection(str)
        None,                 # skipped
        "A1"
    ],
    "motif_residues": [
        {"residues":[["A",164],["A",165],["A",166],["A",167]]},  # dict -> ResidueSelection(dict, from_scorefile=True)
        "B5-B9",                           # str -> ResidueSelection(str)
        {}
    ]
})

poses = Poses(df)
poses.convert_resselection_cols()  # mutates poses.df in place

# After this call:
# - df.loc[0, "fixed_residues"] is a ResidueSelection instance
# - df.loc[0, "motif_residues"] is a ResidueSelection instance (from dict)
# - df.loc[1, "motif_residues"] is a ResidueSelection instance
# - Row 2 remains unchanged due to empty selector and falsy cells

Notes

  • Missing target columns are not fatal; a warning is logged and processing continues.

  • When importing from CSV, stringified lists in resselection_col are parsed with ast.literal_eval; malformed strings will raise ValueError or SyntaxError.

  • ResidueSelection construction is delegated; any errors it raises will propagate.
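
The selector parsing described above can be sketched as follows. parse_selector is a hypothetical helper; it mirrors the documented behaviour (stringified lists go through ast.literal_eval, anything that is not a list or tuple raises KeyError) without touching ResidueSelection.

```python
import ast


def parse_selector(value):
    """Return the list of target column names stored in a selector cell."""
    if isinstance(value, str):
        # CSV imports store the list as text, e.g. "['a', 'b']".
        # Malformed text raises ValueError or SyntaxError here.
        value = ast.literal_eval(value)
    if not isinstance(value, (list, tuple)):
        raise KeyError(f"Selector must be a list or tuple, got {type(value).__name__}")
    return list(value)
```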

determine_pose_type(pose_col=None)[source]

Determines the file types of the poses based on their extensions.

Parameters:

pose_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).

Returns:

A list of unique file extensions found in the pose file paths.

Return type:

list

Further Details

This method extracts and identifies the file extensions of the pose file paths in the specified column. It returns a list of unique file extensions, which helps in understanding the types of files being managed.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Determine the pose file types
pose_types = poses_instance.determine_pose_type()

Notes

  • The method logs a warning if multiple file extensions are found.

  • If no file extensions are found, it logs a warning indicating the inability to determine file types.

  • Ensures that the returned list contains only unique file extensions.
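
Extension detection of this kind can be sketched with os.path.splitext. unique_extensions is a hypothetical helper; whether ProtFlow keeps the leading dot is an assumption.

```python
import os


def unique_extensions(paths):
    """Return the distinct file extensions in paths, in order of first appearance."""
    seen = []
    for path in paths:
        ext = os.path.splitext(path)[1]  # e.g. ".pdb"; empty if there is none
        if ext and ext not in seen:
            seen.append(ext)
    return seen
```

More than one entry in the result corresponds to the mixed-type case that triggers the warning mentioned above.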

duplicate_poses(output_dir, n_duplicates, overwrite=False)[source]

Duplicates poses a specified number of times and saves them to an output directory.

Parameters:
  • output_dir (str) – The directory where the duplicated poses will be saved.

  • n_duplicates (int) – The number of duplicates to create for each pose.

  • overwrite (bool)

Return type:

None

Further Details

This method creates multiple copies of each pose file and saves them to the specified output directory. The duplicated files are named with an incremented index to distinguish them.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Duplicate the poses
poses_instance.duplicate_poses(output_dir='path/to/duplicates', n_duplicates=3)

Notes

  • The method creates the output directory if it does not exist.

  • Ensures that the duplicated files have unique names by appending an index.

  • Logs the duplication process and verifies the creation of duplicate files.
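
The naming of duplicates can be sketched as below. duplicate_names is a hypothetical helper, and the zero-padded "_0001" suffix format is an assumption, not a confirmed detail of duplicate_poses.

```python
import os


def duplicate_names(pose_path, n_duplicates, output_dir):
    """Return target paths for n_duplicates indexed copies of one pose file."""
    stem, ext = os.path.splitext(os.path.basename(pose_path))
    # Assumed suffix format: underscore plus a four-digit, one-based index.
    return [
        os.path.join(output_dir, f"{stem}_{index:04d}{ext}")
        for index in range(1, n_duplicates + 1)
    ]


names = [os.path.basename(p) for p in duplicate_names("path/to/pose1.pdb", 3, "dups")]
```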

filter_poses_by_rank(n, score_col, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None)[source]

Filters poses based on their rank in a specified score column, with options to handle layers and generate plots.

Parameters:
  • n (float) – The number of top-ranked poses to keep. If n < 1, it represents a fraction of the total poses.

  • score_col (str) – The column in the DataFrame containing the scores used for ranking.

  • group_col (str, optional) – Group dataframe by this column and filter individual groups.

  • remove_layers (int, optional) – The number of layers to remove from the pose descriptions before ranking. This helps in grouping similar poses.

  • layer_col (str, optional) – The column used for layer-based grouping of poses (default is “poses_description”).

  • sep (str, optional) – The separator used in the layer descriptions (default is “_”).

  • ascending (bool, optional) – If True, ranks poses in ascending order of scores; otherwise, in descending order (default is True).

  • prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.

  • plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).

  • plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.

  • overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).

  • storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).

Returns:

The updated Poses instance with filtered poses.

Return type:

Poses

Further Details

This method filters the poses DataFrame to retain only the top-ranked poses based on their scores. It supports fractional ranking, layer-based grouping, and optional plot generation for visualizing the filtering process. The filtered poses can be saved to a file with a specified prefix and storage format.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Filter poses by rank
poses_instance.filter_poses_by_rank(n=10, score_col='score', prefix='top_poses', plot=True)

Notes

  • The method creates a filtered poses file and an optional plot in the specified working directory.

  • Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.

  • Logs the filtering process, including any errors or warnings related to the ranking criteria.
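
How n can act as either a count or a fraction is sketched below on a plain list of row dictionaries. keep_top is a hypothetical helper, and rounding the fraction up is an assumption about behaviour this page does not specify.

```python
import math


def keep_top(rows, n, score_col, ascending=True):
    """Keep the n best rows; n < 1 is read as a fraction of all rows."""
    if n < 1:
        n_keep = math.ceil(len(rows) * n)  # assumption: round the fraction up
    else:
        n_keep = int(n)
    # ascending=True ranks the lowest score first, as in the method default.
    ranked = sorted(rows, key=lambda row: row[score_col], reverse=not ascending)
    return ranked[:n_keep]


rows = [
    {"poses_description": "a", "score": 3.0},
    {"poses_description": "b", "score": 1.0},
    {"poses_description": "c", "score": 2.0},
    {"poses_description": "d", "score": 4.0},
]
best_two = [row["poses_description"] for row in keep_top(rows, 2, "score")]
top_half_desc = [row["score"] for row in keep_top(rows, 0.5, "score", ascending=False)]
```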

filter_poses_by_value(score_col, value, operator, prefix=None, plot=False, plot_cols=None, overwrite=True, storage_format=None, fail_on_empty=True)[source]

Filters poses based on a specified value in a score column, with options to generate plots.

Parameters:
  • score_col (str) – The column in the DataFrame containing the scores used for filtering.

  • value (float or int) – The value used as the threshold for filtering poses.

  • operator (str) – The comparison operator used for filtering (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’, ‘!=’).

  • prefix (str, optional) – The prefix used for naming the output filtered poses file and plot.

  • plot (bool, optional) – If True, generates a plot comparing scores before and after filtering (default is False).

  • plot_cols (list[str], optional) – Add additional plotting data to the output filtering plot.

  • overwrite (bool, optional) – If True, overwrites existing filtered poses files (default is True).

  • storage_format (str, optional) – The format used for storing the filtered poses (default is None, which uses the existing storage format).

  • fail_on_empty (bool)

Returns:

The updated Poses instance with filtered poses.

Return type:

Poses

Raises:

ValueError – If all poses are removed based on the filtering criteria.

Further Details

This method filters the poses DataFrame based on a specified value in a score column, using the provided comparison operator. It supports optional plot generation for visualizing the filtering process and allows saving the filtered poses to a file with a specified prefix and storage format.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Filter poses by value
poses_instance.filter_poses_by_value(score_col='score', value=0.5, operator='>', prefix='filtered_poses', plot=True)

Notes

  • The method creates a filtered poses file and an optional plot in the specified working directory.

  • Ensures that the DataFrame is properly filtered based on the provided criteria.

  • Logs the filtering process, including any errors or warnings related to the filtering criteria.

  • Raises a ValueError if the filtering criteria remove all poses, ensuring that the Poses instance retains valid data.
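
The operator strings listed above map naturally onto the standard operator module. The sketch below is a hypothetical filter_rows helper over row dictionaries; raising KeyError for an unknown operator is an assumption.

```python
import operator as op

# Operator strings as documented for filter_poses_by_value.
COMPARATORS = {
    ">": op.gt,
    ">=": op.ge,
    "<": op.lt,
    "<=": op.le,
    "=": op.eq,
    "!=": op.ne,
}


def filter_rows(rows, score_col, value, operator):
    """Keep rows whose score satisfies `score <operator> value`."""
    if operator not in COMPARATORS:
        raise KeyError(f"Unsupported operator: {operator}")  # assumed error type
    compare = COMPARATORS[operator]
    kept = [row for row in rows if compare(row[score_col], value)]
    if not kept:
        # Mirrors the documented ValueError when nothing survives the filter.
        raise ValueError("Filtering removed all poses.")
    return kept


scores = [{"score": 0.2}, {"score": 0.5}, {"score": 0.9}]
above_half = filter_rows(scores, "score", 0.5, ">")
```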

get_pose(pose_description, all_models=False)[source]

Retrieves a pose structure based on its description.

Parameters:
  • pose_description (str) – The description of the pose to be retrieved.

  • all_models (bool, optional) – Whether to return all models in the input PDB (True) or only the first model (False). If False, a Bio.PDB Model is returned; if True, a Bio.PDB Structure is returned (default is False).

Returns:

The Bio.PDB Model or Structure object corresponding to the specified pose description.

Return type:

Bio.PDB.Model.Model or Bio.PDB.Structure.Structure

Raises:

KeyError – If the pose description is not found in the poses DataFrame.

Further Details

This method locates the pose file based on its description and loads it with Bio.PDB, returning either the first Model or the full Structure depending on all_models. It is useful for accessing specific pose structures for further analysis or manipulation.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Retrieve a specific pose structure
pose_structure = poses_instance.get_pose('pose1')

Notes

  • The method uses the ‘poses_description’ column to locate the specified pose.

  • Ensures that the returned pose is loaded as a Bio.PDB Model or Structure object for further processing.

load_poses(poses_path)[source]

Loads poses from a specified file and updates the Poses instance.

Parameters:

poses_path (str) – The path to the file containing the poses to be loaded.

Returns:

The updated Poses instance with poses loaded from the specified file.

Return type:

Poses

Further Details

This method reads a file containing poses and updates the Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Load poses from a file
poses_instance.load_poses('path/to/poses.json')

Notes

  • The method supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.

  • Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.

parse_descriptions(poses=None)[source]

Parses descriptions from the provided pose file paths.

Parameters:

poses (list, optional) – A list of pose file paths from which descriptions are extracted.

Returns:

A list of descriptions parsed from the pose file paths.

Return type:

list

Further Details

This method extracts descriptions from the provided list of pose file paths. Descriptions are derived from the file names by stripping the directory path and file extension.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Parse descriptions from pose file paths
descriptions = poses_instance.parse_descriptions(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

Notes

  • This method is useful for generating a list of concise descriptions based on file names.

  • Ensures that descriptions are derived in a consistent format, suitable for use in data management and analysis.
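
Stripping the directory and the extension, as described above, amounts to two os.path calls. describe is a hypothetical helper shown only to make the rule concrete.

```python
import os


def describe(path):
    """Return the file name without its directory and extension."""
    return os.path.splitext(os.path.basename(path))[0]


descriptions = [describe(p) for p in ["path/to/pose1.pdb", "path/to/pose2.pdb"]]
```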

parse_poses(poses=None, glob_suffix=None)[source]

Parses the input poses, which can be provided as a list or a directory with a glob suffix.

Parameters:
  • poses (Union[list, str], optional) – A list of file paths or a directory containing the protein data files. If not provided, an empty list is returned.

  • glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.

Returns:

A list of parsed pose file paths.

Return type:

list

Further Details

This method handles various input types for parsing poses. It can parse a list of file paths directly or glob files in a specified directory using a suffix. The method ensures that all specified files exist and raises appropriate errors if they do not.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Parse poses from a directory with a glob suffix
parsed_poses = poses_instance.parse_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')

Notes

  • Raises FileNotFoundError if any specified files do not exist.

  • Supports both single file and multiple file (via globbing) inputs.

  • Ensures that the returned list contains valid file paths.
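
The input handling described above can be sketched with glob. collect_poses is a hypothetical helper; sorting the globbed paths is an assumption made here to keep the result deterministic.

```python
import glob
import os
import tempfile


def collect_poses(poses=None, glob_suffix=None):
    """Resolve a list, a single path, or a directory plus glob suffix to file paths."""
    if poses is None:
        return []
    if isinstance(poses, str):
        if glob_suffix is not None:
            # Directory input: expand the suffix pattern inside it.
            paths = sorted(glob.glob(os.path.join(poses, glob_suffix)))
        else:
            paths = [poses]
    else:
        paths = list(poses)
    missing = [path for path in paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(f"Pose files not found: {missing}")
    return paths


# Demo: a temporary directory stands in for a real pose directory.
with tempfile.TemporaryDirectory() as pose_dir:
    for name in ("pose2.pdb", "pose1.pdb", "notes.txt"):
        with open(os.path.join(pose_dir, name), "w", encoding="utf-8") as handle:
            handle.write("END\n")
    found_names = [os.path.basename(p) for p in collect_poses(pose_dir, "*.pdb")]
```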

poses_list()[source]

Returns a list of pose file paths from the DataFrame.

Returns:

A list of pose file paths.

Return type:

list[str]

Further Details

This method extracts the pose file paths from the 'poses' column of the DataFrame and returns them as a list. It provides a convenient way to access the stored pose file paths.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Get the list of pose file paths
pose_paths = poses_instance.poses_list()

Notes

  • The method assumes that the ‘poses’ column exists in the DataFrame.

  • Provides a simple way to retrieve all pose file paths managed by the Poses instance.

reindex_poses(prefix, group_col=None, remove_layers=None, force_reindex=False, sep='_', overwrite=False)[source]

Removes index layers from poses. Saves reindexed poses to an output directory.

Parameters:
  • prefix (str) – The directory where the reindexed poses will be saved and the prefix for the DataFrame columns containing the original paths and descriptions.

  • group_col (str, optional) – The poses DataFrame column on which to group to create new descriptions. Must be a column in ‘poses_description’ or ‘poses’ format (e.g. from a previous state, before runners appended index layers).

  • remove_layers (int, optional) – The number of index layers to remove.

  • force_reindex (bool, optional) – Add a new index layer to all poses.

  • sep (str, optional) – The separator used to split the description column into layers.

  • overwrite (bool)

Return type:

None

Further Details

This method removes index layers from poses (_0001, _0002, etc). If a group column is provided, the poses are assigned names according to the group. If remove_layers is above 0, the method subtracts the set number of layers from the description column and groups the poses accordingly. If force_reindex is True, it adds one index layer to all poses.

Notes

  • The method creates the output directory if it does not exist.

  • Raises a KeyError if both group_col and remove_layers are set.

  • Raises a RuntimeError if multiple poses with identical description after index layer removal are found and force_reindex is False.
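
Layer removal on descriptions can be sketched as plain string handling. Both helpers below are hypothetical, and the four-digit format of the added index layer is an assumption based on the "_0001" convention mentioned for this method.

```python
def remove_index_layers(description, remove_layers, sep="_"):
    """Drop the last remove_layers separator-delimited layers."""
    if remove_layers <= 0:
        return description
    return sep.join(description.split(sep)[:-remove_layers])


def reindex_descriptions(descriptions, remove_layers, force_reindex=False, sep="_"):
    """Strip index layers; optionally append a fresh per-group index layer."""
    bases = [remove_index_layers(d, remove_layers, sep) for d in descriptions]
    if not force_reindex:
        if len(set(bases)) != len(bases):
            raise RuntimeError("Identical descriptions after layer removal.")
        return bases
    counts = {}
    result = []
    for base in bases:
        # Number the poses within each group of identical base descriptions.
        counts[base] = counts.get(base, 0) + 1
        result.append(f"{base}{sep}{counts[base]:04d}")
    return result


regrouped = reindex_descriptions(
    ["a_0001_0001", "a_0001_0002", "b_0001_0001"], remove_layers=2, force_reindex=True
)
```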

reset_poses(new_poses_col='input_poses', force_reset_df=False)[source]

Resets the poses DataFrame to the original input poses, with an option to force reset.

Parameters:
  • new_poses_col (str, optional) – The column in the DataFrame containing the new pose file paths (default is ‘input_poses’).

  • force_reset_df (bool, optional) – If True, forces a reset of the DataFrame even if the number of new poses does not match the original (default is False).

Further Details

This method resets the poses DataFrame to use the original input poses. It handles multiline FASTA inputs and ensures that the DataFrame structure is preserved or reset based on the force_reset_df parameter.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Reset the poses to the original input poses
poses_instance.reset_poses()

Notes

  • The method ensures that the new poses are unique and properly formatted.

  • Raises a RuntimeError if the number of new poses does not match the original and force_reset_df is False.

  • Logs warnings and information about the reset process, ensuring data integrity.

save_poses(out_path, poses_col='poses', overwrite=True)[source]

Saves the poses to a specified directory, with an option to overwrite existing files.

Parameters:
  • out_path (str) – The directory where the poses will be saved.

  • poses_col (str, optional) – The column in the DataFrame containing the pose file paths (default is ‘poses’).

  • overwrite (bool, optional) – If True, existing files in the target directory will be overwritten (default is True).

Return type:

None

Further Details

This method saves the pose files to the specified directory. It copies the pose files from their current locations to the new directory, ensuring that the directory structure is maintained. The overwrite parameter controls whether existing files in the target directory are overwritten.

Example

from poses import Poses

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Save poses to a new directory
poses_instance.save_poses(out_path='path/to/new_poses_dir', overwrite=False)

Notes

  • The method ensures that the target directory exists, creating it if necessary.

  • If overwrite is set to False, the method skips saving poses that already exist in the target directory.

  • Logs the saving process, including any skipped files due to the overwrite setting.

save_scores(out_path=None, out_format=None)[source]

Saves the scores DataFrame to a specified file path in the desired format.

Parameters:
  • out_path (str, optional) – The file path where the scores will be saved. If not provided, the default scorefile path is used.

  • out_format (str, optional) – The format in which to save the scores. If not provided, the default storage format is used.

Return type:

None

Further Details

This method saves the scores DataFrame to the specified file path in the desired format. It ensures that the file name conforms to the specified format by appending the correct file extension if necessary.

Example

from poses import Poses

# Initialize the Poses class with some scores
poses_instance = Poses()

# Save scores to a specific path in CSV format
poses_instance.save_scores(out_path='path/to/scores.csv', out_format='csv')

Notes

  • Supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.

  • The method automatically appends the correct file extension if it is not already present in the out_path.

  • Ensures that the scores are saved in a format suitable for further analysis and processing.
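
The extension handling can be sketched as below. with_extension is a hypothetical helper, and it assumes the extension is simply the format name; the actual mapping used by save_scores may differ.

```python
def with_extension(out_path, out_format):
    """Append '.<out_format>' to out_path unless it already ends with it."""
    extension = f".{out_format}"  # assumption: extension equals the format name
    if out_path.endswith(extension):
        return out_path
    return out_path + extension
```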

set_jobstarter(jobstarter)[source]

Configures the job starter for managing job submissions.

Parameters:

jobstarter (JobStarter) – An instance of the JobStarter class used to manage job submissions.

Return type:

None

Further Details

This method sets the job starter for the Poses class, which is used to manage job submissions in distributed computing environments. It allows the user to specify a custom job starter for handling computational tasks.

Example

from poses import Poses
from protflow.jobstarters import SbatchArrayJobstarter

# Initialize the Poses class
poses_instance = Poses()

# Set a job starter (any JobStarter subclass works; a SLURM array jobstarter is used here)
jobstarter = SbatchArrayJobstarter(max_cores=50, remove_cmdfile=True)
poses_instance.set_jobstarter(jobstarter)

Notes

  • The job starter must be an instance of the JobStarter class or a subclass thereof.

  • This method enables customization of job management to suit specific computational workflows.

set_logger()[source]

Configures the logger for the Poses class.

Further Details

This method sets up the logging configuration for the Poses class. It creates a logger that outputs log messages to both the console and a log file in the working directory (if set). This aids in debugging and tracking the progress of data processing operations.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses(work_dir='path/to/work_dir')

# Set up the logger
poses_instance.set_logger()

Notes

  • The log file is named after the working directory and stored within it.

  • The logging level is set to INFO, and log messages include timestamps, logger names, log levels, and messages.

Return type:

None

set_motif(motif_col)[source]

Sets a motif column in the poses DataFrame for further analysis.

Parameters:

motif_col (str) – The column in the DataFrame containing the motifs to be set.

Raises:
  • KeyError – If the specified motif column is not found in the poses DataFrame.

  • TypeError – If the objects in the specified motif column are not of type ResidueSelection.

Return type:

None

Further Details

This method sets a column in the poses DataFrame to be used as motifs for further analysis. The motifs must be instances of the ResidueSelection class.

Example

from poses import Poses
from protflow.residues import ResidueSelection

# Initialize the Poses class with some poses
poses_instance = Poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

# Assume we have a column 'motifs' with ResidueSelection objects
poses_instance.set_motif('motifs')

Notes

  • The method ensures that the specified column exists and contains ResidueSelection objects.

  • Logs any errors encountered during the process for easier debugging and verification.

set_poses(poses=None, glob_suffix=None)[source]

Sets the poses for the Poses instance, parsing the input if necessary.

Parameters:
  • poses (Union[list, str, pd.DataFrame], optional) – A list of file paths, a directory containing the protein data files, or a DataFrame containing the poses. If not provided, an empty DataFrame is initialized.

  • glob_suffix (str, optional) – A suffix used for globbing multiple files in the specified directory.

Return type:

None

Further Details

This method initializes the poses for the Poses instance. It can accept various input types, including a list of file paths, a directory for globbing files, or a DataFrame. The method ensures that the poses are correctly parsed and set up for further processing.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set poses from a directory with a glob suffix
poses_instance.set_poses(poses='path/to/pose_dir', glob_suffix='*.pdb')

# Set poses from a list of file paths
poses_instance.set_poses(poses=['path/to/pose1.pdb', 'path/to/pose2.pdb'])

Notes

  • If a DataFrame is provided, it is directly used as the poses DataFrame after integrity checks.

  • The method supports parsing multiline FASTA inputs and handles them appropriately.

  • Ensures that the poses DataFrame contains necessary columns for subsequent operations.

set_scorefile(work_dir)[source]

Sets the scorefile path for storing protein scores.

Parameters:

work_dir (str) – The working directory where the scorefile will be stored. If the work directory is not set, the scorefile is stored in the current directory.

Return type:

None

scorefile

The path to the scorefile where protein scores are stored.

Type:

str

Notes

This method configures the path for the scorefile based on the provided working directory. If no working directory is specified, the scorefile is stored in the current directory.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the scorefile path
poses_instance.set_scorefile(work_dir='path/to/work_dir')

set_storage_format(storage_format)[source]

Sets the storage format for storing protein data.

Parameters:

storage_format (str) – The format used for storing protein data. Supported formats include ‘json’, ‘csv’, ‘pickle’, ‘feather’, and ‘parquet’.

Raises:

KeyError – If the provided storage format is not supported.

Return type:

None

Notes

This method configures the storage format for protein data. It ensures that the format is one of the supported formats and raises an error if the format is invalid.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the storage format to 'csv'
poses_instance.set_storage_format('csv')

set_work_dir(work_dir, set_scorefile=True)[source]

Sets up and configures the working directory for storing data and results.

Parameters:
  • work_dir (str) – The working directory where data and results will be stored. If the directory does not exist, it will be created.

  • set_scorefile (bool, optional) – If True, also sets the path for the scorefile in the specified working directory (default is True).

Return type:

None

Further Details

This method creates the necessary subdirectories within the specified working directory to organize score files, filter results, and plots. It ensures that the required directory structure is in place for subsequent data management operations.

Example

from poses import Poses

# Initialize the Poses class
poses_instance = Poses()

# Set the working directory
poses_instance.set_work_dir('path/to/new_work_dir')

Notes

  • The method will log the creation of directories if they do not already exist.

  • If set_scorefile is set to True, the scorefile path will be configured within the working directory.

split_multiline_fasta(path, encoding='UTF-8')[source]

Splits a multiline FASTA file into individual FASTA files, each containing a single sequence.

Parameters:
  • path (str) – The path to the multiline FASTA file.

  • encoding (str, optional) – The encoding of the FASTA file (default is “UTF-8”).

Returns:

A list of file paths to the individual FASTA files.

Return type:

list[str]

Further Details

This method reads a multiline FASTA file and splits it into individual FASTA files, each containing a single sequence. The individual FASTA files are stored in a subdirectory named 'input_fastas_split' within the working directory.

Example

from poses import Poses

# Initialize the Poses class with a working directory
poses_instance = Poses(work_dir='path/to/work_dir')

# Split a multiline FASTA file
individual_fasta_paths = poses_instance.split_multiline_fasta('path/to/multiline.fasta')

Notes

  • The method creates a subdirectory named ‘input_fastas_split’ within the working directory to store the individual FASTA files.

  • The descriptions in the FASTA file are sanitized to replace special characters with underscores.

  • Raises an AttributeError if the working directory is not set.
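
The splitting and sanitizing steps described above can be sketched as follows. This is a minimal illustration, not ProtFlow's implementation: the helper name split_fasta_sketch, the explicit out_dir argument, the '.fa' extension, and the exact set of characters replaced by underscores are all assumptions.

```python
import os
import re


def split_fasta_sketch(path, out_dir, encoding="UTF-8"):
    """Hypothetical helper: write one FASTA file per record and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    with open(path, "r", encoding=encoding) as handle:
        # Every record starts with '>'; drop the empty chunk before the first one.
        records = [rec for rec in handle.read().split(">") if rec.strip()]

    out_paths = []
    for record in records:
        header, *seq_lines = record.strip().splitlines()
        # Assumed sanitizing rule: anything outside [A-Za-z0-9_.-] becomes '_'.
        name = re.sub(r"[^A-Za-z0-9_.-]", "_", header.strip())
        sequence = "".join(line.strip() for line in seq_lines)
        out_path = os.path.join(out_dir, f"{name}.fa")
        with open(out_path, "w", encoding=encoding) as out:
            out.write(f">{name}\n{sequence}\n")
        out_paths.append(out_path)
    return out_paths
```

Splitting on '>' keeps the sketch short but assumes that the character never appears inside a description line.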

protflow.poses.class_in_df(df, cls, out_col)[source]

Return a copy of df with a column listing, for each row, the names of columns whose values are instances of a given class (or classes).

If no cells in the DataFrame match cls, the function returns a copy of df without adding out_col. Empty DataFrames are returned unchanged. Elementwise checks use pandas.DataFrame.map() (pandas ≥ 2.2).

Parameters:
  • df (pandas.DataFrame) – Input DataFrame to inspect.

  • cls (type or tuple[type, ...]) – Class (or tuple of classes) to test against, as in isinstance(). Examples: dict or (dict, list).

  • out_col (str) – Name of the output column to add. Each entry will be a list[str] of column names whose values in that row are instances of cls. The column is only created if at least one match exists anywhere in df.

Returns:

A copy of df. If any matches are found, the copy contains an added column out_col with per-row lists of matching column names. If no matches are found (or df is empty), the copy is returned unchanged.

Return type:

pandas.DataFrame

Notes

  • This function does not mutate df; it returns a modified copy.

  • cls behaves exactly like the second argument to isinstance().

  • To convert the list results to a delimiter-separated string, you can post-process with: out[out_col] = out[out_col].apply('|'.join).

Examples

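import pandas as pd
from protflow.poses import class_in_df

df = pd.DataFrame({
    'a': [1, {'x': 1}, 3],
    'b': [{'y': 2}, 5, [1, 2]],
    'c': ['hi', 'there', 'world'],
})

# Add a column listing, per row, the columns that hold dict values
out = class_in_df(df, dict, 'resselector_cols')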
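
To make the per-row result concrete, the behavior described above can be approximated with plain Python loops. This is a sketch under assumptions, not the library code: the helper name columns_matching_class is hypothetical, and it deliberately avoids pandas.DataFrame.map() so that it does not depend on the pandas version.

```python
import pandas as pd


def columns_matching_class(df, cls, out_col):
    """Hypothetical helper: per row, list the columns whose cell is an instance of cls."""
    out = df.copy()
    if out.empty:
        return out
    matches = [
        [col for j, col in enumerate(df.columns) if isinstance(df.iat[i, j], cls)]
        for i in range(len(df))
    ]
    # Only add the column when at least one cell matched anywhere in the frame.
    if any(matches):
        out[out_col] = pd.Series(matches, index=out.index, dtype=object)
    return out


df = pd.DataFrame({
    'a': [1, {'x': 1}, 3],
    'b': [{'y': 2}, 5, [1, 2]],
    'c': ['hi', 'there', 'world'],
})
result = columns_matching_class(df, dict, 'resselector_cols')
print(result['resselector_cols'].tolist())
# [['b'], ['a'], []]
```
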
protflow.poses.col_in_df(df, column)[source]

Checks if the specified column(s) exist in the DataFrame.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be checked.

  • column (str or list[str]) – The column name or list of column names to check for existence in the DataFrame.

Raises:

KeyError – If any of the specified columns are not found in the DataFrame.

Return type:

None

Further Details

This function checks whether the specified column or list of columns exist in the given DataFrame. It is useful for ensuring that the DataFrame contains the necessary columns before performing further operations.

Example

import pandas as pd
from protflow.poses import col_in_df

# Create a sample DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6]
})

# Check if a column exists
col_in_df(df, 'col1')

# Check if multiple columns exist
col_in_df(df, ['col1', 'col2'])

Notes

  • The function raises a KeyError if any of the specified columns are not found in the DataFrame.

  • Ensures that the DataFrame contains the necessary columns for subsequent operations.

protflow.poses.combine_dataframe_score_columns(df, scoreterms, weights, scale=False)[source]

Combines multiple score columns in a DataFrame into a single composite score, applying weights and normalization.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the score columns.

  • scoreterms (list[str]) – The list of score columns to be combined.

  • weights (list[float]) – The list of weights corresponding to each score column.

  • scale (bool, optional) – If True, scales the composite score to a range between 0 and 1 (default is False).

Returns:

The composite score as a pandas Series.

Return type:

pd.Series

Raises:
  • ValueError – If the number of scoreterms and weights do not match.

  • TypeError – If any score column contains non-numeric values.

Further Details

This function combines multiple score columns in a DataFrame into a single composite score. Each score column is normalized by subtracting the median and dividing by the standard deviation. The normalized scores are then weighted according to the specified weights and summed to create the composite score. Optionally, the composite score can be scaled to a range between 0 and 1.

Example

import pandas as pd
from protflow.poses import combine_dataframe_score_columns

# Create a sample DataFrame
data = {
    'score1': [10, 20, 30, 40, 50],
    'score2': [15, 25, 35, 45, 55]
}
df = pd.DataFrame(data)

# Combine score columns into a composite score
composite_score = combine_dataframe_score_columns(df, scoreterms=['score1', 'score2'], weights=[0.5, 0.5], scale=True)

Notes

  • Normalization makes the scores comparable by removing differences in scale.

  • A ValueError is raised if the number of scoreterms and weights do not match.

  • With scale=True, the composite score is rescaled to the range 0 to 1.
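
The normalize-then-weight procedure from Further Details can be sketched as below. The helper name weighted_composite_sketch is hypothetical, the optional 0 to 1 scaling is left out, and the use of the pandas default sample standard deviation (ddof=1) as well as the handling of constant columns are assumptions rather than confirmed ProtFlow behavior.

```python
import pandas as pd


def weighted_composite_sketch(df, scoreterms, weights):
    """Hypothetical helper: weighted sum of median/std-normalized score columns."""
    if len(scoreterms) != len(weights):
        raise ValueError("scoreterms and weights must have the same length")
    composite = pd.Series(0.0, index=df.index)
    for term, weight in zip(scoreterms, weights):
        column = df[term].astype(float)
        spread = column.std()  # pandas default: sample standard deviation (ddof=1)
        # Assumption: a column without spread contributes zeros instead of NaN.
        normalized = (column - column.median()) / spread if spread > 0 else column * 0.0
        composite = composite + weight * normalized
    return composite


df = pd.DataFrame({
    'score1': [10, 20, 30, 40, 50],
    'score2': [15, 25, 35, 45, 55],
})
composite = weighted_composite_sketch(df, ['score1', 'score2'], [0.5, 0.5])
```

Because both columns have the same spread around their median, the middle row ends up with a composite score of 0 and the outer rows are symmetric around it.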

protflow.poses.description_from_path(path)[source]

Extracts “description” from a pose path.

Parameters:

path (str)

Return type:

str

protflow.poses.filter_dataframe_by_rank(df, col, n, group_col=None, remove_layers=None, layer_col='poses_description', sep='_', ascending=True)[source]

Filters the DataFrame to retain only the top-ranked rows based on a specified column.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be filtered.

  • col (str) – The column in the DataFrame used for ranking.

  • n (Union[float, int]) – The number of top-ranked rows to retain. If n < 1, it represents a fraction of the total rows.

  • group_col (str, optional) – Group dataframe by this column, then filter individual groups.

  • remove_layers (int, optional) – The number of layers to remove from the column values before ranking. This helps in grouping similar rows.

  • layer_col (str, optional) – The column used for layer-based grouping of rows (default is “poses_description”).

  • sep (str, optional) – The separator used in the layer descriptions (default is “_”).

  • ascending (bool, optional) – If True, ranks rows in ascending order; otherwise, in descending order (default is True).

Returns:

The filtered DataFrame containing only the top-ranked rows.

Return type:

pd.DataFrame

Further Details

This function filters the DataFrame to retain only the top-ranked rows based on the values in a specified column. It supports fractional ranking, layer-based grouping, and sorting in ascending or descending order. The function also allows for removing layers from column values before ranking to handle grouped data.

Example

import pandas as pd
from protflow.poses import filter_dataframe_by_rank

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain the top 3 rows based on the score column
filtered_df = filter_dataframe_by_rank(df, col='score', n=3)

Notes

  • The function raises a KeyError if the specified column is not found in the DataFrame.

  • Ensures that the DataFrame is properly sorted and filtered based on the provided parameters.
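
The core ranking logic (absolute or fractional n, optionally per group) can be sketched in plain pandas. The helper name top_n_sketch is hypothetical, layer removal is omitted, and truncating the fractional count with int() is an assumption about rounding.

```python
import pandas as pd


def top_n_sketch(df, col, n, group_col=None, ascending=True):
    """Hypothetical helper: keep the n best rows (or a fraction of rows if n < 1)."""
    if col not in df.columns:
        raise KeyError(f"Column {col} not found in DataFrame")

    def _take(frame):
        # Assumption: a fractional n is truncated to a whole number of rows.
        count = int(len(frame) * n) if n < 1 else int(n)
        return frame.sort_values(col, ascending=ascending).head(count)

    if group_col is None:
        return _take(df)
    # Rank within each group separately, then stitch the groups back together.
    return pd.concat([_take(group) for _, group in df.groupby(group_col, sort=False)])


df = pd.DataFrame({
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50],
})
top_three = top_n_sketch(df, col='score', n=3)
top_fraction = top_n_sketch(df, col='score', n=0.4)
```
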

protflow.poses.filter_dataframe_by_value(df, col, value, operator)[source]

Filters the DataFrame based on a specified value in a column using the provided comparison operator.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be filtered.

  • col (str) – The column in the DataFrame used for filtering.

  • value (Union[float, int]) – The value used as the threshold for filtering rows.

  • operator (str) – The comparison operator used for filtering (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’, ‘!=’).

Returns:

The filtered DataFrame containing only the rows that meet the filtering criteria.

Return type:

pd.DataFrame

Further Details

This function filters the DataFrame based on a specified value in a column, using the provided comparison operator. It supports various comparison operators such as greater than, less than, equal to, and not equal to.

Example

import pandas as pd
from protflow.poses import filter_dataframe_by_value

# Create a sample DataFrame
data = {
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter the DataFrame to retain rows where the score is greater than 30
filtered_df = filter_dataframe_by_value(df, col='score', value=30, operator='>')

Notes

  • The function raises a KeyError if the specified column is not found in the DataFrame.

  • Ensures that the DataFrame is properly filtered based on the provided criteria.
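
One way to dispatch on the operator string is a lookup table of functions from the standard operator module. The sketch below assumes that approach; the helper name filter_by_value_sketch and the table OPERATORS are hypothetical, and the library may implement the comparison differently.

```python
import operator as op

import pandas as pd

# Operator strings as listed in the parameter description above.
OPERATORS = {
    '>': op.gt, '>=': op.ge, '<': op.lt, '<=': op.le, '=': op.eq, '!=': op.ne,
}


def filter_by_value_sketch(df, col, value, operator):
    """Hypothetical helper: keep rows where `df[col] <operator> value` holds."""
    if col not in df.columns:
        raise KeyError(f"Column {col} not found in DataFrame")
    compare = OPERATORS[operator]
    # The comparison yields a boolean mask that selects the matching rows.
    return df[compare(df[col], value)]


df = pd.DataFrame({
    'poses_description': ['pose1', 'pose2', 'pose3', 'pose4', 'pose5'],
    'score': [10, 20, 30, 40, 50],
})
filtered = filter_by_value_sketch(df, col='score', value=30, operator='>')
print(filtered['score'].tolist())
# [40, 50]
```
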

protflow.poses.get_format(path)[source]

Returns the appropriate pandas function to load a file based on its extension.

Parameters:

path (str) – The path to the file whose format needs to be determined.

Returns:

The pandas function corresponding to the file format (e.g., pd.read_json, pd.read_csv).

Return type:

function

Further Details

This function determines the appropriate pandas function to use for loading a file based on its extension. It supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.

Example

import pandas as pd
from protflow.poses import get_format

# Determine the format function for a JSON file
load_function = get_format('path/to/data.json')

# Use the function to load the data
df = load_function('path/to/data.json')

Notes

  • Raises a KeyError if the file format is not supported.

  • Ensures that the appropriate pandas function is returned based on the file extension.
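
The extension-based lookup can be pictured as a dictionary from file extension to pandas reader. The mapping below is an assumption for illustration: the names LOADERS and get_format_sketch are hypothetical, and the exact extensions ProtFlow recognizes may differ.

```python
import os

import pandas as pd

# Assumed extension names; the pandas readers themselves are real functions.
LOADERS = {
    'json': pd.read_json,
    'csv': pd.read_csv,
    'pickle': pd.read_pickle,
    'feather': pd.read_feather,
    'parquet': pd.read_parquet,
}


def get_format_sketch(path):
    """Hypothetical helper: return the pandas reader matching the file extension."""
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    # An unknown extension raises KeyError, mirroring the note above.
    return LOADERS[extension]


load_function = get_format_sketch('path/to/data.json')
```
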

protflow.poses.load_poses(poses_path)[source]

Loads poses from a specified file and returns a Poses instance.

Parameters:

poses_path (str) – The path to the file containing the poses to be loaded.

Returns:

A Poses instance with poses loaded from the specified file.

Return type:

Poses

Further Details

This function reads a file containing poses and returns a Poses instance with the data. The file format is automatically detected based on the file extension, and the corresponding loading function is used to read the data into a DataFrame.

Example

from protflow.poses import Poses, load_poses

# Load poses from a file
poses_instance = load_poses('path/to/poses.json')

Notes

  • The function supports various file formats, including JSON, CSV, Pickle, Feather, and Parquet.

  • Ensures that the loaded DataFrame contains the necessary columns and updates the Poses instance accordingly.

protflow.poses.normalize_series(ser, scale=False)[source]

Normalizes a pandas Series by subtracting the median and dividing by the standard deviation, with an option to scale the values.

Parameters:
  • ser (pd.Series) – The pandas Series to be normalized.

  • scale (bool, optional) – If True, scales the normalized values to a range between 0 and 1 (default is False).

Returns:

The normalized (and optionally scaled) Series.

Return type:

pd.Series

Further Details

This function normalizes a pandas Series by first subtracting the median and then dividing by the standard deviation. If the scale parameter is set to True, the normalized values are further scaled to a range between 0 and 1. This normalization process centers the data around zero and adjusts for variability, making the values comparable.

Example

import pandas as pd
from protflow.poses import normalize_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Normalize the Series
normalized_series = normalize_series(sample_series, scale=True)

Notes

  • If all values in the Series are the same, the function returns a Series of zeros.

  • The optional scaling step ensures that the values are adjusted to a standardized range.

protflow.poses.scale_series(ser)[source]

Scales a pandas Series to a range between 0 and 1.

Parameters:

ser (pd.Series) – The pandas Series to be scaled.

Returns:

The scaled Series with values between 0 and 1.

Return type:

pd.Series

Further Details

This function scales a pandas Series to a range between 0 and 1. It ensures that the minimum value in the Series becomes 0 and the maximum value becomes 1, with all other values adjusted proportionately.

Example

import pandas as pd
from protflow.poses import scale_series

# Create a sample pandas Series
sample_series = pd.Series([10, 20, 30, 40, 50])

# Scale the Series
scaled_series = scale_series(sample_series)

Notes

  • If all values in the Series are the same, the function returns a Series of zeros.

  • The scaling process adjusts the values to fit within a standardized range, making them comparable.
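
The scaling described here corresponds to ordinary min-max scaling. A minimal sketch, with the hypothetical helper name min_max_scale_sketch and the constant-series case handled as stated in the notes above:

```python
import pandas as pd


def min_max_scale_sketch(ser):
    """Hypothetical helper: map the minimum to 0 and the maximum to 1."""
    span = ser.max() - ser.min()
    if span == 0:
        # All values are identical: return zeros instead of dividing by zero.
        return pd.Series(0.0, index=ser.index)
    return (ser - ser.min()) / span


scaled = min_max_scale_sketch(pd.Series([10, 20, 30, 40, 50]))
print(scaled.tolist())
# [0.0, 0.25, 0.5, 0.75, 1.0]
```
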

protflow.residues module

residues

The residues module is a part of the protflow package and is designed to handle residue selection and related operations in protein structures. This module provides functionality to parse, manipulate, and convert residue selections in various formats, making it an essential tool for bioinformatics and computational biology workflows.

The module includes the ResidueSelection class for representing and manipulating selections of residues, as well as various functions for parsing and converting residue selections.

Classes

  • ResidueSelection

    Represents a selection of residues with functionality for parsing, converting, and manipulating selections.

Functions

  • fast_parse_selection

    Fast parser for selections already in ResidueSelection format.

  • parse_selection

    Parses a selection into ResidueSelection formatted selection.

  • parse_residue

    Parses a single residue identifier into a tuple (chain, residue_index).

  • residue_selection

    Creates a ResidueSelection from a selection of residues.

  • from_dict

    Creates a ResidueSelection object from a dictionary specifying a motif.

  • from_contig

    Creates a ResidueSelection object from a contig string.

  • reduce_to_unique

    Reduces an input array to its unique elements while preserving order.

Example Usage

Creating and manipulating ResidueSelection objects:

from protflow.residues import ResidueSelection, from_dict, from_contig

# Create a ResidueSelection from a list
selection = ResidueSelection(["A1", "A2", "B3"])

# Convert to string
selection_str = selection.to_string()
print(selection_str)
# Output: A1, A2, B3

# Convert to dictionary
selection_dict = selection.to_dict()
print(selection_dict)
# Output: {'A': [1, 2], 'B': [3]}

# Create a ResidueSelection from a dictionary
selection_from_dict = from_dict({"A": [1, 2], "B": [3]})
print(selection_from_dict.to_string())
# Output: A1, A2, B3

# Create a ResidueSelection from a contig string
selection_from_contig = from_contig("A1-A3, B5")
print(selection_from_contig.to_string())
# Output: A1, A2, A3, B5

This module simplifies the process of handling residue selections in bioinformatics workflows, providing a consistent interface for different types of input and output formats.

class protflow.residues.ResidueSelection(selection=None, delim=',', fast=False, from_scorefile=False)[source]

Bases: object


A class to represent selections of residues in protein structures. A selection of residues is represented as a tuple with the hierarchy ((chain, residue_idx), …).

Parameters:
  • selection (list, optional) – A list of residues in string format, e.g., [“A1”, “A2”, “B3”]. Default is None.

  • delim (str, optional) – The delimiter used to parse the selection string. Default is “,”.

  • fast (bool, optional) – If True, parses the selection without any type checking. Use when selection is already in ResidueSelection format. Default is False.

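  • from_scorefile (bool, optional) – If True, parses a residue selection that was read in from a scorefile. Default is False.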

residues

A tuple representing the parsed residues selection.

Type:

tuple

Examples

>>> from protflow.residues import ResidueSelection
>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}
from_selection(selection)[source]

Constructs a ResidueSelection instance from the provided selection.

Parameters:

selection (list or str) – The selection of residues to be parsed.

Returns:

A new ResidueSelection instance.

Return type:

ResidueSelection

to_dict()[source]

Converts the ResidueSelection to a dictionary.

Note

Converting to a dictionary destroys the ordering of specific residues on the same chain in a motif.

Returns:

A dictionary representation of the ResidueSelection with chains as keys and lists of residue indices as values.

Return type:

dict

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_dict())
{'A': [1, 2], 'B': [3]}
to_list(ordering=None)[source]

Converts the ResidueSelection to a list of strings.

Parameters:

ordering (str, optional) – Specifies the ordering of the residues in the output list. Options are “rosetta” or “pymol”. Default is None.

Returns:

The list representation of the ResidueSelection.

Return type:

list of str

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_list())
['A1', 'A2', 'B3']
>>> print(selection.to_list(ordering="rosetta"))
['1A', '2A', '3B']
to_rfdiffusion_contig()[source]

Parses ResidueSelection object to contig string for RFdiffusion.

Example

If self.residues = ((“A”, 1), (“A”, 2), (“A”, 3), (“C”, 4), (“C”, 6)), the output will be “A1-3,C4,C6”.

Return type:

str
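
The collapsing of consecutive residues into ranges, as in the example above, can be sketched like this. The function name to_contig_sketch is hypothetical and works on a plain tuple of (chain, index) pairs rather than on a ResidueSelection object.

```python
def to_contig_sketch(residues):
    """Hypothetical helper: merge consecutive residues on one chain into 'A1-3' ranges."""
    residues = [tuple(res) for res in residues]
    parts = []
    i = 0
    while i < len(residues):
        chain, start = residues[i]
        end = start
        # Extend the range while the next residue is the direct successor on the same chain.
        while i + 1 < len(residues) and residues[i + 1] == (chain, end + 1):
            end += 1
            i += 1
        parts.append(f"{chain}{start}" if end == start else f"{chain}{start}-{end}")
        i += 1
    return ",".join(parts)


print(to_contig_sketch((("A", 1), ("A", 2), ("A", 3), ("C", 4), ("C", 6))))
# A1-3,C4,C6
```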

to_string(delim=',', ordering=None)[source]

Converts the ResidueSelection to a string.

Parameters:
  • delim (str, optional) – The delimiter to use in the resulting string. Default is “,”.

  • ordering (str, optional) – Specifies the ordering of the residues in the output string. Options are “rosetta” or “pymol”. Default is None.

Returns:

The ResidueSelection formatted as a string, with residues separated by delim.

Return type:

str

Examples

>>> selection = ResidueSelection(["A1", "A2", "B3"])
>>> print(selection.to_string())
A1, A2, B3
>>> print(selection.to_string(ordering="rosetta"))
1A, 2A, 3B
protflow.residues.fast_parse_selection(input_selection)[source]

Fast selection parser for pre-formatted selections.

This function is a fast parser for residue selections that are already in the ResidueSelection format. It bypasses any additional type checking or parsing to improve performance when the input is guaranteed to be correctly formatted.

Parameters:

input_selection (tuple of tuple of (str, int)) – A tuple of tuples where each inner tuple represents a residue with the format (chain, residue_index).

Returns:

The input selection, unchanged.

Return type:

tuple of tuple of (str, int)

Examples

>>> input_selection = (("A", 1), ("B", 2), ("C", 3))
>>> fast_parse_selection(input_selection)
(('A', 1), ('B', 2), ('C', 3))
protflow.residues.from_contig(input_contig)[source]

Creates a ResidueSelection object from a contig string.

This function constructs a ResidueSelection instance from a contig string. The contig string can specify ranges of residues using a hyphen (-) to denote the range, with residues separated by commas (,). For example, “A1-A3, B5” specifies residues A1, A2, A3, and B5.

Parameters:

input_contig (str) – A contig string specifying the residues. Ranges can be denoted using hyphens, and residues are separated by commas.

Returns:

An instance of the ResidueSelection class representing the parsed selection of residues.

Return type:

ResidueSelection
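
To illustrate how a contig string expands into individual residues, here is a small sketch that returns a plain tuple of (chain, index) pairs instead of a ResidueSelection. The helper name expand_contig_sketch is hypothetical, and the sketch assumes single-letter chains and non-negative residue numbers.

```python
def expand_contig_sketch(contig):
    """Hypothetical helper: expand 'A1-A3, B5' into (chain, index) tuples."""
    residues = []
    for part in contig.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            chain, first = start[0], int(start[1:])
            # Accept both 'A1-A3' and the shorter 'A1-3' for the range end.
            last = int(end[1:]) if end[0].isalpha() else int(end)
            residues.extend((chain, idx) for idx in range(first, last + 1))
        else:
            residues.append((part[0], int(part[1:])))
    return tuple(residues)


print(expand_contig_sketch("A1-A3, B5"))
# (('A', 1), ('A', 2), ('A', 3), ('B', 5))
```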

Examples

>>> from_contig("A1-A3, B5")
<ResidueSelection object representing ('A', 1), ('A', 2), ('A', 3), ('B', 5)>
>>> from_contig("C1, C3-C5, D2")
<ResidueSelection object representing ('C', 1), ('C', 3), ('C', 4), ('C', 5), ('D', 2)>
protflow.residues.from_dict(input_dict)[source]

Creates a ResidueSelection object from a dictionary.

This function constructs a ResidueSelection instance from a dictionary where the keys represent chain identifiers and the values are lists of residue indices. This format specifies a motif in the following way: {chain: [residues], …}.

Parameters:

input_dict (dict) – A dictionary specifying the motif. The keys are chain identifiers (str) and the values are lists of residue indices (int).

Returns:

An instance of the ResidueSelection class representing the parsed selection of residues.

Return type:

ResidueSelection

Examples

>>> input_dict = {"A": [1, 2], "B": [3, 4]}
>>> from_dict(input_dict)
<ResidueSelection object representing ('A', 1), ('A', 2), ('B', 3), ('B', 4)>
protflow.residues.parse_from_scorefile(input_selection)[source]
Parameters:

input_selection (dict)

Return type:

tuple[tuple[str, int]]

protflow.residues.parse_residue(residue_identifier)[source]

Parses a single residue identifier into a tuple (chain, residue_index).

This function takes a residue identifier string and parses it into a tuple containing the chain identifier and the residue index. It currently only supports single-letter chain identifiers.

Parameters:

residue_identifier (str) – A string representing the residue identifier. The format is expected to be either “chain+residue_index” or “residue_index+chain”, where “chain” is a single letter and “residue_index” is an integer.

Returns:

A tuple containing the chain identifier and the residue index.

Return type:

tuple of (str, int)

Examples

>>> parse_residue("A123")
('A', 123)
>>> parse_residue("123A")
('A', 123)

Notes

  • The function determines whether the chain identifier is at the beginning or the end of the string based on whether the first character is a digit.

  • Only single-letter chain identifiers are supported.
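
The rule from the notes (inspect the first character to decide where the chain letter sits) can be written out in a few lines. The name parse_residue_sketch is hypothetical; this is an illustration of the rule, not the library source.

```python
def parse_residue_sketch(residue_identifier):
    """Hypothetical helper: parse 'A123' or '123A' into ('A', 123)."""
    text = residue_identifier.strip()
    if text[0].isdigit():
        # Index first, chain letter last: '123A'
        return (text[-1], int(text[:-1]))
    # Chain letter first, index after: 'A123'
    return (text[0], int(text[1:]))


print(parse_residue_sketch("A123"))
# ('A', 123)
print(parse_residue_sketch("123A"))
# ('A', 123)
```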

protflow.residues.parse_selection(input_selection, delim=',', fast=False, from_scorefile=False)[source]

Parses a selection into ResidueSelection formatted selection.

This function takes a selection of residues in various formats and parses it into the ResidueSelection format, which is a tuple of tuples. Each inner tuple represents a residue with the format (chain, residue_index).

Parameters:
  • input_selection (str, list, or tuple) –

    The selection of residues to be parsed. This can be:
    • A string with residues separated by a delimiter.

    • A list or tuple of residue strings.

    • A list or tuple of lists/tuples, where each inner list/tuple represents a residue.

  • delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is “,”.

  • fast (bool, optional) – If True, uses fast_parse_selection to bypass type checking and parsing for performance reasons. Use when input_selection is already in the correct format. Default is False.

  • from_scorefile (bool, optional) – If True, parses a residue selection that was read in from a scorefile (in the form {‘residues’: [[‘A’, 1], [‘B’, 3]]}). Default is False.

Returns:

A tuple of tuples where each inner tuple represents a residue in the format (chain, residue_index).

Return type:

tuple of tuple of (str, int)

Raises:

TypeError – If input_selection is not a supported type (str, list, or tuple).

Examples

>>> parse_selection("A1, B2, C3")
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection(["A1", "B2", "C3"])
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection([["A", 1], ["B", 2], ["C", 3]])
(('A', 1), ('B', 2), ('C', 3))
>>> parse_selection([("A", 1), ("B", 2), ("C", 3)], fast=True)
(('A', 1), ('B', 2), ('C', 3))
protflow.residues.reduce_to_unique(input_array)[source]

Reduces an input array to its unique elements while preserving order.

This function takes a list or tuple and returns a new list or tuple containing only the unique elements from the input, with their original order preserved. The type of the returned collection matches the type of the input.

Parameters:

input_array (list or tuple) – The input array from which to remove duplicate elements. The order of the elements is preserved.

Returns:

A new list or tuple containing only the unique elements from the input array, with the original order preserved.

Return type:

list or tuple

Examples

>>> reduce_to_unique([1, 2, 2, 3, 1])
[1, 2, 3]
>>> reduce_to_unique(("a", "b", "a", "c", "b"))
('a', 'b', 'c')

Notes

  • The function uses OrderedDict.fromkeys to remove duplicates while preserving order.

  • The returned collection is of the same type as the input (list or tuple).
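
The OrderedDict.fromkeys idiom mentioned in the notes fits in one line. The wrapper name reduce_to_unique_sketch is hypothetical; it only illustrates the idiom and the type-preserving return value.

```python
from collections import OrderedDict


def reduce_to_unique_sketch(input_array):
    """Hypothetical helper: drop duplicates, keep first occurrences, keep the input type."""
    # Dictionary keys are unique and remember insertion order.
    return type(input_array)(OrderedDict.fromkeys(input_array))


print(reduce_to_unique_sketch([1, 2, 2, 3, 1]))
# [1, 2, 3]
print(reduce_to_unique_sketch(("a", "b", "a", "c", "b")))
# ('a', 'b', 'c')
```

Elements must be hashable for this idiom to work, which holds for residue tuples such as ('A', 1).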

protflow.residues.residue_selection(input_selection, delim=',')[source]

Creates a ResidueSelection from a selection of residues.

This function takes an input selection of residues in various formats and creates a ResidueSelection object. The selection can be provided as a string, list, or tuple.

Parameters:
  • input_selection (str, list, or tuple) –

    The selection of residues to be parsed. This can be:
    • A string with residues separated by a delimiter.

    • A list or tuple of residue strings.

    • A list or tuple of lists/tuples, where each inner list/tuple represents a residue.

  • delim (str, optional) – The delimiter used to split the input string if input_selection is a string. Default is “,”.

Returns:

An instance of the ResidueSelection class representing the parsed selection of residues.

Return type:

ResidueSelection

Examples

>>> residue_selection("A1, B2, C3")
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>
>>> residue_selection(["A1", "B2", "C3"])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>
>>> residue_selection([["A", 1], ["B", 2], ["C", 3]])
<ResidueSelection object representing ('A', 1), ('B', 2), ('C', 3)>

protflow.runners module

runners module

This module provides functionality for handling the interaction between runners and poses in protein data processing workflows.

It includes classes and utility functions to:

  • Manage the output from runner processes.

  • Define abstract runner interfaces.

  • Parse and manage command-line options and flags for runner processes.

Dependencies:

  • builtins: logging, os, re

  • pandas

  • protflow.poses: Poses, get_format, FORMAT_STORAGE_DICT

  • protflow.jobstarters: JobStarter

Overview:

The runners module is designed to facilitate the integration of various runner processes with protein pose data, ensuring consistent data formatting, error handling, and integration of results into the Poses class. Utility functions provided in this module support the parsing and handling of command-line options and flags, making it easier to configure and execute runner processes in a flexible manner.

Notes

This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author

Markus Braun, Adrian Tripp

Version

0.1.0

class protflow.runners.Runner[source]

Bases: object

Abstract Runner base class

The Runner class provides an abstract base for defining runners that handle the interface between runner processes and the Poses class. It includes methods for running jobs, checking paths, verifying prefixes, preparing pose options, and managing job setup and score files.

Examples

To create a custom runner, subclass Runner and implement the abstract methods:

>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"
...
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass

Example usage:

>>> my_runner = MyRunner()
>>> poses = Poses()
>>> jobstarter = JobStarter()
>>> runner_output = my_runner.run(poses, "example_prefix", jobstarter)
exception CrashError[source]

Bases: RuntimeError

Re-raised error with job stderr context when collect_scores fails.

classmethod __init_subclass__(**kwargs)[source]

overwrites subclasses to check for exceptions

__str__()[source]

Abstract method to provide the name of the runner.

This method should be overridden in subclasses to return the name of the runner.

Raises:

NotImplementedError – If the method is not overridden in the subclass.

Examples

>>> class MyRunner(Runner):
...     def __str__(self):
...         return "MyRunner"
check_for_existing_scorefile(scorefile, overwrite=False)[source]

Checks if a scorefile exists and returns it as a DataFrame if overwrite is False.

Parameters:
  • scorefile (str) – The path to the scorefile.

  • overwrite (bool, optional) – Whether to overwrite the scorefile if it exists (default is False).

Returns:

The scorefile as a DataFrame if it exists and overwrite is False. None otherwise.

Return type:

pandas.DataFrame

Examples

>>> runner = MyRunner()
>>> scores_df = runner.check_for_existing_scorefile("/path/to/scorefile.csv")
check_for_prefix(prefix, poses)[source]

Checks if a column with the given prefix already exists in the Poses DataFrame.

Parameters:
  • prefix (str) – The prefix to be checked.

  • poses (Poses) – An instance of the Poses class whose DataFrame will be checked.

Raises:

KeyError – If a column with the given prefix already exists in the Poses DataFrame.

Return type:

None

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> runner.check_for_prefix("example_prefix", poses)
generic_run_setup(poses, prefix, jobstarters, make_work_dir=True)[source]

Sets up the runner’s working directory and jobstarter.

Checks if the prefix exists in poses.df, sets up a jobstarter, and creates the working directory if necessary.

Returns absolute path to working directory and the jobstarter that will be used for the runner.

Parameters:
  • poses (Poses) – An instance of the Poses class.

  • prefix (str) – The prefix to be used for the setup.

  • jobstarters (list[JobStarter]) – A list of JobStarter instances to choose from.

  • make_work_dir (bool, optional) – Whether to create the working directory if it does not exist (default is True).

Note

The order of jobstarters in the jobstarters parameter is: [Runner.run(jobstarter), Runner.jobstarter, poses.default_jobstarter].

Returns:

A tuple containing the path to the working directory and the selected JobStarter instance.

Return type:

tuple[str, JobStarter]

Raises:

ValueError – If no valid JobStarter is set.
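
Given the priority order of the jobstarters list, the selection step presumably amounts to taking the first entry that is set. The sketch below makes that assumption explicit: the helper name pick_jobstarter_sketch is hypothetical, and treating unset entries as None is a guess rather than documented behavior.

```python
def pick_jobstarter_sketch(jobstarters):
    """Hypothetical helper: return the first jobstarter that is not None."""
    # Expected order: [Runner.run(jobstarter), Runner.jobstarter, poses.default_jobstarter]
    for candidate in jobstarters:
        if candidate is not None:
            return candidate
    raise ValueError("No valid JobStarter is set.")


# Strings stand in for JobStarter instances in this illustration.
chosen = pick_jobstarter_sketch([None, "runner_default", "poses_default"])
print(chosen)
# runner_default
```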

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> jobstarters = [JobStarter(), JobStarter(), JobStarter()]
>>> work_dir, jobstarter = runner.generic_run_setup(poses, "example_prefix", jobstarters)
prep_pose_options(poses, pose_options=None)[source]

Prepares pose options, ensuring they are of the same length as the poses.

Parameters:
  • poses (Poses) – An instance of the Poses class.

  • pose_options (list[str], optional) – A list of pose options to be prepared. If not provided, an empty list will be used.

Returns:

A list of prepared pose options.

Return type:

list

Raises:

ValueError – If the length of pose_options does not match the length of poses.

Examples

>>> runner = MyRunner()
>>> poses = Poses()
>>> prepared_options = runner.prep_pose_options(poses, ["option1", "option2"])
run(poses, prefix, jobstarter)[source]

Abstract method to run jobs and send scores to Poses.

This method should be overridden in subclasses to define the job execution logic and integrate the results into the Poses class.

Parameters:
  • poses (Poses) – An instance of the Poses class to be processed.

  • prefix (str) – Prefix to be added to the results columns.

  • jobstarter (JobStarter) – An instance of the JobStarter class to handle job execution.

Returns:

An instance of the RunnerOutput class containing the processed results.

Return type:

RunnerOutput

Raises:

NotImplementedError – If the method is not overridden in the subclass.

Examples

>>> class MyRunner(Runner):
...     def run(self, poses: Poses, prefix: str, jobstarter: JobStarter) -> RunnerOutput:
...         # Custom implementation for running jobs
...         pass
save_runner_scorefile(scores, scorefile)[source]

Saves the runner’s scorefile based on the file extension format.

Parameters:
  • scores (pandas.DataFrame) – The DataFrame containing the scores to be saved.

  • scorefile (str) – The path to the scorefile to be saved.

Raises:

KeyError – If the file extension format is not recognized.

Return type:

None

Examples

>>> runner = MyRunner()
>>> scores_df = pd.DataFrame({'score': [1, 2, 3]})
>>> runner.save_runner_scorefile(scores_df, "/path/to/scorefile.csv")
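
One way to get both the extension-based behavior and the documented KeyError is a dictionary of writer functions keyed by file extension. The sketch below assumes that design; WRITERS and save_scores_sketch are hypothetical names, and the standard-library csv and json modules stand in for the pandas writers so the example runs on its own.

```python
import csv
import json
import os
import tempfile


def _write_csv(rows, path):
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def _write_json(rows, path):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)


# Hypothetical table of supported formats, keyed by file extension.
WRITERS = {"csv": _write_csv, "json": _write_json}


def save_scores_sketch(rows, scorefile):
    extension = os.path.splitext(scorefile)[1].lstrip(".").lower()
    # An unknown extension raises KeyError from the dictionary lookup.
    WRITERS[extension](rows, scorefile)


scores = [{"score": 1}, {"score": 2}, {"score": 3}]
out_path = os.path.join(tempfile.mkdtemp(), "scorefile.json")
save_scores_sketch(scores, out_path)
```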
search_path(input_path, path_name, is_dir=False)[source]

Checks if a given path exists and is valid.

Parameters:
  • input_path (str) – The path to be checked.

  • path_name (str) – The name associated with the path, used for error messages.

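  • is_dir (bool, optional) – Whether input_path is expected to be a directory rather than a file (default is False).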

Returns:

The validated path.

Return type:

str

Raises:

ValueError – If the path is not set or does not exist on the local filesystem.

Examples

>>> runner = MyRunner()
>>> valid_path = runner.search_path("/path/to/file", "example_path")
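
The validation can be approximated with os.path checks. The following sketch is illustrative only: check_path_sketch is a hypothetical function, and it assumes that is_dir switches from a file check to a directory check.

```python
import os
import tempfile


def check_path_sketch(input_path, path_name, is_dir=False):
    """Hypothetical validator: return the path or raise ValueError."""
    if not input_path:
        raise ValueError(f"Path for {path_name} is not set.")
    # Assumption: is_dir selects a directory check instead of a file check.
    exists = os.path.isdir(input_path) if is_dir else os.path.isfile(input_path)
    if not exists:
        raise ValueError(f"{path_name} not found at {input_path}.")
    return input_path


workdir = tempfile.mkdtemp()
print(check_path_sketch(workdir, "work_dir", is_dir=True) == workdir)  # True
```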
class protflow.runners.RunnerOutput(poses, results, prefix, index_layers=0, index_sep='_')[source]

Bases: object

RunnerOutput class

The RunnerOutput class handles how protein data is passed between Runner and Poses classes. It ensures the correct formatting of results and facilitates the integration of runner outputs into the Poses data structure.

Parameters:
  • poses (Poses) – An instance of the Poses class.

  • results (pandas.DataFrame) – A DataFrame containing the results to be checked and formatted. The DataFrame must contain ‘description’ and ‘location’ columns.

  • prefix (str) – A prefix to be added to the results columns.

  • index_layers (int, optional) – Number of index layers to remove from the ‘description’ column (default is 0).

  • index_sep (str, optional) – Separator used in the index (default is “_”).
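
To make index_layers and index_sep concrete, the sketch below assumes that each index layer is a trailing, separator-delimited suffix of the description (for example "pose_0001_0003" with one layer removed becomes "pose_0001"). strip_index_layers is a hypothetical helper, not part of ProtFlow.

```python
def strip_index_layers(description: str, index_layers: int = 0, index_sep: str = "_") -> str:
    """Drop the last `index_layers` separator-delimited fields (hypothetical helper)."""
    if index_layers <= 0:
        return description
    return description.rsplit(index_sep, maxsplit=index_layers)[0]


print(strip_index_layers("pose_0001_0003", index_layers=1))  # pose_0001
print(strip_index_layers("pose_0001_0003", index_layers=2))  # pose
```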

check_data_formatting(results)[source]

Checks if the input DataFrame has the correct format.

Parameters:

results (pandas.DataFrame) – The input DataFrame to be checked. It must contain ‘description’ and ‘location’ columns.

Returns:

The validated and formatted DataFrame.

Return type:

pandas.DataFrame

Raises:

ValueError – If the input DataFrame does not contain the required columns or if the ‘description’ column does not match the ‘location’ column.

return_poses()[source]

Integrates the output of a runner into a Poses class.

This method adds the output of a Runner class formatted in RunnerOutput into Poses.df and returns the updated Poses instance.

Returns:

The updated Poses instance with the integrated runner output.

Return type:

Poses

Raises:

ValueError – If merging DataFrames fails due to no overlap between Poses.df[‘poses_description’] and results[new_df_col] or if some rows in results[new_df_col] were not found in Poses.df[‘poses_description’].
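
The two failure cases named above (no overlap at all, and result rows without a matching pose) can be illustrated without pandas. The sketch below uses a plain list and dictionary in place of Poses.df and the results DataFrame; integrate_results_sketch is a hypothetical function and does not reproduce the library's merge.

```python
def integrate_results_sketch(pose_descriptions, results):
    """Attach result rows to poses by description (hypothetical helper).

    pose_descriptions: list of pose names.
    results: dict mapping a description to its scores.
    """
    known = set(pose_descriptions)
    incoming = set(results)
    if not known & incoming:
        raise ValueError("No overlap between pose descriptions and results.")
    unmatched = sorted(incoming - known)
    if unmatched:
        raise ValueError(f"Results without a matching pose: {unmatched}")
    # Poses without a result row keep None in this sketch.
    return {desc: results.get(desc) for desc in pose_descriptions}


merged = integrate_results_sketch(
    ["pose_0001", "pose_0002"],
    {"pose_0001": {"score": 1.5}, "pose_0002": {"score": 2.0}},
)
print(merged["pose_0002"])  # {'score': 2.0}
```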

protflow.runners.col_in_df(df, column)[source]

Checks if a column exists in a DataFrame.

This function verifies whether a specified column is present in the given DataFrame. If the column is not found, it raises a KeyError.

Parameters:
  • df (pandas.DataFrame) – The DataFrame to be checked.

  • column (str) – The name of the column to be verified.

Raises:

KeyError – If the specified column is not found in the DataFrame.

Return type:

None

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_in_df(df, 'A')  # No error raised
>>> col_in_df(df, 'C')  # Raises KeyError
Traceback (most recent call last):
    ...
KeyError: 'Could not find C in poses dataframe! Are you sure you provided the right column name?'
protflow.runners.expand_options_flags(options_str, sep='--')[source]

Simple parsing function to parse options and flags from an input string.

Splits an input string into options and flags based solely on the specified separator. If your command has more complex patterns in its options (for example, the separator appearing inside a quoted value), use regex_expand_options_flags instead. Options are key-value pairs, while flags are standalone keys without values.

Parameters:
  • options_str (str) – The input string containing options and flags to be parsed.

  • sep (str, optional) – The separator used to distinguish different options and flags (default is "--").

Returns:

A tuple containing a dictionary of options and a set of flags.

Return type:

tuple[dict, set]

Examples

>>> options_str = "--width 800 --height 600 --verbose"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}
>>> options_str = "--color=blue --debug --timeout=30"
>>> opts, flags = expand_options_flags(options_str)
>>> print(opts)
{'color': 'blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
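
To show why purely separator-based splitting is limited, here is a minimal sketch of such a parser. It is an assumption about the general technique, not the ProtFlow source; split_options_sketch is a hypothetical name.

```python
import re


def split_options_sketch(options_str: str, sep: str = "--"):
    """Hypothetical separator-based parser: returns (options, flags)."""
    options, flags = {}, set()
    for chunk in options_str.split(sep):
        chunk = chunk.strip()
        if not chunk:
            continue
        # A key is separated from its value by "=" or by whitespace.
        parts = re.split(r"[=\s]+", chunk, maxsplit=1)
        if len(parts) == 2 and parts[1]:
            options[parts[0]] = parts[1]
        else:
            flags.add(parts[0])
    return options, flags


opts, flags = split_options_sketch("--width 800 --color=blue --verbose")
print(opts)   # {'width': '800', 'color': 'blue'}
print(flags)  # {'verbose'}

# The limitation: a separator inside a quoted value is still split on.
broken = split_options_sketch("--note 'a--b'")
```

In the last call the quoted value is cut in two at the inner separator, which is the situation a quote-aware parser is meant to avoid.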
protflow.runners.options_flags_to_string(options, flags, sep='--', no_quotes=False)[source]

Converts options dictionary and flags list into a single string.

This function combines a dictionary of options and a list of flags into a single command-line style string.

Parameters:
  • options (dict) – A dictionary of options, where keys are option names and values are option values.

  • flags (list) – A list of flags (standalone options without values).

  • sep (str, optional) – The separator used to distinguish different options and flags (default is "--").

  • no_quotes (bool, optional) – If True, disables the quoting of command-line arguments whose values contain whitespace (default is False). For example, an option such as --my_list='1 4 6 14' needs its list quoted; setting no_quotes=True would produce --my_list=1 4 6 14, which can cause errors.

Returns:

A string representation of the combined options and flags.

Return type:

str
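
The effect of no_quotes can be illustrated with a small sketch. to_option_string is a hypothetical re-implementation of the quoting rule described above, written under the assumption that only values containing whitespace are quoted; it does not reproduce the exact output format of the library function.

```python
def to_option_string(options: dict, flags: list, sep: str = "--", no_quotes: bool = False) -> str:
    """Hypothetical sketch of the quoting rule described above."""
    parts = []
    for key, value in options.items():
        value = str(value)
        # Quote values containing whitespace unless quoting is disabled.
        if not no_quotes and any(ch.isspace() for ch in value):
            value = f"'{value}'"
        parts.append(f"{sep}{key}={value}")
    parts.extend(f"{sep}{flag}" for flag in flags)
    return " ".join(parts)


print(to_option_string({"my_list": "1 4 6 14"}, ["force"]))
# --my_list='1 4 6 14' --force
print(to_option_string({"my_list": "1 4 6 14"}, ["force"], no_quotes=True))
# --my_list=1 4 6 14 --force
```

With no_quotes=True a shell would read 4, 6 and 14 as separate arguments, which is why quoting is the default.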

Examples

>>> options = {'width': '800', 'height': '600'}
>>> flags = ['verbose', 'debug']
>>> options_flags_to_string(options, flags)
" --width=800 --height=600 --verbose --debug"
>>> options = {'color': 'dark blue', 'timeout': '30'}
>>> flags = ['force']
>>> options_flags_to_string(options, flags)
" --color='dark blue' --timeout=30 --force"
protflow.runners.parse_generic_options(options, pose_options, sep='--')[source]

Parses generic options and pose-specific options from two input strings, combining them into a single dictionary of options and a list of flags. Pose-specific options overwrite generic options in case of conflicts. Options are expected to be separated by a specified separator within each input string, with options and their values separated by spaces.

Parameters:
  • options (str) – A string of generic options, where different options are separated by the specified separator and each option’s value (if any) is separated by a space.

  • pose_options (str) – A string of pose-specific options, formatted like the options parameter. These options take precedence over generic options.

  • sep (str, optional) – The separator used to distinguish between different options in both input strings. Defaults to "--".

Returns:

A 2-element tuple where the first element is a dictionary of merged options (key-value pairs) and the second element is a list of unique flags (options without values) from both input strings.

Examples:

>>> parse_generic_options("--width 800 --height 600", "--color blue --verbose")
({'width': '800', 'height': '600', 'color': 'blue'}, ['verbose'])

This function internally utilizes a helper function expand_options_flags to process each input string separately before merging the results, ensuring that pose-specific options and flags are appropriately prioritized and duplicates are removed.

Return type:

tuple[dict, list]
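
The precedence rule (pose-specific options overwrite generic ones, flags are deduplicated) can be sketched on already-parsed inputs. merge_parsed_sketch is a hypothetical helper that takes two (options, flags) pairs; it is one way to express the merge, not necessarily how ProtFlow does it.

```python
def merge_parsed_sketch(generic, pose_specific):
    """Merge two (options, flags) pairs; pose-specific entries win (hypothetical helper)."""
    generic_opts, generic_flags = generic
    pose_opts, pose_flags = pose_specific
    # Later keys overwrite earlier ones, so pose-specific options take precedence.
    merged_opts = {**generic_opts, **pose_opts}
    # Deduplicate flags while keeping first-seen order.
    merged_flags = list(dict.fromkeys([*generic_flags, *pose_flags]))
    return merged_opts, merged_flags


opts, flags = merge_parsed_sketch(
    ({"width": "800", "height": "600"}, ["verbose"]),
    ({"width": "1024"}, ["verbose", "debug"]),
)
print(opts)   # {'width': '1024', 'height': '600'}
print(flags)  # ['verbose', 'debug']
```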

protflow.runners.prepend_cmd(cmds, pre_cmd)[source]

Prepends a single command to all commands in a list.

Parameters:
  • cmds (list[str]) – A list of commands, where all elements are strings.

  • pre_cmd (str) – A string containing a command, which should be prepended to all commands in the commands list.

Returns:

A list of all commands with the additional command prepended to each.

Return type:

list[str]

Examples

>>> cmds = ["run_inference.sh pose_0001.pdb", "run_inference.sh pose_0002.pdb"]
>>> pre_cmd = "conda init"
>>> prepend_cmd(cmds, pre_cmd)
['conda init; run_inference.sh pose_0001.pdb', 'conda init; run_inference.sh pose_0002.pdb']
protflow.runners.regex_expand_options_flags(options_str, sep='--')[source]

Parses options and flags from an input string using regular expressions.

This function uses regular expressions to split an input string into options and flags. It ensures that separators within quotes are not split.

Parameters:
  • options_str (str) – The input string containing options and flags to be parsed.

  • sep (str, optional) – The separator used to distinguish different options and flags (default is "--").

Returns:

A tuple containing a dictionary of options and a set of flags.

Return type:

tuple[dict, set]

Examples

>>> options_str = '--width 800 --height 600 --verbose'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'width': '800', 'height': '600'}
>>> print(flags)
{'verbose'}
>>> options_str = '--color="dark blue" --debug --timeout=30'
>>> opts, flags = regex_expand_options_flags(options_str)
>>> print(opts)
{'color': 'dark blue', 'timeout': '30'}
>>> print(flags)
{'debug'}
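
One common way to keep separators inside quotes intact is a lookahead that only accepts a separator when the rest of the string has balanced quotes. The sketch below uses that technique as an assumption about how such a parser can work; regex_split_sketch is a hypothetical name and the pattern is not taken from the ProtFlow source.

```python
import re


def regex_split_sketch(options_str: str, sep: str = "--"):
    """Hypothetical quote-aware parser: returns (options, flags)."""
    # Match the separator only when the rest of the string has balanced quotes,
    # i.e. the separator itself is not inside a quoted value.
    outside_quotes = r"""(?=(?:[^'"]|'[^']*'|"[^"]*")*$)"""
    pattern = re.compile(re.escape(sep) + outside_quotes)
    options, flags = {}, set()
    for chunk in pattern.split(options_str):
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = re.split(r"[=\s]+", chunk, maxsplit=1)
        if len(parts) == 2 and parts[1]:
            # Remove the surrounding quotes from the value.
            options[parts[0]] = parts[1].strip("'\"")
        else:
            flags.add(parts[0])
    return options, flags


opts, flags = regex_split_sketch("--note 'a--b' --color=\"dark blue\" --debug")
print(opts)   # {'note': 'a--b', 'color': 'dark blue'}
print(flags)  # {'debug'}
```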

Module contents

Package initialization

protflow.get_config()[source]
Return type:

object

protflow.require_config()[source]

Default function to be called in runners to require that a config.py file has been set up. This function imports and returns protflow.config.

Return type:

object