Source code for protflow.tools.ligandmpnn

"""
LigandMPNN Module
=================

This module provides the functionality to integrate LigandMPNN within the ProtFlow framework. It offers tools to run LigandMPNN, handle its inputs and outputs, and process the resulting data in a structured and automated manner.

Detailed Description
--------------------
The `LigandMPNN` class encapsulates the functionality necessary to execute LigandMPNN runs. It manages the configuration of paths to essential scripts and Python executables, sets up the environment, and handles the execution of the diffusion processes. It also includes methods for collecting and processing output data, ensuring that the results are organized and accessible for further analysis within the ProtFlow ecosystem.

The module is designed to streamline the integration of LigandMPNN into larger computational workflows. It supports the automatic setup of job parameters, execution of LigandMPNN commands, and parsing of output files into a structured DataFrame format. This facilitates subsequent data analysis and visualization steps.

Usage
-----
To use this module, create an instance of the `LigandMPNN` class and invoke its `run` method with appropriate parameters. The module will handle the configuration, execution, and result collection processes. Detailed control over the process is provided through various parameters, allowing for customized runs tailored to specific research needs.

Examples
--------
Here is an example of how to initialize and use the `LigandMPNN` class within a ProtFlow pipeline:

.. code-block:: python

    from protflow.poses import Poses
    from protflow.jobstarters import JobStarter
    from ligandmpnn import LigandMPNN

    # Create instances of necessary classes
    poses = Poses()
    jobstarter = JobStarter()

    # Initialize the LigandMPNN class
    ligandmpnn = LigandMPNN()

    # Run the diffusion process
    results = ligandmpnn.run(
        poses=poses,
        prefix="experiment_1",
        jobstarter=jobstarter,
        nseq=10,
        model_type="ligand_mpnn",
        options="some_option=some_value",
        pose_options=["pose_option=pose_value"],
        overwrite=True
    )

    # Access and process the results
    print(results)

Further Details
---------------
- Edge Cases: The module handles various edge cases, such as empty pose lists and the need to overwrite previous results. It ensures robust error handling and logging for easier debugging and verification of the process.
- Customizability: Users can customize the process through multiple parameters, including the number of sequences, specific options for the LigandMPNN script, and options for handling pose-specific parameters.
- Integration: The module seamlessly integrates with other components of the ProtFlow framework, leveraging shared configurations and data structures to provide a cohesive user experience.

This module is intended for researchers and developers who need to incorporate LigandMPNN into their protein design and analysis workflows. By automating many of the setup and execution steps, it allows users to focus on interpreting results and advancing their scientific inquiries.

Notes
-----
This module is part of the ProtFlow package and is designed to work in tandem with other components of the package, especially those related to job management in HPC environments.

Author
------
Markus Braun, Adrian Tripp

Version
-------
0.1.0
"""
# general imports
import json
import os
import logging
from glob import glob
import shutil
from typing import Union

# dependencies
import pandas as pd
import Bio
import Bio.SeqIO

# custom
from protflow import require_config, load_config_path, jobstarters
from protflow.residues import ResidueSelection
from protflow.poses import Poses
from protflow.jobstarters import JobStarter
from protflow.runners import Runner, RunnerOutput, regex_expand_options_flags, parse_generic_options, col_in_df, options_flags_to_string, prepend_cmd

LIGANDMPNN_CHECKPOINT_DICT = {
    "protein_mpnn": "/model_params/proteinmpnn_v_48_020.pt",
    "ligand_mpnn": "/model_params/ligandmpnn_v_32_010_25.pt",
    "per_residue_label_membrane_mpnn": "/model_params/per_residue_label_membrane_mpnn_v_48_020.pt",
    "global_label_membrane_mpnn": "/model_params/global_label_membrane_mpnn_v_48_020.pt",
    "soluble_mpnn": "/model_params/solublempnn_v_48_020.pt"
}


[docs]
class LigandMPNN(Runner):
    """
    LigandMPNN Class
    ================

    The `LigandMPNN` class provides the necessary methods to execute LigandMPNN runs within the ProtFlow framework. This class is responsible for managing the configuration, execution, and output processing of LigandMPNN tasks.

    Detailed Description
    --------------------
    The `LigandMPNN` class integrates LigandMPNN into the ProtFlow pipeline by setting up the environment, running the diffusion process, and collecting the results. It ensures that the inputs and outputs are handled efficiently, making the data readily available for further analysis.

    Key Features:
    - Manages paths to essential scripts and executables.
    - Configures and executes LigandMPNN processes.
    - Collects and processes output data into a structured DataFrame format.
    - Handles various edge cases and supports custom configurations through multiple parameters.

    Usage
    -----
    To use this class, initialize it with the appropriate script and Python paths, along with an optional job starter. The main functionality is provided through the `run` method, which requires parameters such as poses, prefix, and additional options for customization.

    Example
    -------
    .. code-block:: python

        from protflow.poses import Poses
        from protflow.jobstarters import JobStarter
        from ligandmpnn import LigandMPNN

        # Create instances of necessary classes
        poses = Poses()
        jobstarter = JobStarter()

        # Initialize the LigandMPNN class
        ligandmpnn = LigandMPNN()

        # Run the diffusion process
        results = ligandmpnn.run(
            poses=poses,
            prefix="experiment_1",
            jobstarter=jobstarter,
            nseq=10,
            model_type="ligand_mpnn",
            options="some_option=some_value",
            pose_options=["pose_option=pose_value"],
            overwrite=True
        )

        # Access and process the results
        print(results)

    Notes
    -----
    This class is designed to work within the ProtFlow framework and assumes that the necessary configurations and dependencies are properly set up. It leverages shared data structures and configurations from ProtFlow to provide a seamless integration experience.

    Author
    ------
    Markus Braun, Adrian Tripp

    Version
    -------
    0.1.0
    """

[docs]
    def __init__(self, script_path: str|None = None, python_path: str|None = None, pre_cmd: str|None = None, jobstarter: JobStarter = None) -> None:
        """
        Initializes the LigandMPNN class.

        Parameters:
            script_path (str, optional): The path to the LigandMPNN script. Defaults to the configured script path in ProtFlow.
            python_path (str, optional): The path to the Python executable to run the LigandMPNN script. Defaults to the configured Python path in ProtFlow.
            jobstarter (JobStarter, optional): An instance of the JobStarter class to manage job submissions. If not provided, it will use the default job starter configuration.

        Detailed Description
        --------------------
        The `__init__` method sets up the necessary paths and configurations for running LigandMPNN. It searches for the provided script and Python
        paths to ensure they are correct and sets them as instance attributes. Additionally, it initializes the job starter, which manages the execution
        of jobs in high-performance computing (HPC) environments. This method ensures that all configurations are correctly set up before running any
        LigandMPNN tasks.
        """
        # setup config
        config = require_config()
        self.script_path = script_path or load_config_path(config, "LIGANDMPNN_SCRIPT_PATH")
        self.python_path = python_path or load_config_path(config, "LIGANDMPNN_PYTHON_PATH")
        self.pre_cmd = pre_cmd or load_config_path(config, "LIGANDMPNN_PRE_CMD", is_pre_cmd=True)

        # setup runner
        self.name = "ligandmpnn.py"
        self.index_layers = 1
        self.jobstarter = jobstarter


    def __str__(self):
        return "ligandmpnn.py"


[docs]
    def run(self, poses: Poses, prefix: str, jobstarter: JobStarter = None, nseq: int = 1, model_type: str = None, options: str = None, pose_options: object = None, fixed_res_col: str = None, design_res_col: str = None, pose_opt_cols: dict = None, return_seq_threaded_pdbs_as_pose: bool = False, preserve_original_output: bool = False, overwrite: bool = False) -> Poses:
        """
        Execute the LigandMPNN process with given poses and jobstarter configuration.

        This method sets up and runs the LigandMPNN process using the provided poses and jobstarter object. It handles the configuration, execution, and collection of output data, ensuring that the results are organized and accessible for further analysis.

        Parameters:
            poses (Poses): The Poses object containing the protein structures.
            prefix (str): A prefix used to name and organize the output files.
            jobstarter (JobStarter, optional): An instance of the JobStarter class, which manages job execution. Defaults to None.
            nseq (int, optional): The number of sequences to generate for each input pose. Defaults to 1.
            model_type (str, optional): The type of model to use. Defaults to 'ligand_mpnn'.
            options (str, optional): Additional options for the LigandMPNN script. Defaults to None.
            pose_options (object, optional): Pose-specific options for the LigandMPNN script. Defaults to None.
            fixed_res_col (str, optional): Column name in the poses DataFrame specifying fixed residues. Defaults to None.
            design_res_col (str, optional): Column name in the poses DataFrame specifying residues to be redesigned. Defaults to None.
            pose_opt_cols (dict, optional): Dictionary of pose-specific options for the LigandMPNN script. Defaults to None.
            return_seq_threaded_pdbs_as_pose (bool, optional): If True, return sequence-threaded PDBs as poses. Defaults to False.
            preserve_original_output (bool, optional): If True, preserve the original output files. Defaults to True.
            overwrite (bool, optional): If True, overwrite existing output files. Defaults to False.

        Returns:
            Poses: The updated Poses object containing the results of the LigandMPNN process.

        Raises:
            FileNotFoundError: If required files or directories are not found during the execution process.
            ValueError: If invalid arguments are provided to the method.

        Examples:
            Here is an example of how to use the `run` method:

            .. code-block:: python

                from protflow.poses import Poses
                from protflow.jobstarters import JobStarter
                from ligandmpnn import LigandMPNN

                # Create instances of necessary classes
                poses = Poses()
                jobstarter = JobStarter()

                # Initialize the LigandMPNN class
                ligandmpnn = LigandMPNN()

                # Run the diffusion process
                results = ligandmpnn.run(
                    poses=poses,
                    prefix="experiment_1",
                    jobstarter=jobstarter,
                    nseq=10,
                    model_type="ligand_mpnn",
                    options="some_option=some_value",
                    pose_options=["pose_option=pose_value"],
                    overwrite=True
                )

                # Access and process the results
                print(results)

        Further Details:
            - **Setup and Execution:** The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.
            - **Output Management:** The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.
            - **Customization:** Extensive customization options are provided through parameters, allowing users to tailor the process to their specific needs.

        This method is designed to streamline the execution of LigandMPNN processes within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein design simulations.
        """
        self.index_layers = 1

        # integrate redesigned and fixed residue parameters into pose_opt_cols:
        pose_opt_cols = pose_opt_cols or {}
        if fixed_res_col is not None:
            pose_opt_cols["fixed_residues"] = fixed_res_col
        if design_res_col is not None:
            pose_opt_cols["redesigned_residues"] = design_res_col

        # run in batch mode if pose_options are not set:
        run_batch = self.check_for_batch_run(pose_options, pose_opt_cols)
        if run_batch:
            logging.info(f"Setting up ligandmpnn run {prefix} for batched design.")

        # check if sidechain packing was specified in options
        pack_sidechains = "pack_side_chains" in options if options else False

        # setup runner
        work_dir, jobstarter = self.generic_run_setup(
            poses=poses,
            prefix=prefix,
            jobstarters=[jobstarter, self.jobstarter, poses.default_jobstarter]
        )

        logging.info(f"Running {self} in {work_dir} on {len(poses.df.index)} poses.")

        # Look for output-file in pdb-dir. If output is present and correct, skip LigandMPNN.
        scorefile = os.path.join(work_dir, f"ligandmpnn_scores.{poses.storage_format}")
        if (scores := self.check_for_existing_scorefile(scorefile=scorefile, overwrite=overwrite)) is not None:
            logging.info(f"Found existing scorefile at {scorefile}. Returning {len(scores.index)} poses from previous run without running calculations.")
            output = RunnerOutput(poses=poses, results=scores, prefix=prefix, index_layers=self.index_layers)
            return output.return_poses()


        # parse pose_opt_cols into pose_options format.
        pose_opt_cols_options = self.parse_pose_opt_cols(poses=poses, pose_opt_cols=pose_opt_cols, output_dir=work_dir)

        # parse pose_options
        pose_options = self.prep_pose_options(poses=poses, pose_options=pose_options)

        # combine pose_options and pose_opt_cols_options (priority goes to pose_opt_cols_options):
        pose_options = [options_flags_to_string(*parse_generic_options(pose_opt, pose_opt_cols_opt, sep="--"), sep="--") for pose_opt, pose_opt_cols_opt in zip(pose_options, pose_opt_cols_options)]

        # write ligandmpnn cmds:
        cmds = [self.write_cmd(pose, output_dir=work_dir, model=model_type, nseq=nseq, options=options, pose_options=pose_opts) for pose, pose_opts in zip(poses.df['poses'].to_list(), pose_options)]

        # batch_run setup:
        if run_batch:
            cmds = self.setup_batch_run(cmds, num_batches=jobstarter.max_cores, output_dir=work_dir)

        # prepend pre-cmd if defined:
        if self.pre_cmd:
            cmds = prepend_cmd(cmds = cmds, pre_cmd=self.pre_cmd)

        # create output directories, LigandMPNN crashes sometimes when multiple processes create the same directory simultaneously (frozen os error)
        for folder in ["backbones", "input_json_files", "packed", "seqs"]:
            os.makedirs(os.path.join(work_dir, folder), exist_ok=True)

        # run
        jobstarter.start(
            cmds=cmds,
            jobname="ligandmpnn",
            wait=True,
            output_path=f"{work_dir}/"
        )

        # collect scores
        scores = collect_scores(
            work_dir=work_dir,
            return_seq_threaded_pdbs_as_pose=return_seq_threaded_pdbs_as_pose,
            preserve_original_output=preserve_original_output,
            pack_sidechains=pack_sidechains
        )

        if len(scores.index) < len(poses.df.index) * nseq:
            raise RuntimeError("Number of output poses is smaller than number of input poses * nseq. Some runs might have crashed!")

        logging.info(f"Saving scores of {self} at {scorefile}")
        self.save_runner_scorefile(scores=scores, scorefile=scorefile)

        logging.info(f"{self} finished. Returning {len(scores.index)} poses.")
        return RunnerOutput(poses=poses, results=scores, prefix=prefix, index_layers=self.index_layers).return_poses()



[docs]
    def check_for_batch_run(self, pose_options: str, pose_opt_cols):
        """
        Checks if LigandMPNN can be run in batch mode.

        This method determines whether the LigandMPNN process can be executed in batch mode. It does this by checking if pose-specific options are not provided and if only multi-residue columns are specified in the pose options.

        Parameters:
            pose_options (str): Pose-specific options for the LigandMPNN script.
            pose_opt_cols (dict): Dictionary of pose-specific options for the LigandMPNN script.

        Returns:
            bool: True if LigandMPNN can be run in batch mode, False otherwise.

        Examples:
            Here is an example of how to use the `check_for_batch_run` method:

            .. code-block:: python

                # Initialize the LigandMPNN class
                ligandmpnn = LigandMPNN()

                # Check for batch run
                can_batch_run = ligandmpnn.check_for_batch_run(
                    pose_options=None,
                    pose_opt_cols={"fixed_residues": "fixed_res_col"}
                )

                print(can_batch_run)  # Outputs: True or False

        Further Details:
            - **Batch Mode Check:** The method checks if the `pose_options` is None and if the `pose_opt_cols` contains only multi-residue columns, which are necessary for batch processing.
        """
        #no_incompatible_options = pose_options or fixed_res_col or design_res_col # checks if any of those options is set
        return not pose_options and self.multi_cols_only(pose_opt_cols)



[docs]
    def multi_cols_only(self, pose_opt_cols: dict) -> bool:
        '''checks if only multi_res cols are in pose_opt_cols dict. Only _multi arguments can be used for ligandmpnn_batch runs.'''
        multi_cols = ["omit_AA_per_residue", "bias_AA_per_residue", "redesigned_residues", "fixed_residues"]
        return not pose_opt_cols or all((col in multi_cols for col in pose_opt_cols))



[docs]
    def setup_batch_run(self, cmds:list[str], num_batches:int, output_dir:str) -> list[str]:
        """
        Concatenates commands for MPNN into batches so that MPNN does not have to be loaded individually for each PDB file.

        This method prepares the LigandMPNN commands for batch execution. It concatenates the commands into batches to optimize the running process by reducing the overhead of loading the MPNN model multiple times.

        Parameters:
            cmds (list[str]): A list of commands to run LigandMPNN.
            num_batches (int): The number of batches to split the commands into.
            output_dir (str): The directory where the batch input JSON files will be saved.

        Returns:
            list[str]: A list of concatenated batch commands.

        Examples:
            Here is an example of how to use the `setup_batch_run` method:

            .. code-block:: python

                # Initialize the LigandMPNN class
                ligandmpnn = LigandMPNN()

                # Example commands
                cmds = [
                    "/path/to/python /path/to/run.py --option1=value1 --pdb_path=path1.pdb",
                    "/path/to/python /path/to/run.py --option2=value2 --pdb_path=path2.pdb",
                    # More commands...
                ]

                # Setup batch run
                batch_cmds = ligandmpnn.setup_batch_run(
                    cmds=cmds,
                    num_batches=2,
                    output_dir="/path/to/output"
                )

                print(batch_cmds)  # Outputs the batch commands

        Further Details:
            - **Batch Command Setup:** The method splits the provided commands into sublists based on the number of batches. It then processes each sublist to handle multi-residue options and generate corresponding JSON files.
            - **JSON Directory:** The method sets up a directory for storing JSON files that contain mappings for multi-residue options.
            - **Command Concatenation:** Each command sublist is processed to extract and convert multi-residue options into JSON files, which are then referenced in the batch commands.
        """
        def _strip_quotes(in_opt):
            return in_opt.strip("'").strip('"') if isinstance(in_opt, str) else in_opt
        multi_cols = {
            "omit_AA_per_residue": "omit_AA_per_residue_multi",
            "bias_AA_per_residue": "bias_AA_per_residue_multi", 
            "redesigned_residues": "redesigned_residues_multi",
            "fixed_residues": "fixed_residues_multi", 
            "pdb_path": "pdb_path_multi"
        }
        # setup json directory
        json_dir = f"{output_dir}/input_json_files/"
        if not os.path.isdir(json_dir):
            os.makedirs(json_dir, exist_ok=True)

        # split cmds list into n=num_batches sublists
        cmd_sublists = jobstarters.split_list(cmds, n_sublists=num_batches)

        # concatenate cmds: parse _multi arguments into .json files and keep all other arguments in option.
        batch_cmds = []
        for i, cmd_list in enumerate(cmd_sublists, start=1):
            full_cmd_list = [cmd.split(" ", 2) for cmd in cmd_list] # splits off the first two things of the command: [{python} {ligmpnn.py} {rest of command}] and extracts {rest of command}
            opts_flags_list = [regex_expand_options_flags(cmd[-1]) for cmd in full_cmd_list]
            opts_list = [x[0] for x in opts_flags_list] # regex_expand_options_flags() returns (options, flags)

            # take first cmd for general options and flags
            full_opts_flags: tuple[dict, set] = opts_flags_list[0]
            cmd_start = " ".join(full_cmd_list[0][:2]) # keep /path/to/python3 /path/to/run.py

            # extract lists for _multi options
            for col, multi_col in multi_cols.items():
                # if col does not exist in options, skip:
                if col not in opts_list[0]:
                    continue

                # extract pdb-file to argument mapping as dictionary:
                col_dict = {opts["pdb_path"]: _strip_quotes(opts[col]) for opts in opts_list} # remove all quotes from strings for LigandMPNN to read options correctly.

                # write col_dict to json
                col_json_path = f"{json_dir}/{col}_{i}.json"
                with open(col_json_path, 'w', encoding="UTF-8") as f:
                    json.dump(col_dict, f)

                # remove single option from full_opts_flags
                del full_opts_flags[0][col]

                # set cmd_json file as _multi option:
                full_opts_flags[0][multi_col] = col_json_path

            # reassemble command and put into batch_cmds
            batch_cmd = f"{cmd_start} {options_flags_to_string(*full_opts_flags, sep='--', no_quotes=True)}"
            batch_cmds.append(batch_cmd)

        return batch_cmds



[docs]
    def parse_pose_opt_cols(self, poses: Poses, output_dir: str, pose_opt_cols: dict = None) -> list[dict]:
        """
        Parses pose-specific options columns into pose options formatted strings.

        This method processes the `pose_opt_cols` dictionary and converts its contents into a format that can be used as part of the LigandMPNN pose options. It ensures that the options are properly structured and, if necessary, writes specific arguments into JSON files.

        Parameters:
            poses (Poses): The Poses object containing the protein structures.
            output_dir (str): The directory where JSON files for multi-residue options will be saved.
            pose_opt_cols (dict, optional): Dictionary of pose-specific options for the LigandMPNN script. Defaults to None.

        Returns:
            list[dict]: A list of dictionaries containing the parsed pose options formatted as strings.

        Raises:
            ValueError: If both fixed_residues and redesigned_residues are defined in pose_opt_cols, or if specified columns do not exist in poses.df.

        Examples:
            Here is an example of how to use the `parse_pose_opt_cols` method:

            .. code-block:: python

                # Initialize the LigandMPNN class
                ligandmpnn = LigandMPNN()

                # Example Poses object and pose_opt_cols
                poses = Poses()
                pose_opt_cols = {
                    "bias_AA_per_residue": "bias_col",
                    "fixed_residues": "fixed_res_col"
                }

                # Parse pose options
                parsed_opts = ligandmpnn.parse_pose_opt_cols(
                    poses=poses,
                    output_dir="/path/to/output",
                    pose_opt_cols=pose_opt_cols
                )

                print(parsed_opts)  # Outputs the parsed pose options

        Further Details:
            - **Option Parsing:** The method converts the `pose_opt_cols` dictionary into a list of strings formatted as pose options. It handles various types of options, including those that need to be written into JSON files and those that can be parsed directly from residue selections.
            - **JSON Directory Setup:** If necessary, the method sets up a directory for storing JSON files that contain mappings for multi-residue options.
            - **Error Handling:** The method includes checks to ensure that incompatible options are not specified simultaneously and that all specified columns exist in the poses DataFrame.
        """
        # return list of empty strings if pose_opts_col is None.
        if pose_opt_cols is None:
            return ["" for _ in poses]

        # setup output_dir for .json files
        if any([key in ["bias_AA_per_residue", "omit_AA_per_residue"] for key in pose_opt_cols]):
            json_dir = f"{output_dir}/input_json_files/"
            if not os.path.isdir(json_dir):
                os.makedirs(json_dir, exist_ok=True)

        # check if fixed_residues and redesigned_residues were set properly (gets checked in LigandMPNN too, so maybe this is redundant.)
        if "fixed_residues" in pose_opt_cols and "redesigned_residues" in pose_opt_cols:
            raise ValueError("Cannot define both <fixed_res_column> and <design_res_column>!")

        # check if all specified columns exist in poses.df:
        for col in list(pose_opt_cols.values()):
            col_in_df(poses.df, col)

        # parse pose_options
        pose_options = []
        for pose in poses:
            opts = []
            for mpnn_arg, mpnn_arg_col in pose_opt_cols.items():
                # arguments that must be written into .json files:
                if mpnn_arg in ["bias_AA_per_residue", "omit_AA_per_residue"]:
                    output_path = f"{json_dir}/{mpnn_arg}_{pose['poses_description']}.json"
                    opts.append(f"--{mpnn_arg}={write_to_json(pose[mpnn_arg_col], output_path)}")

                # arguments that can be parsed as residues (from ResidueSelection objects):
                elif mpnn_arg in ["redesigned_residues", "fixed_residues", "transmembrane_buried", "transmembrane_interface"]:
                    opts.append(f"--{mpnn_arg}={parse_residues(pose[mpnn_arg_col])}")

                # all other arguments:
                else:
                    opts.append(f"--{mpnn_arg}={pose[mpnn_arg_col]}")
            pose_options.append(" ".join(opts))
        return pose_options



[docs]
    def write_cmd(self, pose_path:str, output_dir:str, model:str, nseq:int, options:str, pose_options:str):
        """
        Writes the command to run ligandmpnn.py.

        This method constructs the command necessary to run the LigandMPNN script, incorporating various options and parameters. It ensures that the command is correctly formatted and includes all required arguments.

        Parameters:
            pose_path (str): The path to the input PDB file for the pose.
            output_dir (str): The directory where the output files will be saved.
            model (str): The type of model to use (e.g., "ligand_mpnn").
            nseq (int): The number of sequences to generate for each input pose. Defaults to 1.
            options (str): Additional options for the LigandMPNN script.
            pose_options (str): Pose-specific options for the LigandMPNN script.

        Returns:
            str: The constructed command string to run LigandMPNN.

        Raises:
            ValueError: If the specified model is not one of the available models.

        Examples:
            Here is an example of how to use the `write_cmd` method:

            .. code-block:: python

                # Initialize the LigandMPNN class
                ligandmpnn = LigandMPNN()

                # Write the command
                cmd = ligandmpnn.write_cmd(
                    pose_path="path/to/input.pdb",
                    output_dir="path/to/output",
                    model="ligand_mpnn",
                    nseq=10,
                    options="some_option=some_value",
                    pose_options="pose_option=pose_value"
                )

                print(cmd)  # Outputs the constructed command string

        Further Details:
            - **Model Validation:** The method checks if the specified model is among the available models and raises an error if it is not.
            - **Option Parsing:** The method parses generic options and pose-specific options, ensuring that necessary safety checks and defaults are applied.
            - **Command Construction:** The method assembles the final command string, including paths, model checkpoints, options, and other necessary parameters.
        """
        # parse ligandmpnn_dir:
        ligandmpnn_dir = os.path.dirname(self.script_path)

        # check if specified model is correct.
        available_models = ["protein_mpnn", "ligand_mpnn", "soluble_mpnn", "global_label_membrane_mpnn", "per_residue_label_membrane_mpnn"]
        if model not in available_models:
            raise ValueError(f"{model} must be one of {available_models}!")

        # parse options
        opts, flags = parse_generic_options(options, pose_options)

        # safetychecks:
        if "model_type" not in opts:
            opts["model_type"] = model or "ligand_mpnn"
        if "number_of_batches" not in opts:
            opts["number_of_batches"] = nseq or "1"
        # define model_checkpoint option:
        if f"checkpoint_{model}" not in opts:
            model_checkpoint_options = f"--checkpoint_{model}={ligandmpnn_dir}/{LIGANDMPNN_CHECKPOINT_DICT[model]}"
        else:
            model_checkpoint_options = opts[f"checkpoint_{model}"]

        # safety
        logging.debug("Setting parse_atoms_with_zero_occupancy to 1 to ensure that the run does not crash.")
        if "parse_atoms_with_zero_occupancy" not in opts:
            opts["parse_atoms_with_zero_occupancy"] = "1"
        elif opts["parse_atoms_with_zero_occupancy"] != "1":
            opts["parse_atoms_with_zero_occupancy"] = "1"

        # convert to string
        options = options_flags_to_string(opts, flags, sep="--")

        # write command and return.
        return f"{self.python_path} {self.script_path} {model_checkpoint_options} --out_folder {output_dir}/ --pdb_path {pose_path} {options}"




[docs]
def collect_scores(work_dir: str, return_seq_threaded_pdbs_as_pose: bool, preserve_original_output: bool = True, pack_sidechains: bool = False) -> pd.DataFrame:
    """
    Collects scores from the LigandMPNN output.

    This method processes the output files generated by LigandMPNN, including multi-sequence FASTA files and PDB files. It reads, renames, and organizes these files into a structured DataFrame.

    Parameters:
        work_dir (str): The directory where LigandMPNN output files are located.
        return_seq_threaded_pdbs_as_pose (bool): If True, replaces FASTA files with sequence-threaded PDB files as poses.
        preserve_original_output (bool, optional): If True, preserves the original output files. Defaults to True.

    Returns:
        pd.DataFrame: A DataFrame containing the collected scores and relevant data from the LigandMPNN output.

    Raises:
        FileNotFoundError: If required output files are not found in the specified directory.

    Examples:
        Here is an example of how to use the `collect_scores` method:

        .. code-block:: python

            # Initialize the LigandMPNN class
            ligandmpnn = LigandMPNN()

            # Collect scores from the output directory
            scores = ligandmpnn.collect_scores(
                work_dir="/path/to/output",
                return_seq_threaded_pdbs_as_pose=True,
                preserve_original_output=False
            )

            print(scores)  # Outputs the collected scores DataFrame

    Further Details:
        - **Output Processing:** The method reads and parses multi-sequence FASTA files, converts sequences into a structured dictionary, and writes new FASTA files if necessary.
        - **File Management:** Original output files are copied to dedicated directories, and new files are generated and organized for easy access. Optionally, original files can be preserved or deleted based on the `preserve_original_output` parameter.
        - **Error Handling:** The method includes checks to ensure that required output files are present, raising errors if files are missing or paths are incorrect.
    """
    def mpnn_fastaparser(fasta_path):
        '''reads in ligandmpnn multi-sequence fasta, renames sequences and returns them'''
        records = list(Bio.SeqIO.parse(fasta_path, "fasta"))
        #maxlength = len(str(len(records)))

        # Set continuous numerating for the names of mpnn output sequences:
        name = records[0].name.replace(",", "")
        records[0].name = name
        for i, x in enumerate(records[1:]):
            setattr(x, "name", f"{name}_{str(i+1).zfill(4)}")

        return records

    def convert_ligandmpnn_seqs_to_dict(seqs):
        '''
        Takes already parsed list of fastas as input <seqs>. Fastas can be parsed with the function mpnn_fastaparser(file).
        Should be put into list.
        Converts mpnn fastas into a dictionary:
        {
            "col_1": [vals]
                ...
            "col_n": [vals]
        }
        '''
        # Define cols and initiate them as empty lists:
        seqs_dict = {}
        cols = ["mpnn_origin", "seed", "description", "sequence", "T", "id", "seq_rec", "overall_confidence", "ligand_confidence"]
        for col in cols:
            seqs_dict[col] = []

        # Read scores of each sequence in each file and append them to the corresponding columns:
        for seq in seqs:
            for f in seq[1:]:
                seqs_dict["mpnn_origin"].append(seq[0].name)
                seqs_dict["sequence"].append(str(f.seq))
                seqs_dict["description"].append(f.name)
                d = {k: float(v) for k, v in [x.split("=") for x in f.description.split(", ")[1:]]}
                for k, v in d.items():
                    seqs_dict[k].append(v)
        return seqs_dict

    def write_mpnn_fastas(seqs_dict: dict) -> pd.DataFrame:
        seqs_dict["location"] = list()
        for d, s in zip(seqs_dict["description"], seqs_dict["sequence"]):
            seqs_dict["location"].append((fa_file := f"{seq_dir}/{d}.fa"))
            with open(fa_file, 'w', encoding="UTF-8") as f:
                f.write(f">{d}\n{s}")
        return pd.DataFrame(seqs_dict)

    def rename_mpnn_pdb(pdb: str) -> None:
        '''changes single digit file extension to 4 digit file extension'''
        filename, extension = os.path.splitext(pdb)[0].rsplit('_', 1)
        filename = f"{filename}_{extension.zfill(4)}.pdb"
        shutil.move(pdb, filename)

    def rename_packed_pdb(pdb_path: str) -> None:
        '''changes single digit file extension to 4 digit file extension.'''
        filename = os.path.splitext(pdb_path)[0]
        name_split = filename.split("_")
        name_split[-1] = name_split[-1].zfill(4)
        name_split[-2] = name_split[-2].zfill(4)
        name_split.remove("packed")
        filename = f"{'_'.join(name_split)}.pdb"
        shutil.move(pdb_path, filename)
        if filename.endswith("_0001.pdb"):
            shutil.copy(filename, filename.rsplit("_", 1)[0] + ".pdb")

    # read .pdb files
    seq_dir = os.path.join(work_dir, 'seqs')
    pdb_dir = os.path.join(work_dir, 'backbones')
    fl = glob(f"{seq_dir}/*.fa")
    pl = glob(f"{pdb_dir}/*.pdb")
    if not fl:
        raise FileNotFoundError(f"No .fa files were found in the output directory of LigandMPNN {seq_dir}. LigandMPNN might have crashed (check output log), or path might be wrong!")
    if not pl:
        raise FileNotFoundError(f"No .pdb files were found in the output directory of LigandMPNN {pdb_dir}. LigandMPNN might have crashed (check output log), or path might be wrong!")

    seqs = [mpnn_fastaparser(fasta) for fasta in fl]
    seqs_dict = convert_ligandmpnn_seqs_to_dict(seqs)

    original_seqs_dir = os.path.join(seq_dir, 'original_seqs')
    logging.info(f"Copying original .fa files into directory {original_seqs_dir}")
    os.makedirs(original_seqs_dir, exist_ok=True)
    _ = [shutil.move(fasta, os.path.join(original_seqs_dir, os.path.basename(fasta))) for fasta in fl]

    original_pdbs_dir = os.path.join(pdb_dir, 'original_backbones')
    logging.info(f"Copying original .pdb files into directory {original_pdbs_dir}")
    os.makedirs(original_pdbs_dir, exist_ok=True)
    _ = [shutil.copy(pdb, os.path.join(original_pdbs_dir, os.path.basename(pdb))) for pdb in pl]
    _ = [rename_mpnn_pdb(pdb) for pdb in pl]

    # Write new .fa files by iterating through "description" and "sequence" keys of the seqs_dict
    logging.info(f"Writing new fastafiles at original location {seq_dir}.")
    scores = write_mpnn_fastas(seqs_dict)

    if return_seq_threaded_pdbs_as_pose:
        #replace .fa with sequence threaded pdb files as poses
        scores['location'] = [os.path.join(pdb_dir, f"{os.path.splitext(os.path.basename(series['location']))[0]}.pdb") for _, series in scores.iterrows()]

    if pack_sidechains:
        pack_dir = os.path.join(work_dir, 'packed')
        pack_fl = glob(f"{pack_dir}/*_1.pdb")
        if not pack_fl:
            raise FileNotFoundError(f"No .pdb files were found in the output directory of LigandMPNN {pack_dir}. LigandMPNN might have crashed (check output log), or path might be wrong!")
        for pdb in pack_fl:
            rename_packed_pdb(pdb)

        # extract only first replicate
        scores['location'] = [os.path.join(pack_dir, f"{os.path.splitext(os.path.basename(series['location']))[0]}.pdb") for _, series in scores.iterrows()]

    if not preserve_original_output:
        if os.path.isdir(original_seqs_dir):
            logging.info(f"Deleting original .fa files at {original_seqs_dir}!")
            shutil.rmtree(original_seqs_dir)
        if os.path.isdir(original_pdbs_dir):
            logging.info(f"Deleting original .pdb files at {original_pdbs_dir}!")
            shutil.rmtree(original_pdbs_dir)

    return scores



[docs]
def parse_residues(residues:object) -> str:
    """
    Parses residues from either ResidueSelection object, list, or MPNN-formatted string into MPNN-formatted string.

    This function converts the input residues into a format compatible with MPNN. It supports conversion from ResidueSelection objects, comma-separated strings, and lists of residues.

    Parameters:
        residues (object): The input residues to be parsed. This can be a ResidueSelection object, a comma-separated string, or a list of residues.

    Returns:
        str: The residues formatted as a string compatible with MPNN.

    Raises:
        ValueError: If the input type is not supported (i.e., not a str or ResidueSelection).

    Examples:
        Here is an example of how to use the `parse_residues` function:

        .. code-block:: python

            from protflow.residues import ResidueSelection

            # Example ResidueSelection object
            residues = ResidueSelection(["A:10", "A:20"])

            # Parse residues
            parsed_residues = parse_residues(residues)
            print(parsed_residues)  # Outputs: "A:10 A:20"

            # Example string input
            residues_str = "A:10,A:20"
            parsed_residues = parse_residues(residues_str)
            print(parsed_residues)  # Outputs: "A:10 A:20"

    Further Details:
        - **ResidueSelection Object:** The function calls the `to_string` method of the ResidueSelection object to get the MPNN-formatted string.
        - **String Input:** For comma-separated string inputs, the function splits the string by commas and joins the parts with spaces.
    """
    # ResidueSelection should have to_mpnn function.
    if isinstance(residues, ResidueSelection):
        return residues.to_string(delim=" ")

    # strings:
    if isinstance(residues, str):
        if len(residues.split(",")) > 1:
            return " ".join(residues.split(","))
        return residues
    raise ValueError(f"Residues must be of type str or ResidueSelection. Type: {type(residues)}")



[docs]
def write_to_json(input_dict: dict, output_path:str) -> str:
    '''Writes json serializable :input_dict: into file and returns path to file. Returns path to json file :output_path:'''
    with open(output_path, 'w', encoding="UTF-8") as f:
        json.dump(input_dict, f)
    return output_path



[docs]
def create_distance_conservation_bias_cmds(poses: Poses, prefix: str, center: Union[str,ResidueSelection], shell_distances: list = [10, 15, 20, 1000], shell_biases: list = [0, 0.25, 0.5, 1], center_atoms: list[str] = None, noncenter_atoms: list[str] = ["CA"], jobstarter: JobStarter = None, overwrite: bool = False) -> Poses:
    """
    Creates distance-based conservation bias commands for LigandMPNN runs and saves them in a poses DataFrame column.

    This function creates commands for conservation bias based on shells with a distance from a given ResidueSelection.

    Parameters:
        poses (Poses): The Poses object containing the protein structures.
        prefix (str): A prefix used as output folder and column name in the poses DataFrame to save the commands.
        center (str or ResidueSelection): The center of the shells. Can be either a single ResidueSelection or a poses DataFrame column containing ResidueSelections.
        shell_distances (list, optional): The shells for creating conservation bias. The numbers represent the distance from the center. Defaults to [10, 15, 20, 100].
        shell_biases (list, optional): The strength of the bias for each shell. Defaults to [0, 0.25, 0.5, 1].
        center_atoms (list, optional): The atom names of the center ResidueSelection which should be used for shell distance calculations. None means all atoms are selected. Defaults to None.
        noncenter_atoms (list, optional): The atom names of noncenter residues which should be used for shell distance calculations. None means all atoms are selected. Defaults to ["CA"].
        jobstarter (JobStarter, optional): An instance of the JobStarter class, which manages job execution. Defaults to None.
        overwrite (bool, optional): If True, overwrite existing output files. Defaults to False.

    Returns:
        Poses: The updated Poses object containing the commands for conservation bias in a poses DataFrame column.

    Raises:
        KeyError: If shell_distances are not sorted in ascending order.

    Examples:
        Here is an example of how to use the `create_distance_conservation_bias_cmds` method:

        .. code-block:: python

            from protflow.poses import Poses
            from protflow.jobstarters import LocalJobStarter
            from protflow.residue_selectors import ResidueSelection
            from ligandmpnn import create_distance_conservation_bias_cmds

            # Create instances of necessary classes
            poses = Poses(poses=".", glob_suffix="*.pdb)
            jobstarter = LocalJobStarter()
            central_selection = ResidueSelection("A23")

            # Run the diffusion process
            poses = create_distance_conservation_bias_cmds(
                poses=poses,
                prefix="prefix",
                prefix="conservation_bias_cmd",
                jobstarter=jobstarter,
                center=central_selection,
            )

            # Access and process the results
            print(poses.df["prefix"])

    Further Details:
        - **Setup and Execution:** The method ensures that the environment is correctly set up, directories are prepared, and necessary commands are constructed and executed.
        - **Output Management:** The method handles the collection and processing of output data, ensuring that results are organized and accessible for further analysis.
        - **Customization:** Extensive customization options are provided through parameters, allowing users to tailor the process to their specific needs.

    This method is designed to streamline the creation of distance-based conservation bias commands  for LigandMPNN within the ProtFlow framework, making it easier for researchers and developers to perform and analyze protein design simulations.
    """
    from protflow.tools.residue_selectors import DistanceSelector
    from protflow.metrics.selection_identity import SelectionIdentity

    def create_bias_dict(resdict: dict, bias: float):
        bias_dict = {}
        for res, id_ in resdict.items():
            bias_dict[res] = {id_: bias}
        return bias_dict

    def combine_dicts(dict_list: list[dict]):
        out_dict = {}
        for in_dict in dict_list:
            out_dict.update(in_dict)
        return out_dict

    # check input
    if not shell_distances == sorted(shell_distances):
        raise KeyError(f"shell_distances must be in ascending order like {sorted(shell_distances)}, not {shell_distances}!")

    # set python path
    python_path = os.path.join(load_config_path(require_config(), "PROTFLOW_ENV"), "python")

    # create output directory
    os.makedirs(working_dir := os.path.abspath(os.path.join(poses.work_dir, prefix)), exist_ok=True)
    original_work_dir = poses.work_dir
    poses.set_work_dir(working_dir)

    # initialize residue selector and id metric
    selector = DistanceSelector(center=center)
    selid = SelectionIdentity(python_path=python_path, jobstarter=jobstarter, overwrite=overwrite)

    # iterate over all shell distances
    for index, (dist, bias) in enumerate(zip(shell_distances, shell_biases)):
        # select residues in shell
        selector.select(prefix=f"{prefix}_selection_{dist}", poses=poses, distance=dist, operator="<=", center_atoms=center_atoms, noncenter_atoms=noncenter_atoms, include_center=False)
        if index == 0:
            poses.df[f"{prefix}_selected_residues"] = poses.df[f"{prefix}_selection_{dist}"]
        else:
            # subtract previous selections in shells that are not the innermost shell
            poses.df[f"{prefix}_selection_{dist}"] = poses.df[f"{prefix}_selection_{dist}"] - poses.df[f"{prefix}_selected_residues"]
            # add current selection to overall selection
            poses.df[f"{prefix}_selected_residues"] = poses.df[f"{prefix}_selected_residues"] + poses.df[f"{prefix}_selection_{dist}"]

        # determine residue ids
        selid.run(poses=poses, prefix=f"{prefix}_selection_{dist}_ids", residue_selection=f"{prefix}_selection_{dist}", onelettercode=True)

        # create bias dictionary
        poses.df[f"{prefix}_{dist}_bias_dicts"] = poses.df.apply(
            lambda row: create_bias_dict(row[f"{prefix}_selection_{dist}_ids_selection_identities"], bias), axis=1
        )

    # write bias dict for all shells
    poses.df[f"{prefix}_overall_bias_dict"] = poses.df.apply(lambda row: combine_dicts([row[f"{prefix}_{dist}_bias_dicts"] for dist in shell_distances]), axis=1)

    # write json files for each dict
    os.makedirs(dict_dir := os.path.join(working_dir, "bias_dicts"), exist_ok=True)
    dict_paths = []
    for _, row in poses.df.iterrows():
        with open(dict_path := os.path.join(dict_dir, f"{row['poses_description']}_bias_dict.json"), 'w', encoding="UTF-8") as f:
            json.dump(row[f"{prefix}_overall_bias_dict"], f, indent=4)
        dict_paths.append(dict_path)

    # save paths to json files in poses dataframe
    poses.df[f"{prefix}_overall_bias_json"] = dict_paths

    # save cmds for LigandMPNN in poses dataframe
    poses.df[f"{prefix}"] = [f"--bias_AA_per_residue {dict_path}" for dict_path in dict_paths]

    # clean dataframe
    cols_to_drop = [
        f"{prefix}_selection_{dist}" for dist in shell_distances
    ] + [
        f"{prefix}_selection_{dist}_ids_description" for dist in shell_distances
    ] + [
        f"{prefix}_selection_{dist}_ids_selection_identities" for dist in shell_distances
    ] + [
        f"{prefix}_selection_{dist}_ids_location" for dist in shell_distances
    ] + [
        f"{prefix}_{dist}_bias_dicts" for dist in shell_distances] + [f"{prefix}_selected_residues"]
    poses.df.drop(cols_to_drop, axis=1, inplace=True)

    # revert to original work dir
    poses.set_work_dir(original_work_dir)

    return poses