Write a Generic Metric

Use protflow.metrics.generic_metric_runner.GenericMetric when you need one function call per pose and the output can be represented as a single JSON-serializable value per pose.

Typical use-cases:

quick structural sanity checks that are not worth a dedicated runner
custom metrics used only in one project
wrapping an existing helper function so it can run through a ProtFlow JobStarter

Do not use GenericMetric when the metric:

needs multiple poses at once
produces non-JSON output
should return several columns instead of one <prefix>_data column
needs different options for different poses in one run

What the function must look like

The worker process imports the target with importlib.import_module(module) and then calls function(pose_path, **options) for each pose. That means:

the first argument must be the pose path as a string
additional parameters must be accepted as keyword arguments whose names match the keys in options
the return value must be JSON-serializable
the function must live in an importable Python module

Notebook-only functions, lambdas, and functions defined only in an interactive session will not work, because worker jobs import the function again in a new Python process.
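The call pattern described above can be sketched with the standard library alone. This is a minimal illustration, not ProtFlow's actual worker code: the helper name call_metric is hypothetical, and functions from os.path stand in for a real metric only because they are importable everywhere and take a path string as their first argument.

```python
import importlib
import json


def call_metric(module: str, function: str, pose_path: str, options: dict):
    """Import module.function in this process and call it for one pose.

    Hypothetical helper that mirrors the documented call pattern.
    """
    target = getattr(importlib.import_module(module), function)
    result = target(pose_path, **options)
    # json.dumps raises TypeError if the return value is not JSON-serializable.
    json.dumps(result)
    return result


pose = "/data/input_pdbs/design_0001.pdb"

# Stand-in "metric" without options.
name = call_metric("os.path", "basename", pose, {})
print(name)  # design_0001.pdb

# Stand-in "metric" with one keyword option taken from the options dict.
rel = call_metric("os.path", "relpath", pose, {"start": "/data"})
print(rel)
```

A lambda or a notebook-defined function has no module path that importlib.import_module could resolve in a fresh process, which is why such functions fail here.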

Make the module importable

The module=... argument is a dotted Python import path, as you would write it in an import statement. It is not a file path and not a shell PATH lookup.

Your metric function must be importable by the Python interpreter used by GenericMetric:

the interpreter given through python_path, or
the default ProtFlow environment interpreter from PROTFLOW_ENV

Valid ways to make the module importable:

put the function into protflow itself
put it into another package that is installed in the same environment
install your own package into that environment
export PYTHONPATH before launching ProtFlow so that the directory containing your package (its parent directory, not the package directory itself) is on Python’s import path

If you launch jobs through SLURM, the compute nodes also need that same Python environment or PYTHONPATH setup.
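Before starting a run, it can help to check that the target interpreter really can import the module. The sketch below is one way to do that with the standard library; the helper name is_importable is hypothetical and not part of ProtFlow, and sys.executable is used only so the example runs anywhere.

```python
import subprocess
import sys


def is_importable(python_path: str, module: str) -> bool:
    """Return True if `module` can be imported by the interpreter at `python_path`."""
    check = "import importlib, sys; importlib.import_module(sys.argv[1])"
    completed = subprocess.run(
        [python_path, "-c", check, module],
        capture_output=True,
        text=True,
    )
    # A failed import raises in the child process and gives a non-zero exit code.
    return completed.returncode == 0


# In practice, pass the interpreter that GenericMetric will use instead of
# sys.executable.
print(is_importable(sys.executable, "json"))  # True
print(is_importable(sys.executable, "my_project_metrics.custom_metrics"))
```

For SLURM runs, the same check is only meaningful when executed on a compute node, since the login node may have a different environment.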

Example file layout

If you store your function in:

my_project_metrics/
|-- __init__.py
`-- custom_metrics.py

then the corresponding module string is:

module="my_project_metrics.custom_metrics"

Example metric with options

The example below counts the C-alpha atoms on a selected chain that have a B-factor of at least min_bfactor. For AlphaFold-style structures, where per-residue confidence values are stored in the B-factor column, this gives a quick proxy for “confident residues on chain A”.

# file: my_project_metrics/custom_metrics.py
from Bio.PDB import PDBParser


def count_confident_ca_atoms(pose: str, chain: str, min_bfactor: float = 0.0) -> int:
    """Return the number of CA atoms on one chain at or above a B-factor threshold."""
    structure = PDBParser(QUIET=True).get_structure("pose", pose)
    model = next(structure.get_models())

    if chain not in model:
        raise KeyError(f"Chain {chain!r} not found in {pose}")

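    # Note: matching on the atom name alone also counts a calcium ion named "CA"
    # if one is assigned to this chain.
    return sum(
        1
        for atom in model[chain].get_atoms()
        if atom.id == "CA" and atom.bfactor >= min_bfactor
    )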

This function satisfies the GenericMetric contract:

pose is the first argument
chain and min_bfactor are regular keyword arguments; chain has no default, so options must always supply it
the return value is an integer, which is JSON-serializable

Run the metric in ProtFlow

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.metrics.generic_metric_runner import GenericMetric

jobstarter = LocalJobStarter(max_cores=8)

poses = Poses(
    poses="/data/input_pdbs",
    glob_suffix="*.pdb",
    work_dir="generic_metric_example",
    jobstarter=jobstarter,
)

confident_len = GenericMetric(
    module="my_project_metrics.custom_metrics",
    function="count_confident_ca_atoms",
    options={"chain": "A", "min_bfactor": 70.0},
    jobstarter=jobstarter,
)

poses = confident_len.run(poses=poses, prefix="chain_a_confident_len")
print(poses.df[["poses_description", "chain_a_confident_len_data"]])

What happens during run():

ProtFlow creates <work_dir>/<prefix>.
It splits the pose list into chunks.
It starts one worker command per chunk through the selected JobStarter.
Each worker imports my_project_metrics.custom_metrics and calls count_confident_ca_atoms(pose, chain="A", min_bfactor=70.0) for every pose in its chunk.
The results are merged back into poses.df.
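The split, work, and merge steps above can be sketched in plain Python. This is an illustration under stated assumptions, not ProtFlow's implementation: the helper names split_into_chunks and run_chunk and the toy metric path_length are hypothetical, the chunking strategy ProtFlow actually uses may differ, and the chunks run sequentially here instead of through a JobStarter.

```python
from typing import Any, Callable


def split_into_chunks(items: list, n_chunks: int) -> list:
    """Distribute items over at most n_chunks lists of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    return [items[i::n_chunks] for i in range(n_chunks)]


def run_chunk(func: Callable[..., Any], pose_paths: list, options: dict) -> dict:
    """What one worker does: call func(pose, **options) for each pose in its chunk."""
    return {pose: func(pose, **options) for pose in pose_paths}


def path_length(pose: str, offset: int = 0) -> int:
    """Toy metric used only for this illustration."""
    return len(pose) + offset


poses = [f"pose_{i:02d}.pdb" for i in range(5)]
merged = {}
for chunk in split_into_chunks(poses, n_chunks=2):
    # The same options dictionary is passed to every pose in every chunk.
    merged.update(run_chunk(path_length, chunk, {"offset": 1}))

print(merged["pose_00.pdb"])  # 12
```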

Results and files

After the run:

the metric value is in chain_a_confident_len_data
ProtFlow also stores chain_a_confident_len_description and chain_a_confident_len_location
the cached runner scorefile is written inside the run directory (<work_dir>/<prefix>)

If you run the same prefix again and do not set overwrite=True, ProtFlow reuses the cached scorefile instead of recomputing the metric. Changes to the function or its options are therefore not picked up until you set overwrite=True.

Important limits

GenericMetric applies one shared options dictionary to all poses in a single run. It does not support pose-specific options columns.

If you need per-pose options, use one of these approaches:

split the poses into subsets and run GenericMetric multiple times
write a dedicated runner for that metric
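For the first approach, the poses have to be grouped so that each group shares one options dictionary. The sketch below shows only that grouping step in plain Python. The helper name group_by_options and the input format (a dict mapping pose path to options) are hypothetical, and building a Poses subset from each group is ProtFlow-specific and not shown.

```python
import json
from collections import defaultdict


def group_by_options(pose_options: dict) -> list:
    """Group pose paths that share identical options.

    pose_options maps pose path -> options dict (a hypothetical input format).
    Returns a list of (options, [pose paths]) pairs, one per future run.
    """
    groups = defaultdict(list)
    for pose, options in pose_options.items():
        # A sorted JSON dump is a hashable key for an otherwise unhashable dict.
        groups[json.dumps(options, sort_keys=True)].append(pose)
    return [(json.loads(key), poses) for key, poses in groups.items()]


pose_options = {
    "a.pdb": {"chain": "A", "min_bfactor": 70.0},
    "b.pdb": {"chain": "B", "min_bfactor": 70.0},
    "c.pdb": {"chain": "A", "min_bfactor": 70.0},
}
for options, subset in group_by_options(pose_options):
    print(options, subset)
```

Each group would then go through its own GenericMetric run. Because the cached scorefile is reused per prefix, giving every run a distinct prefix is the safer choice.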

Returning a scalar is usually the cleanest option. Returning a list or dict is allowed as long as it is JSON-serializable, but it will still be stored inside a single <prefix>_data column.
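What "JSON-serializable" means in practice can be checked with the standard json module. The sketch below assumes that a stored result behaves like a value that has gone through json.dumps and json.loads; whether ProtFlow uses exactly these calls internally is an assumption, and the helper name roundtrip is hypothetical.

```python
import json


def roundtrip(value):
    """Serialize to JSON and back, as a stand-in for storing a metric result."""
    return json.loads(json.dumps(value))


print(roundtrip(42))                            # 42
print(roundtrip({"n_ca": 120, "ids": (1, 2)}))  # {'n_ca': 120, 'ids': [1, 2]}
print(roundtrip({1: "first"}))                  # {'1': 'first'}

# Sets are not JSON-serializable, so a metric returning one would fail.
try:
    json.dumps({"chains": {"A", "B"}})
except TypeError as err:
    print(f"not serializable: {err}")
```

Tuples come back as lists and non-string dictionary keys come back as strings, so downstream code should not rely on those types surviving.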