Calculate BioPython Geometry Metrics

Use protflow.metrics.biopython_metrics.BiopythonMetricRunner when you want to calculate simple atom-level geometry metrics from BioPython structures. The runner can calculate several metrics for every pose in one run() call, including distances, angles, dihedrals, and plane angles.

This is useful when you want to track geometry such as:

the distance between two atoms
the distance from one atom to an axis defined by two other atoms, such as a backbone axis
the distance from one atom to a plane
a bond angle or vector angle
a dihedral angle
the angle between two atom-defined planes

The basic idea

Each metric object describes one score column. For example, Distance(name="n_ca_distance", atoms=...) will create a score named <prefix>_n_ca_distance after the runner merges results back into poses.df.

Atom selections are ordered. This is important because metrics interpret atoms by position. A compact atom ID has the format:

("A", 10, "CA")

This means chain A, residue 10, atom CA. You can also use full BioPython-style atom IDs when you need model IDs, hetero residue IDs, or altloc selection.
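
To make the difference between the two ID styles concrete, the sketch below expands a compact ID into the tuple layout that BioPython's Atom.get_full_id() returns: structure ID, model ID, chain ID, residue ID (hetero flag, residue number, insertion code), and atom ID (atom name, altloc). The helper expand_compact_id is hypothetical and not part of ProtFlow, and its default values are assumptions. Check the AtomSelection documentation for the exact full-ID layout ProtFlow accepts.

```python
def expand_compact_id(compact, structure_id="pose", model_id=0,
                      hetflag=" ", icode=" ", altloc=" "):
    """Hypothetical helper: expand (chain, resseq, atom) into a BioPython-style full ID."""
    chain, resseq, atom_name = compact
    # Layout of Atom.get_full_id() in BioPython:
    # (structure_id, model_id, chain_id, (hetflag, resseq, icode), (atom_name, altloc))
    return (structure_id, model_id, chain, (hetflag, resseq, icode), (atom_name, altloc))


full_id = expand_compact_id(("A", 10, "CA"))
print(full_id)
```

The blank hetero flag, insertion code, and altloc correspond to an ordinary residue with a single conformation, which is why the compact form is enough in most cases.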

Setup

from protflow.poses import Poses
from protflow.jobstarters import LocalJobStarter
from protflow.residues import AtomSelection
from protflow.metrics.biopython_metrics import (
    Angle,
    BiopythonMetricRunner,
    Dihedral,
    Distance,
    PlaneAngle,
)

jobstarter = LocalJobStarter(max_cores=4)

poses = Poses(
    poses="data/input_pdbs",
    glob_suffix="*.pdb",
    work_dir="biopython_metrics_example",
    jobstarter=jobstarter,
)

The runner uses the Python interpreter from PROTFLOW_ENV and the auxiliary script directory from AUXILIARY_RUNNER_SCRIPTS_DIR in the ProtFlow config.

Store atom selections in poses.df

You can pass a fixed atom selection directly to a metric, or you can store a different atom selection for each pose in poses.df and pass the column name to the metric.

The example below creates several atom-selection columns. The selections are the same for every pose here, but in real workflows each row can hold a different selection.

# Two atoms: point-to-point distance.
poses.df["n_ca_atoms"] = [
    AtomSelection((("A", 1, "N"), ("A", 1, "CA")))
    for _ in poses
]

# Three atoms: distance from the first atom to the line through atoms 2 and 3.
poses.df["point_to_axis_atoms"] = [
    AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C")))
    for _ in poses
]

# Four atoms: distance from atom 1 to the plane through atoms 2, 3, and 4.
poses.df["point_to_plane_atoms"] = [
    AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"), ("A", 1, "O")))
    for _ in poses
]

# Four atoms: angle or dihedral example.
poses.df["n_ca_c_o_atoms"] = [
    AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"), ("A", 1, "O")))
    for _ in poses
]

# Six atoms: two planes, each defined by three atoms.
poses.df["plane_angle_atoms"] = [
    AtomSelection((
        ("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"),
        ("A", 2, "N"), ("A", 2, "CA"), ("A", 2, "C"),
    ))
    for _ in poses
]

When a metric receives atoms="n_ca_atoms", the runner looks up the n_ca_atoms column for each pose row and sends that row-specific selection to the worker script.
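
The lookup rule can be pictured with a few lines of plain Python. This is a minimal sketch of the idea, not ProtFlow's implementation: resolve_atoms is a hypothetical helper, and plain dicts stand in for rows of poses.df.

```python
def resolve_atoms(atoms, row):
    """Hypothetical helper: return the atoms a metric should use for one pose row."""
    if isinstance(atoms, str):
        # A string is treated as a column name: take the row-specific selection.
        return row[atoms]
    # Anything else is treated as a fixed selection shared by all poses.
    return atoms


# Dicts standing in for two rows of poses.df (illustrative values).
rows = [
    {"poses_description": "pose_0001", "n_ca_atoms": (("A", 1, "N"), ("A", 1, "CA"))},
    {"poses_description": "pose_0002", "n_ca_atoms": (("A", 5, "N"), ("A", 5, "CA"))},
]
fixed = (("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"))

per_pose = [resolve_atoms("n_ca_atoms", row) for row in rows]  # differs per row
shared = [resolve_atoms(fixed, row) for row in rows]           # identical for all rows
```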

Calculate several metrics in one run

The next block calculates multiple scores in a single run() call. It includes three different distance metrics, plus angle, dihedral, and plane-angle metrics.

metrics = [
    Distance(
        name="n_ca_distance",
        atoms="n_ca_atoms",
    ),
    Distance(
        name="n_to_ca_c_axis_distance",
        atoms="point_to_axis_atoms",
    ),
    Distance(
        name="n_to_ca_c_o_plane_distance",
        atoms="point_to_plane_atoms",
        distance_type="point_plane",
    ),
    Angle(
        name="n_ca_c_angle",
        atoms=(("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C")),
    ),
    Dihedral(
        name="n_ca_c_o_dihedral",
        atoms="n_ca_c_o_atoms",
    ),
    PlaneAngle(
        name="residue_1_2_plane_angle",
        atoms="plane_angle_atoms",
    ),
]

biopython_metrics = BiopythonMetricRunner()

poses = biopython_metrics.run(
    poses=poses,
    prefix="bio_geom",
    jobstarter=jobstarter,
    metrics=metrics,
    overwrite=True,
)

print(
    poses.df[
        [
            "poses_description",
            "bio_geom_n_ca_distance",
            "bio_geom_n_to_ca_c_axis_distance",
            "bio_geom_n_to_ca_c_o_plane_distance",
            "bio_geom_n_ca_c_angle",
            "bio_geom_n_ca_c_o_dihedral",
            "bio_geom_residue_1_2_plane_angle",
        ]
    ]
)

What the metrics mean

Distance supports several atom counts:

2 atoms: point-to-point distance
3 atoms: distance from atom 1 to the line through atoms 2 and 3
4 atoms with distance_type="vector_vector": distance between two lines
4 atoms with distance_type="point_plane": distance from atom 1 to a plane

Four-atom distances are ambiguous, so specify distance_type explicitly.
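
The geometry behind these four cases can be written out in plain Python. This is a sketch under stated assumptions, not the runner's code: the function names are illustrative, lines are treated as infinite, the point-to-plane distance is unsigned, and the four-atom line case pairs atoms 1-2 and 3-4.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def point_point(p, q):
    return norm(sub(p, q))


def point_line(p, a, b):
    # Distance from p to the infinite line through a and b.
    d = sub(b, a)
    return norm(cross(sub(p, a), d)) / norm(d)


def point_plane(p, a, b, c):
    # Unsigned distance from p to the plane through a, b, and c.
    n = cross(sub(b, a), sub(c, a))
    return abs(dot(sub(p, a), n)) / norm(n)


def line_line(a, b, c, d):
    # Shortest distance between the line through a, b and the line through c, d.
    n = cross(sub(b, a), sub(d, c))
    if norm(n) < 1e-12:  # parallel lines: fall back to point-to-line
        return point_line(c, a, b)
    return abs(dot(sub(c, a), n)) / norm(n)


origin, x1, y1 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
p = (0.0, 3.0, 4.0)
d_point = point_point(p, origin)           # 2 atoms
d_axis = point_line(p, origin, x1)         # 3 atoms: p to the x-axis
d_plane = point_plane(p, origin, x1, y1)   # 4 atoms: p to the xy-plane
d_lines = line_line(origin, x1, (0.0, 0.0, 2.0), (0.0, 1.0, 2.0))  # 4 atoms: two skew lines
```

The same four atoms give different numbers under the two four-atom readings, which is why distance_type has to be stated.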

Angle supports:

3 atoms: angle formed by atoms 1-2-3
4 atoms: angle between vectors atom 1 -> atom 2 and atom 3 -> atom 4
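
Both angle cases reduce to the angle between two vectors. The sketch below assumes the result is reported in degrees, which the text above does not state; the function names are illustrative and not part of ProtFlow.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def vector_angle(u, v):
    # Angle between two vectors in degrees (unit is an assumption).
    cos = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(cos))


def angle_3(a, b, c):
    # Angle at atom 2 (b), between the vectors b -> a and b -> c.
    return vector_angle(sub(a, b), sub(c, b))


def angle_4(a, b, c, d):
    # Angle between the vectors a -> b and c -> d.
    return vector_angle(sub(b, a), sub(d, c))


right_angle = angle_3((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
diagonal = angle_4((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                   (0.0, 0.0, 0.0), (1.0, 1.0, 0.0))
```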

Dihedral expects 4 atoms and returns a signed dihedral angle.
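
A signed dihedral is commonly computed with atan2 on the two outer bonds projected onto the plane perpendicular to the central bond. The sketch below uses that standard formulation and assumes degrees; the sign convention and units used by the runner itself are not stated above and should be checked against a known structure.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees, in the range (-180, 180]."""
    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    b1 = scale(b1, 1.0 / math.sqrt(dot(b1, b1)))  # unit vector along the central bond
    # Remove the components of the outer bonds that lie along the central bond.
    v = sub(b0, scale(b1, dot(b0, b1)))
    w = sub(b2, scale(b1, dot(b2, b1)))
    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


plus = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
minus = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, -1.0, 1.0))
```

Mirroring the fourth atom flips the sign, which is the property that distinguishes a dihedral from a plain angle.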

PlaneAngle expects 6 atoms. Atoms 1-3 define the first plane, and atoms 4-6 define the second plane. By default, the metric returns the acute angle between the two planes.
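
The acute plane angle can be derived from the two plane normals. This is a minimal sketch with illustrative names, assuming degrees: taking the absolute value of the dot product folds the result into the range 0 to 90, so the order of atoms within a plane does not change it.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_normal(a, b, c):
    return cross(sub(b, a), sub(c, a))


def acute_plane_angle(atoms):
    """Acute angle in degrees between the planes of atoms 1-3 and atoms 4-6."""
    n1 = plane_normal(*atoms[:3])
    n2 = plane_normal(*atoms[3:])
    cos = abs(sum(x * y for x, y in zip(n1, n2))) / (math.hypot(*n1) * math.hypot(*n2))
    return math.degrees(math.acos(min(1.0, cos)))


xy_plane = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
tilted = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0))
tilted_reversed = (tilted[0], tilted[2], tilted[1])  # same plane, flipped normal

angle = acute_plane_angle(xy_plane + tilted)
angle_reversed = acute_plane_angle(xy_plane + tilted_reversed)
```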

Results and caching

The runner writes a scorefile into <poses.work_dir>/<prefix> and merges the results into poses.df. Every score column is prefixed with the run prefix.

For the example above, the output columns include:

bio_geom_n_ca_distance
bio_geom_n_to_ca_c_axis_distance
bio_geom_n_to_ca_c_o_plane_distance
bio_geom_n_ca_c_angle
bio_geom_n_ca_c_o_dihedral
bio_geom_residue_1_2_plane_angle

If you run the same prefix again with overwrite=False, ProtFlow reuses the cached scorefile instead of recalculating the BioPython metrics.
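
The caching behavior follows a common pattern: reuse the stored scores unless the caller asks for a recalculation. The sketch below illustrates that pattern only. The function run_with_cache, the JSON format, and the file name scores.json are assumptions made for illustration and do not describe ProtFlow's actual scorefile.

```python
import json
import tempfile
from pathlib import Path


def run_with_cache(scorefile, compute, overwrite=False):
    """Hypothetical helper: reuse a scorefile unless overwrite is requested."""
    if scorefile.is_file() and not overwrite:
        return json.loads(scorefile.read_text())
    scores = compute()
    scorefile.parent.mkdir(parents=True, exist_ok=True)
    scorefile.write_text(json.dumps(scores))
    return scores


calls = []


def compute():
    calls.append(1)  # record every real calculation
    return {"bio_geom_n_ca_distance": 1.46}  # illustrative value


with tempfile.TemporaryDirectory() as tmp:
    scorefile = Path(tmp) / "bio_geom" / "scores.json"
    first = run_with_cache(scorefile, compute)                   # calculates
    second = run_with_cache(scorefile, compute)                  # reuses the file
    third = run_with_cache(scorefile, compute, overwrite=True)   # recalculates

n_calls = len(calls)
```

One practical consequence of this pattern is that changing the metrics list while keeping the same prefix and overwrite=False may return the old scores, so use a new prefix or overwrite=True after editing metric definitions.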