.. _biopython_metrics_tutorial:

Calculate BioPython Geometry Metrics
====================================

Use :class:`protflow.metrics.biopython_metrics.BiopythonMetricRunner` when you
want to calculate simple atom-level geometry metrics from BioPython structures.
The runner can calculate several metrics for every pose in one ``run()`` call,
including distances, angles, dihedrals, plane angles, and relative contact order.

This is useful when you want to track geometry such as:

- the distance between two atoms
- the distance from one atom to a backbone axis
- the distance from one atom to a plane
- a bond angle or vector angle
- a dihedral angle
- the angle between two atom-defined planes
- relative contact order from residue-level contacts

The basic idea
--------------

Each metric object describes one score column. For example,
``Distance(name="n_ca_distance", atoms=...)`` will create a score named
``<prefix>_n_ca_distance`` after the runner merges results back into
``poses.df``.

Atom selections are ordered. This is important because metrics interpret atoms
by position. A compact atom ID has the format:

.. code-block:: python

   ("A", 10, "CA")

This means chain ``A``, residue ``10``, atom ``CA``. You can also use full
BioPython-style atom IDs when you need model IDs, hetero residue IDs, or altloc
selection.
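
To make the compact format concrete, the sketch below shows how such an ID can
be resolved against a BioPython-style structure, which is indexed as
``structure[model][chain][residue][atom]``. The ``lookup_atom`` helper and its
``model_id`` default are hypothetical names used for illustration, not part of
ProtFlow, and the runner's worker script may resolve IDs differently.

.. code-block:: python

   def lookup_atom(structure, atom_id, model_id=0):
       """Resolve a compact (chain, residue number, atom name) ID.

       Hypothetical helper: assumes the first model (index 0) and a
       structure that supports nested item access, as Bio.PDB structures do.
       """
       chain_id, residue_number, atom_name = atom_id
       return structure[model_id][chain_id][residue_number][atom_name]


   # Stand-in for a parsed structure: nested dicts with the same access pattern.
   toy_structure = {0: {"A": {10: {"CA": (1.0, 2.0, 3.0)}}}}
   ca_atom = lookup_atom(toy_structure, ("A", 10, "CA"))

With BioPython installed, ``structure`` would typically come from
``Bio.PDB.PDBParser(QUIET=True).get_structure(...)``, and the returned object
would be an atom rather than a coordinate tuple.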

Setup
-----

.. code-block:: python

   from protflow.poses import Poses
   from protflow.jobstarters import LocalJobStarter
   from protflow.residues import AtomSelection
   from protflow.metrics.biopython_metrics import (
       Angle,
       BiopythonMetricRunner,
       ContactOrder,
       Dihedral,
       Distance,
       PlaneAngle,
   )

   jobstarter = LocalJobStarter(max_cores=4)

   poses = Poses(
       poses="data/input_pdbs",
       glob_suffix="*.pdb",
       work_dir="biopython_metrics_example",
       jobstarter=jobstarter,
   )

The runner uses the Python interpreter from ``PROTFLOW_ENV`` and the auxiliary
script directory from ``AUXILIARY_RUNNER_SCRIPTS_DIR`` in the ProtFlow config.

Store atom selections in ``poses.df``
-------------------------------------

You can pass a fixed atom selection directly to a metric, or you can store a
different atom selection for each pose in ``poses.df`` and pass the column name
to the metric.

The example below creates several atom-selection columns. The selections are
the same for every pose here, but in real workflows these lists can be
different for every row.

.. code-block:: python

   # Two atoms: point-to-point distance.
   poses.df["n_ca_atoms"] = [
       AtomSelection((("A", 1, "N"), ("A", 1, "CA")))
       for _ in poses
   ]

   # Three atoms: distance from the first atom to the line through atoms 2 and 3.
   poses.df["point_to_axis_atoms"] = [
       AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C")))
       for _ in poses
   ]

   # Four atoms: distance from atom 1 to the plane through atoms 2, 3, and 4.
   poses.df["point_to_plane_atoms"] = [
       AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"), ("A", 1, "O")))
       for _ in poses
   ]

   # Four atoms: angle or dihedral example.
   poses.df["n_ca_c_o_atoms"] = [
       AtomSelection((("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"), ("A", 1, "O")))
       for _ in poses
   ]

   # Six atoms: two planes, each defined by three atoms.
   poses.df["plane_angle_atoms"] = [
       AtomSelection((
           ("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"),
           ("A", 2, "N"), ("A", 2, "CA"), ("A", 2, "C"),
       ))
       for _ in poses
   ]

When a metric receives ``atoms="n_ca_atoms"``, the runner looks up the
``n_ca_atoms`` column for each pose row and sends that row-specific selection
to the worker script.

Calculate several metrics in one run
------------------------------------

The next block calculates multiple scores in a single ``run()`` call. It
includes three different distance metrics, plus angle, dihedral, plane-angle,
and contact-order metrics.

.. code-block:: python

   metrics = [
       Distance(
           name="n_ca_distance",
           atoms="n_ca_atoms",
       ),
       Distance(
           name="n_to_ca_c_axis_distance",
           atoms="point_to_axis_atoms",
       ),
       Distance(
           name="n_to_ca_c_o_plane_distance",
           atoms="point_to_plane_atoms",
           distance_type="point_plane",
       ),
       Angle(
           name="n_ca_c_angle",
           atoms=(("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C")),
       ),
       Dihedral(
           name="n_ca_c_o_dihedral",
           atoms="n_ca_c_o_atoms",
       ),
       PlaneAngle(
           name="residue_1_2_plane_angle",
           atoms="plane_angle_atoms",
       ),
       ContactOrder(
           name="relative_contact_order",
           contact_distance=8.0,
           contact_atom="CA",
           chains="A",
       ),
   ]

   biopython_metrics = BiopythonMetricRunner()

   poses = biopython_metrics.run(
       poses=poses,
       prefix="bio_geom",
       jobstarter=jobstarter,
       metrics=metrics,
       overwrite=True,
   )

   print(
       poses.df[
           [
               "poses_description",
               "bio_geom_n_ca_distance",
               "bio_geom_n_to_ca_c_axis_distance",
               "bio_geom_n_to_ca_c_o_plane_distance",
               "bio_geom_n_ca_c_angle",
               "bio_geom_n_ca_c_o_dihedral",
               "bio_geom_residue_1_2_plane_angle",
               "bio_geom_relative_contact_order",
           ]
       ]
   )

What the metrics mean
---------------------

``Distance`` supports several atom counts:

- 2 atoms: point-to-point distance
- 3 atoms: distance from atom 1 to the line through atoms 2 and 3
- 4 atoms with ``distance_type="vector_vector"``: distance between two lines
- 4 atoms with ``distance_type="point_plane"``: distance from atom 1 to a plane

Four-atom distances are ambiguous, so specify ``distance_type`` explicitly.
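
The four distance modes correspond to standard vector formulas. The following
is a minimal standard-library sketch of that geometry, not ProtFlow's
implementation. All function names are hypothetical, and it assumes that
distances are unsigned, that ``vector_vector`` treats atoms 1-2 and 3-4 as two
infinite lines, and that the plane passes through atoms 2, 3, and 4.

.. code-block:: python

   import math


   def sub(a, b):
       return tuple(x - y for x, y in zip(a, b))


   def dot(a, b):
       return sum(x * y for x, y in zip(a, b))


   def cross(a, b):
       return (
           a[1] * b[2] - a[2] * b[1],
           a[2] * b[0] - a[0] * b[2],
           a[0] * b[1] - a[1] * b[0],
       )


   def norm(a):
       return math.sqrt(dot(a, a))


   def point_point(p, q):
       return norm(sub(p, q))


   def point_line(p, a, b):
       # Distance from p to the infinite line through a and b.
       direction = sub(b, a)
       return norm(cross(sub(p, a), direction)) / norm(direction)


   def point_plane(p, a, b, c):
       # Unsigned distance from p to the plane through a, b, and c.
       normal = cross(sub(b, a), sub(c, a))
       return abs(dot(sub(p, a), normal)) / norm(normal)


   def line_line(a, b, c, d):
       # Shortest distance between the lines a-b and c-d.
       normal = cross(sub(b, a), sub(d, c))
       if norm(normal) < 1e-12:
           # Parallel lines: any point on one line gives the distance.
           return point_line(c, a, b)
       return abs(dot(sub(c, a), normal)) / norm(normal)

The same four coordinates give different values under ``line_line`` and
``point_plane``, which is why a four-atom ``Distance`` needs an explicit
``distance_type``.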

``Angle`` supports:

- 3 atoms: angle formed by atoms 1-2-3
- 4 atoms: angle between vectors atom 1 -> atom 2 and atom 3 -> atom 4
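
The two angle modes can be sketched as follows. This is an illustrative
standard-library example rather than the runner's code. The function names are
hypothetical, and it assumes that the three-atom angle has its vertex at atom 2
and that results are reported in degrees.

.. code-block:: python

   import math


   def sub(a, b):
       return tuple(x - y for x, y in zip(a, b))


   def vector_angle(u, v):
       """Angle between two vectors in degrees (unit is an assumption)."""
       dot_uv = sum(x * y for x, y in zip(u, v))
       norm_u = math.sqrt(sum(x * x for x in u))
       norm_v = math.sqrt(sum(x * x for x in v))
       # Clamp to [-1, 1] so rounding errors cannot break acos().
       cos_theta = max(-1.0, min(1.0, dot_uv / (norm_u * norm_v)))
       return math.degrees(math.acos(cos_theta))


   def angle_three_atoms(a, b, c):
       # Angle a-b-c with the vertex at the middle atom b.
       return vector_angle(sub(a, b), sub(c, b))


   def angle_four_atoms(a, b, c, d):
       # Angle between the vectors a -> b and c -> d.
       return vector_angle(sub(b, a), sub(d, c))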

``Dihedral`` expects 4 atoms and returns a signed dihedral angle.
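
A signed dihedral can be computed with the common ``atan2`` formulation shown
below. This is a sketch under stated assumptions, not the runner's code: the
function name is hypothetical, the result is in degrees, and the sign
convention used here may differ from the one ProtFlow reports.

.. code-block:: python

   import math


   def sub(a, b):
       return tuple(x - y for x, y in zip(a, b))


   def dot(a, b):
       return sum(x * y for x, y in zip(a, b))


   def cross(a, b):
       return (
           a[1] * b[2] - a[2] * b[1],
           a[2] * b[0] - a[0] * b[2],
           a[0] * b[1] - a[1] * b[0],
       )


   def dihedral(p1, p2, p3, p4):
       """Signed dihedral in degrees around the p2-p3 bond."""
       b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
       n1 = cross(b1, b2)  # normal of the plane p1-p2-p3
       n2 = cross(b2, b3)  # normal of the plane p2-p3-p4
       y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
       x = dot(n1, n2)
       return math.degrees(math.atan2(y, x))

In this sketch, a cis arrangement gives 0, a trans arrangement gives a
magnitude of 180, and mirroring the fourth atom across the plane of the first
three flips the sign.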

``PlaneAngle`` expects 6 atoms. Atoms 1-3 define the first plane, and atoms 4-6
define the second plane. By default, the metric returns the acute angle between
the two planes.
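
The acute plane angle can be sketched from the two plane normals. Taking the
absolute value of the cosine makes the result independent of atom order within
each plane. The function name is hypothetical and the degree unit is an
assumption. This is an illustration, not ProtFlow's implementation.

.. code-block:: python

   import math


   def sub(a, b):
       return tuple(x - y for x, y in zip(a, b))


   def dot(a, b):
       return sum(x * y for x, y in zip(a, b))


   def cross(a, b):
       return (
           a[1] * b[2] - a[2] * b[1],
           a[2] * b[0] - a[0] * b[2],
           a[0] * b[1] - a[1] * b[0],
       )


   def acute_plane_angle(a1, a2, a3, b1, b2, b3):
       """Acute angle in degrees between plane a1-a2-a3 and plane b1-b2-b3."""
       n1 = cross(sub(a2, a1), sub(a3, a1))
       n2 = cross(sub(b2, b1), sub(b3, b1))
       cos_theta = abs(dot(n1, n2)) / math.sqrt(dot(n1, n1) * dot(n2, n2))
       # abs() folds the angle into [0, 90]; min() guards against rounding.
       return math.degrees(math.acos(min(1.0, cos_theta)))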

``ContactOrder`` calculates relative contact order: the sum of intrachain
sequence separations over all contacting residue pairs, divided by the product
of protein length and contact count. By default, two residues are in contact
when their ``CA`` atoms are within 8 Å, residue pairs must be separated by at
least one sequence position, and all chains are used. Pass ``chains="A"`` to
restrict the calculation to a single chain, or adjust ``contact_distance``,
``contact_atom``, and ``min_sequence_separation`` for a different contact
definition.
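
The definition above can be sketched for a single chain as follows. This is an
illustration rather than ProtFlow's implementation. The function name and input
layout are hypothetical, and it assumes that "within" means less than or equal
to the cutoff and that a chain without contacts scores ``0.0``.

.. code-block:: python

   import math
   from itertools import combinations


   def relative_contact_order(ca_coords, contact_distance=8.0,
                              min_sequence_separation=1):
       """Relative contact order for one chain.

       ca_coords: list of (residue_number, (x, y, z)) tuples, one per residue.
       """
       total_separation = 0
       n_contacts = 0
       for (i, xyz_i), (j, xyz_j) in combinations(ca_coords, 2):
           separation = abs(j - i)
           if separation < min_sequence_separation:
               continue
           if math.dist(xyz_i, xyz_j) <= contact_distance:
               total_separation += separation
               n_contacts += 1
       if n_contacts == 0:
           return 0.0  # assumption: no contacts means a score of zero
       return total_separation / (len(ca_coords) * n_contacts)


   # Four residues on a straight line, spaced 3.8 Å apart.
   chain = [(n, (3.8 * (n - 1), 0.0, 0.0)) for n in range(1, 5)]
   rco = relative_contact_order(chain)

Here five residue pairs are in contact with a summed separation of 7, so the
result is 7 / (4 * 5) = 0.35. Raising ``min_sequence_separation`` removes
short-range pairs from both the sum and the contact count.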

Results and caching
-------------------

The runner writes a scorefile into ``<poses.work_dir>/<prefix>`` and merges the
results into ``poses.df``. Every score column is prefixed with the run prefix.

For the example above, the output columns include:

.. code-block:: text

   bio_geom_n_ca_distance
   bio_geom_n_to_ca_c_axis_distance
   bio_geom_n_to_ca_c_o_plane_distance
   bio_geom_n_ca_c_angle
   bio_geom_n_ca_c_o_dihedral
   bio_geom_residue_1_2_plane_angle
   bio_geom_relative_contact_order

If you run the same prefix again with ``overwrite=False``, ProtFlow reuses the
cached scorefile instead of recalculating the BioPython metrics.