agglovar.pairwise.weights

Weights for pairwise joins.

Provides a mechanism for weighing and scoring pairwise joins. A weight assigned to each pair of matched variants can help to prioritize the matches. For example, if variant with ID “A1” matches “B1” with a weight of 0.5 and “B2” with a weight of 0.75, then the “B2” match might be chosen over the “B1” match for variant “A1”.

Terminology:

  • Weight column: One column used to compute weights, each with their own weight.

  • Weight element: Computes a weight for each row using by multiplying each column by its

    assigned weight and summing.

  • Weight strategy: One or more weight elements, each computing their own weight.

Each weight element extracts a set of columns, multiplies each by a specified weight, and sums the results producing one weight per row of the join table.

For example, one weight strategy might have these column weights:

Column

Weight

ro

0.5

match

0.5

In this case, the weight for each row is computed as (0.5 * ro) + (0.5 * match). Both “ro” and “match” have a maximum value of 1, so the total weighted sum is between 0.0 and 1.0.

The columns attributes are:

  • col: Column name.

  • weight: Column weight.

  • scale: Column scale.

  • missing: Missing value.

The column scale is used to rescale the column to a range between 0.0 and 1.0. It can take several forms:

  • float: Maximum value. Minimun value is 0.0.

  • (float, float): Minimum and maximum values.

  • (float, float, bool): Minimum and maximum values, and a flag for inverting the range

    after scaling.

Values are first clipped to within the [min, max] range, then scaled (i.e. (value - min) / (max - min)). If the range is inverted, then the scaled value is inverted (i.e. 1 - scaled_value). Inverting values is only support if both min and max are defined. Finally, the scaled value is multiplied by the column weight. For columns that are not scaled, no assumptions are made about their range, they are left as-is and multiplied by the weight.

No assumptions are made about weights, they could be negative or greater than 1. A good design should typically keep weights within [0.0, 1.0], although Agglovar allows designers to make choices about weights the developers might not have anticipated. Use this power with care.

Agglovar allows flexible weight strategies where multiple weight elements are used. For example, it might make sense to compute another set of weights on “size_ro” and “offset_prop” for smaller variants.

Any row with a null value in a weight column causes the weighted sum to become null. By default, this causes that weight element to be skipped. For example, assume that “match” might not have been computed, so the column is null for all values. A weight strategy can use multiple elements, one with a “match” weight and one without. In this case, the weight element computed on “match” will be ignored and the weight will automatically be computed on other weight elements.

When multiple match elements are used in a strategy, the maximum computed non-null weight is used. If all weights are null for a row, then it gets 0.0. This strategy can be changed to accept the first non-null weight instead of the maximum. Null behavior can also be set per column and per weight element by specifying a value for “missing”, to fill in missing values. For example, “missing=0.0” if null values in a column should not cause the whole weight element to be discarded. The weight strategy also has a missing value set to 0.0 by default, so if all weights compute null, the result is 0.0. Set missing=None for the strategy to produce nulls for rows where there were no computable weights.

A weight strategy can be specified in a number of ways. Each level of the strategy has an object, WeightColumn, WeightElement, and WeightStrategy. Each one can take an iterator (usually a tuple or a list) or a dictionary. Each of these is fed to the clas constructor.

For example, a weight column could be specified as “(‘ro’, 0.5)” or “{‘col’=’ro’: ‘weight’=0.5}”.

A weight element could be specified as a list of columns, such as “((‘ro’, 0.2), (‘match’, 0.5))” or as a dictionary where with constructor arguments, which would be required to set “missing”, such as “{‘columns’: ((‘ro’, 0.2), (‘match’: 0.5)), ‘missing’: 0.0}”

An example for constructing using tuples:

WeightStrategy(
    *(
        (
            ('ro', 0.2),
            ('size_ro', 0.2),
            ('offset_prop', 0.1, (0.0, 2.0, True)),
            ('match_prop', 0.5),
        ), (
            ('ro', 0.4, None),
            ('size_ro', 0.4, None),
            ('offset_prop', 0.2, (0.0, 2.0, True)),
        ),
    )
)

And an equivalent form using a mix of dicts and tuples, but with some “missing” values set for illustration purposes:

WeightStrategy(
    *(
        {
            'columns': (
                ('ro', 0.2),
                ('size_ro', 0.2),
                ('offset_prop', 0.1, (0.0, 2.0, True)),
                ('match_prop', 0.5),
            ),
            'missing': 0.0,
        }, (
            {'col': 'ro', 'weight': 0.4},
            ('size_ro', 0.4, None),
            ('offset_prop', 0.2, (0.0, 2.0, True)),
        ),
    ),
    priority='MAX'
)

Attributes

DEFAULT_WEIGHT_STRATEGY

Default strategy for computing weights.

Classes

ElementPriority

Prioritize multiple matches.

WeightColumn

An element representing the weight for one column.

WeightElement

Represents one strategy for computing weights.

WeightStrategy

Represents one strategy for computing weights.

Module Contents

class agglovar.pairwise.weights.ElementPriority(*args, **kwds)

Bases: enum.Enum

Prioritize multiple matches.

When multiple match elements are used (each with their own columns and weights), then this enum describes which weight should be chosen.

When MAX is used, take the maximum weight of all the match elements.

When FIRST is used, take the first non-null weight of all the match elements.

Null values appear in the weight column if it is computed on null values in the columns or a whole column in the weight calculation is missing. By default, null is carried through to the weight element, but this can be changed by the columns or the weight element itself.

FIRST = 'FIRST'
MAX = 'MAX'
class agglovar.pairwise.weights.WeightColumn(col: str, weight: float, scale: float | tuple[float | None, float | None] | tuple[float | None, float | None, bool] | None = None, missing: float | None = None)

An element representing the weight for one column.

Variables:
  • col – Column name

  • weight – Weight of the column

  • max_value – Maximum value of the column (if None, no max value is used).

  • missing – Use this value if the column is missing or any columns contain a null value.

__eq__(other: object) bool

Determine if this weight column is equal to another.

__repr__() str

Return a string representation of this weight column.

col: str
property expr: polars.Expr

Return a Polars expression that computes this column’s weighted contribution.

invert: bool
max_value: float | None
min_value: float | None
missing: float | None
weight: float
class agglovar.pairwise.weights.WeightElement(*columns: WeightColumn | Mapping[str, Any] | Iterable[Any], missing: float | None = None)

Bases: collections.abc.Container[WeightColumn]

Represents one strategy for computing weights.

Variables:
  • columns – Columns and their weights.

  • missing – Use this value if the weighted columns sum to null.

__contains__(item: object) bool

Return True if a column is in this element.

__eq__(other: object) bool

Determine if this weight element is equal to another.

__repr__() str

Get a string representation of this weight element.

property cols: tuple[str, Ellipsis]

Get the columns used by this element.

columns: tuple[WeightColumn, Ellipsis]
property expr: polars.Expr

Get an expression for computing weights.

property max_weight: float

Get the maximum weight sum this element can generate.

property min_weight: float

Get the minimum weight sum this element can generate.

missing: float | None = None
class agglovar.pairwise.weights.WeightStrategy(*elements: Iterable[WeightElement] | Mapping[str, Any] | Iterable[Any], missing: float | None = 0.0, priority: ElementPriority | str | None = ElementPriority.MAX)

Bases: collections.abc.Container[WeightElement]

Represents one strategy for computing weights.

__contains__(item: object) bool

Determine if a weight element is in this strategy.

__eq__(other: object) bool

Determine if this weight strategy is equal to another.

__repr__() str

Get a string representation.

elements: tuple[WeightElement, Ellipsis]
property expr: polars.Expr

Get and expression for computing weights.

missing: float | None
priority: ElementPriority
agglovar.pairwise.weights.DEFAULT_WEIGHT_STRATEGY: WeightStrategy

Default strategy for computing weights.