agglovar.pairwise.weights
Weights for pairwise joins.
Provides a mechanism for weighting and scoring pairwise joins. A weight assigned to each pair of matched variants can help to prioritize the matches. For example, if a variant with ID “A1” matches “B1” with a weight of 0.5 and “B2” with a weight of 0.75, then the “B2” match might be chosen over the “B1” match for variant “A1”.
Terminology:
- Weight column: A single column used to compute a weight, with its own assigned weight.
- Weight element: Computes a weight for each row by multiplying each of its columns by that column's assigned weight and summing the results.
- Weight strategy: One or more weight elements, each computing its own weight.
Each weight element extracts a set of columns, multiplies each by a specified weight, and sums the results producing one weight per row of the join table.
For example, one weight strategy might have these column weights:
| Column | Weight |
|---|---|
| ro | 0.5 |
| match | 0.5 |
In this case, the weight for each row is computed as (0.5 * ro) + (0.5 * match). Both “ro” and “match” have a maximum value of 1, so the total weighted sum is between 0.0 and 1.0.
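The arithmetic above can be sketched in plain Python. This is only an illustration of the weighted sum, not Agglovar's implementation (which exposes Polars expressions); the function name `weighted_sum` and the sample row are assumptions made for the example.

```python
def weighted_sum(row, column_weights):
    """Multiply each column value by its weight and sum the results."""
    return sum(weight * row[col] for col, weight in column_weights)


# Column weights from the table above.
column_weights = (("ro", 0.5), ("match", 0.5))

# Hypothetical row of a join table.
row = {"ro": 0.5, "match": 1.0}

score = weighted_sum(row, column_weights)  # (0.5 * 0.5) + (0.5 * 1.0) = 0.75
```

Because both columns have a maximum of 1 and the weights sum to 1.0, the result stays within 0.0 and 1.0.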
The column attributes are:
col: Column name.
weight: Column weight.
scale: Column scale.
missing: Missing value.
The column scale is used to rescale the column to a range between 0.0 and 1.0. It can take several forms:
- float: Maximum value. The minimum value is 0.0.
- (float, float): Minimum and maximum values.
- (float, float, bool): Minimum and maximum values, and a flag for inverting the range after scaling.
Values are first clipped to the [min, max] range, then scaled, i.e. (value - min) / (max - min). If the range is inverted, the scaled value is subtracted from 1, i.e. 1 - scaled_value. Inverting values is only supported if both min and max are defined. Finally, the scaled value is multiplied by the column weight. No assumptions are made about the range of columns that are not scaled; they are left as-is and multiplied by the weight.
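The clip, scale, invert, and weight steps can be sketched as a small pure-Python function. This is a simplified illustration that assumes both min and max are defined whenever a scale is given; the name `scaled_contribution` is hypothetical and is not part of Agglovar.

```python
def scaled_contribution(value, weight, scale=None):
    """Return one column's weighted contribution after optional rescaling."""
    if scale is None:
        # Unscaled columns are left as-is and multiplied by the weight.
        return weight * value

    if isinstance(scale, (int, float)):
        # A single number is the maximum; the minimum is 0.0.
        lo, hi, invert = 0.0, float(scale), False
    elif len(scale) == 2:
        (lo, hi), invert = scale, False
    else:
        lo, hi, invert = scale

    clipped = min(max(value, lo), hi)    # clip to [lo, hi]
    scaled = (clipped - lo) / (hi - lo)  # rescale to [0.0, 1.0]
    if invert:
        scaled = 1.0 - scaled            # invert after scaling
    return weight * scaled


# 0.5 in [0.0, 2.0] scales to 0.25, inverts to 0.75, then 0.5 * 0.75 = 0.375.
contribution = scaled_contribution(0.5, 0.5, (0.0, 2.0, True))
```

With an inverted range, small input values (for example, a small offset) contribute more to the weight than large ones.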
No assumptions are made about weights; they may be negative or greater than 1. A good design should typically keep weights within [0.0, 1.0], although Agglovar allows designers to make choices about weights that the developers might not have anticipated. Use this power with care.
Agglovar allows flexible weight strategies where multiple weight elements are used. For example, it might make sense to compute another set of weights on “size_ro” and “offset_prop” for smaller variants.
Any row with a null value in a weight column causes the weighted sum to become null. By default, this causes that weight element to be skipped. For example, assume that “match” might not have been computed, so the column is null for all values. A weight strategy can use multiple elements, one with a “match” weight and one without. In this case, the weight element computed on “match” will be ignored and the weight will automatically be computed on other weight elements.
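A minimal pure-Python sketch of this null propagation, with `None` standing in for null. The function name `element_weight`, the column weights, and the sample row are assumptions chosen for illustration; this is not Agglovar's code.

```python
def element_weight(row, column_weights):
    """Weighted sum for one element, or None if any input column is null."""
    total = 0.0
    for col, weight in column_weights:
        value = row.get(col)  # a missing key is treated like a null value
        if value is None:
            return None       # one null input makes the whole sum null
        total += weight * value
    return total


# Two hypothetical elements: one using "match", one without it.
with_match = (("ro", 0.2), ("size_ro", 0.2), ("match", 0.6))
without_match = (("ro", 0.5), ("size_ro", 0.5))

row = {"ro": 1.0, "size_ro": 0.5, "match": None}  # "match" was not computed

weights = [element_weight(row, with_match), element_weight(row, without_match)]
# weights == [None, 0.75]
```

The first element yields null and can be skipped, leaving the second element's weight for the row.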
When multiple weight elements are used in a strategy, the maximum computed non-null weight is used. If all weights are null for a row, the row gets 0.0. This behavior can be changed to accept the first non-null weight instead of the maximum. Null behavior can also be set per column and per weight element by specifying a value for “missing”, which fills in missing values. For example, set “missing=0.0” if null values in a column should not cause the whole weight element to be discarded. The weight strategy also has a missing value, set to 0.0 by default, so if all weights compute to null, the result is 0.0. Set missing=None for the strategy to produce nulls for rows where there were no computable weights.
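How the per-element weights for one row might be combined can be sketched as follows. The function `combine` is a hypothetical stand-in written in plain Python; it only mirrors the MAX and FIRST behavior and the strategy-level “missing” value described above.

```python
def combine(element_weights, priority="MAX", missing=0.0):
    """Pick one weight from the per-element weights of a single row."""
    non_null = [w for w in element_weights if w is not None]
    if not non_null:
        return missing      # no computable weight for this row
    if priority == "MAX":
        return max(non_null)
    if priority == "FIRST":
        return non_null[0]  # first non-null weight, in element order
    raise ValueError(f"Unknown priority: {priority!r}")


best = combine([None, 0.6, 0.9])                     # 0.9
first = combine([None, 0.6, 0.9], priority="FIRST")  # 0.6
filled = combine([None, None])                       # 0.0 (default missing)
kept_null = combine([None, None], missing=None)      # None
```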
A weight strategy can be specified in a number of ways. Each level of the strategy has its own class: WeightColumn, WeightElement, and WeightStrategy. Each one can be given as an iterable (usually a tuple or a list) or a dictionary, which is passed to the class constructor.
For example, a weight column could be specified as “('ro', 0.5)” or “{'col': 'ro', 'weight': 0.5}”.
A weight element could be specified as a sequence of columns, such as “(('ro', 0.2), ('match', 0.5))”, or as a dictionary of constructor arguments, which is required to set “missing”, such as “{'columns': (('ro', 0.2), ('match', 0.5)), 'missing': 0.0}”.
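One plausible way for tuple and dictionary forms to reach the same constructor is positional versus keyword unpacking. The sketch below uses a hypothetical stand-in dataclass, `ColumnSpec`, with the same argument order as WeightColumn; the helper `as_column` is likewise an assumption and does not describe Agglovar's internals.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in mirroring WeightColumn's constructor arguments."""
    col: str
    weight: float
    scale: Any = None
    missing: Optional[float] = None


def as_column(spec):
    """Build a ColumnSpec from an existing object, a mapping, or an iterable."""
    if isinstance(spec, ColumnSpec):
        return spec
    if isinstance(spec, Mapping):
        return ColumnSpec(**spec)  # dictionary keys become keyword arguments
    return ColumnSpec(*spec)       # tuple or list items become positional arguments


from_tuple = as_column(("ro", 0.5))
from_dict = as_column({"col": "ro", "weight": 0.5})
# Both forms describe the same column.
```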
An example for constructing using tuples:
```python
from agglovar.pairwise.weights import WeightStrategy

WeightStrategy(
    *(
        (
            ('ro', 0.2),
            ('size_ro', 0.2),
            ('offset_prop', 0.1, (0.0, 2.0, True)),
            ('match_prop', 0.5),
        ), (
            ('ro', 0.4, None),
            ('size_ro', 0.4, None),
            ('offset_prop', 0.2, (0.0, 2.0, True)),
        ),
    )
)
```
And an equivalent form using a mix of dicts and tuples, but with some “missing” values set for illustration purposes:
```python
from agglovar.pairwise.weights import WeightStrategy

WeightStrategy(
    *(
        {
            'columns': (
                ('ro', 0.2),
                ('size_ro', 0.2),
                ('offset_prop', 0.1, (0.0, 2.0, True)),
                ('match_prop', 0.5),
            ),
            'missing': 0.0,
        }, (
            {'col': 'ro', 'weight': 0.4},
            ('size_ro', 0.4, None),
            ('offset_prop', 0.2, (0.0, 2.0, True)),
        ),
    ),
    priority='MAX'
)
```
Attributes
| Name | Description |
|---|---|
| DEFAULT_WEIGHT_STRATEGY | Default strategy for computing weights. |
Classes
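| Name | Description |
|---|---|
| ElementPriority | Prioritize multiple matches. |
| WeightColumn | An element representing the weight for one column. |
| WeightElement | Represents one strategy for computing weights. |
| WeightStrategy | Represents one strategy for computing weights. |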
Module Contents
- class agglovar.pairwise.weights.ElementPriority(*args, **kwds)
Bases: enum.Enum
Prioritize multiple matches.
When multiple weight elements are used (each with its own columns and weights), this enum describes which weight should be chosen.
When MAX is used, take the maximum weight of all the weight elements.
When FIRST is used, take the first non-null weight of all the weight elements.
Null values appear in the weight column if it is computed on null values in the columns or a whole column in the weight calculation is missing. By default, null is carried through to the weight element, but this can be changed by the columns or the weight element itself.
- FIRST = 'FIRST'
- MAX = 'MAX'
- class agglovar.pairwise.weights.WeightColumn(col: str, weight: float, scale: float | tuple[float | None, float | None] | tuple[float | None, float | None, bool] | None = None, missing: float | None = None)
An element representing the weight for one column.
- Variables:
col – Column name
weight – Weight of the column
scale – Column scale (if None, the column is not rescaled).
missing – Use this value if the column is missing or contains a null value.
- property expr: polars.Expr
Return a Polars expression that computes this column’s weighted contribution.
- class agglovar.pairwise.weights.WeightElement(*columns: WeightColumn | Mapping[str, Any] | Iterable[Any], missing: float | None = None)
Bases: collections.abc.Container[WeightColumn]
Represents one strategy for computing weights.
- Variables:
columns – Columns and their weights.
missing – Use this value if the weighted columns sum to null.
- columns: tuple[WeightColumn, Ellipsis]
- property expr: polars.Expr
Get an expression for computing weights.
- class agglovar.pairwise.weights.WeightStrategy(*elements: Iterable[WeightElement] | Mapping[str, Any] | Iterable[Any], missing: float | None = 0.0, priority: ElementPriority | str | None = ElementPriority.MAX)
Bases: collections.abc.Container[WeightElement]
Represents one strategy for computing weights.
- elements: tuple[WeightElement, Ellipsis]
- property expr: polars.Expr
Get an expression for computing weights.
- priority: ElementPriority
- agglovar.pairwise.weights.DEFAULT_WEIGHT_STRATEGY: WeightStrategy
Default strategy for computing weights.