agglovar.pairwise.weights
=========================

.. py:module:: agglovar.pairwise.weights

.. autoapi-nested-parse::

   Weights for pairwise joins.

   Provides a mechanism for weighting and scoring pairwise joins. A weight assigned to each pair of matched variants can help to prioritize the matches. For example, if the variant with ID "A1" matches "B1" with a weight of 0.5 and "B2" with a weight of 0.75, then the "B2" match might be chosen over the "B1" match for variant "A1".

   Terminology:

   * Weight column: One column used to compute weights, each with its own weight.
   * Weight element: Computes a weight for each row by multiplying each column by its assigned weight and summing.
   * Weight strategy: One or more weight elements, each computing its own weight.

   Each weight element extracts a set of columns, multiplies each by a specified weight, and sums the results, producing one weight per row of the join table. For example, one weight strategy might have these column weights:

   ======  ======
   Column  Weight
   ======  ======
   ro      0.5
   match   0.5
   ======  ======

   In this case, the weight for each row is computed as (0.5 * ro) + (0.5 * match). Both "ro" and "match" have a maximum value of 1, so the total weighted sum is between 0.0 and 1.0.

   The column attributes are:

   * col: Column name.
   * weight: Column weight.
   * scale: Column scale.
   * missing: Missing value.

   The column scale is used to rescale the column to a range between 0.0 and 1.0. It can take several forms:

   * float: Maximum value. Minimum value is 0.0.
   * (float, float): Minimum and maximum values.
   * (float, float, bool): Minimum and maximum values, and a flag for inverting the range after scaling.

   Values are first clipped to within the [min, max] range, then scaled (i.e. (value - min) / (max - min)). If the range is inverted, then the scaled value is inverted (i.e. 1 - scaled_value). Inverting values is only supported if both min and max are defined. Finally, the scaled value is multiplied by the column weight.

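   The clip, scale, invert, and weight steps above can be sketched in plain Python. This is a minimal illustration rather than Agglovar's implementation: ``scaled_contribution`` is a hypothetical helper, and it assumes both the minimum and maximum are defined.

   .. code-block:: python

      def scaled_contribution(value, weight, min_value=0.0, max_value=1.0, invert=False):
          """Hypothetical helper: weighted contribution of one scaled column value."""
          # Clip the value into the [min_value, max_value] range.
          clipped = min(max(value, min_value), max_value)
          # Rescale to [0.0, 1.0].
          scaled = (clipped - min_value) / (max_value - min_value)
          # Optionally invert so that small raw values score high.
          if invert:
              scaled = 1.0 - scaled
          # Apply the column weight last.
          return weight * scaled

      # "offset_prop" of 0.5 with scale (0.0, 2.0, True) and weight 0.1:
      # scaled to 0.25, inverted to 0.75, weighted to about 0.075.
      contribution = scaled_contribution(0.5, 0.1, 0.0, 2.0, invert=True)

   Under these assumptions, values outside the range saturate at 0.0 or at the full column weight, so a single extreme value cannot dominate the weighted sum.
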
   For columns that are not scaled, no assumptions are made about their range; they are left as-is and multiplied by the weight. No assumptions are made about weights either; they could be negative or greater than 1. A good design should typically keep weights within [0.0, 1.0], although Agglovar allows designers to make choices about weights the developers might not have anticipated. Use this power with care.

   Agglovar allows flexible weight strategies where multiple weight elements are used. For example, it might make sense to compute another set of weights on "size_ro" and "offset_prop" for smaller variants.

   Any row with a null value in a weight column causes the weighted sum to become null. By default, this causes that weight element to be skipped. For example, assume that "match" might not have been computed, so the column is null for all rows. A weight strategy can use multiple elements, one with a "match" weight and one without. In this case, the weight element computed on "match" will be ignored and the weight will automatically be computed from the other weight elements.

   When multiple weight elements are used in a strategy, the maximum computed non-null weight is used. If all weights are null for a row, then it gets 0.0. This strategy can be changed to accept the first non-null weight instead of the maximum.

   Null behavior can also be set per column and per weight element by specifying a value for "missing", which fills in missing values. For example, use "missing=0.0" if null values in a column should not cause the whole weight element to be discarded. The weight strategy also has a missing value set to 0.0 by default, so if all weights compute null, the result is 0.0. Set missing=None for the strategy to produce nulls for rows where there were no computable weights.

   A weight strategy can be specified in a number of ways. Each level of the strategy has an object, :class:`WeightColumn`, :class:`WeightElement`, and :class:`WeightStrategy`.

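   The null handling and element priority rules described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not Agglovar's implementation: ``element_weight`` and ``strategy_weight`` are hypothetical helpers, columns are unscaled ``(name, weight)`` pairs, and per-column "missing" values are left out.

   .. code-block:: python

      def element_weight(row, columns, missing=None):
          """Hypothetical helper: weighted sum for one element, or `missing` if any value is null."""
          total = 0.0
          for col, weight in columns:
              value = row.get(col)
              if value is None:
                  # A single null makes the whole weighted sum null.
                  return missing
              total += weight * value
          return total

      def strategy_weight(row, elements, priority="MAX", missing=0.0):
          """Hypothetical helper: combine element weights by MAX or FIRST priority."""
          weights = [element_weight(row, columns) for columns in elements]
          usable = [w for w in weights if w is not None]
          if not usable:
              # No element produced a weight for this row.
              return missing
          return max(usable) if priority == "MAX" else usable[0]

      elements = (
          (("ro", 0.5), ("match", 0.5)),
          (("ro", 0.5), ("size_ro", 0.5)),
      )
      row = {"ro": 0.8, "size_ro": 0.5, "match": None}

      # The first element is skipped because "match" is null,
      # so the weight comes from the second element (about 0.65).
      weight = strategy_weight(row, elements)

   In this sketch, a row where every element is null gets the strategy-level "missing" value, which mirrors the default of 0.0 described above.
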
   Each of these classes can take an iterable (usually a tuple or a list) or a dictionary, which is fed to the class constructor. For example, a weight column could be specified as "('ro', 0.5)" or "{'col': 'ro', 'weight': 0.5}". A weight element could be specified as a list of columns, such as "(('ro', 0.2), ('match', 0.5))", or as a dictionary of constructor arguments, which would be required to set "missing", such as "{'columns': (('ro', 0.2), ('match', 0.5)), 'missing': 0.0}".

   An example of constructing a strategy using tuples:

   .. code-block:: python

      from agglovar.pairwise.weights import WeightStrategy

      WeightStrategy(
          *(
              # First element: uses "match_prop" when it is available.
              (
                  ('ro', 0.2),
                  ('size_ro', 0.2),
                  ('offset_prop', 0.1, (0.0, 2.0, True)),
                  ('match_prop', 0.5),
              ),
              # Second element: no "match_prop" column.
              (
                  ('ro', 0.4, None),
                  ('size_ro', 0.4, None),
                  ('offset_prop', 0.2, (0.0, 2.0, True)),
              ),
          )
      )

   And an equivalent form using a mix of dicts and tuples, but with some "missing" values set for illustration purposes:

   .. code-block:: python

      from agglovar.pairwise.weights import WeightStrategy

      WeightStrategy(
          *(
              # A dict is required here in order to set "missing" on the element.
              {
                  'columns': (
                      ('ro', 0.2),
                      ('size_ro', 0.2),
                      ('offset_prop', 0.1, (0.0, 2.0, True)),
                      ('match_prop', 0.5),
                  ),
                  'missing': 0.0,
              },
              # Columns may be given as dicts or tuples within one element.
              (
                  {'col': 'ro', 'weight': 0.4},
                  ('size_ro', 0.4, None),
                  ('offset_prop', 0.2, (0.0, 2.0, True)),
              ),
          ),
          priority='MAX'
      )


Attributes
----------

.. autoapisummary::

   agglovar.pairwise.weights.DEFAULT_WEIGHT_STRATEGY


Classes
-------

.. autoapisummary::

   agglovar.pairwise.weights.ElementPriority
   agglovar.pairwise.weights.WeightColumn
   agglovar.pairwise.weights.WeightElement
   agglovar.pairwise.weights.WeightStrategy


Module Contents
---------------

.. py:class:: ElementPriority(*args, **kwds)

   Bases: :py:obj:`enum.Enum`


   Prioritize multiple matches.

   When multiple weight elements are used (each with their own columns and weights), this enum describes which weight should be chosen. When MAX is used, take the maximum weight of all the weight elements. When FIRST is used, take the first non-null weight of all the weight elements.
   A null weight appears if it is computed on null values in the columns or if a whole column in the weight calculation is missing. By default, null is carried through to the weight element, but this can be changed by the columns or by the weight element itself.


   .. py:attribute:: FIRST
      :value: 'FIRST'


   .. py:attribute:: MAX
      :value: 'MAX'


.. py:class:: WeightColumn(col: str, weight: float, scale: Optional[float | tuple[Optional[float], Optional[float]] | tuple[Optional[float], Optional[float], bool]] = None, missing: Optional[float] = None)

   An element representing the weight for one column.

   :ivar col: Column name.
   :ivar weight: Weight of the column.
   :ivar max_value: Maximum value of the column (if None, no max value is used).
   :ivar missing: Use this value if the column is missing or contains a null value.


   .. py:method:: __eq__(other: object) -> bool

      Determine if this weight column is equal to another.


   .. py:method:: __repr__() -> str

      Return a string representation of this weight column.


   .. py:attribute:: col
      :type: str


   .. py:property:: expr
      :type: polars.Expr

      Return a Polars expression that computes this column's weighted contribution.


   .. py:attribute:: invert
      :type: bool


   .. py:attribute:: max_value
      :type: Optional[float]


   .. py:attribute:: min_value
      :type: Optional[float]


   .. py:attribute:: missing
      :type: Optional[float]


   .. py:attribute:: weight
      :type: float


.. py:class:: WeightElement(*columns: WeightColumn | Mapping[str, Any] | Iterable[Any], missing: Optional[float] = None)

   Bases: :py:obj:`collections.abc.Container`\ [\ :py:obj:`WeightColumn`\ ]


   Represents one strategy for computing weights.

   :ivar columns: Columns and their weights.
   :ivar missing: Use this value if the weighted columns sum to null.


   .. py:method:: __contains__(item: object) -> bool

      Return True if a column is in this element.


   .. py:method:: __eq__(other: object) -> bool

      Determine if this weight element is equal to another.


   .. py:method:: __repr__() -> str

      Get a string representation of this weight element.


   .. py:property:: cols
      :type: tuple[str, Ellipsis]

      Get the columns used by this element.


   .. py:attribute:: columns
      :type: tuple[WeightColumn, Ellipsis]


   .. py:property:: expr
      :type: polars.Expr

      Get an expression for computing weights.


   .. py:property:: max_weight
      :type: float

      Get the maximum weight sum this element can generate.


   .. py:property:: min_weight
      :type: float

      Get the minimum weight sum this element can generate.


   .. py:attribute:: missing
      :type: Optional[float]
      :value: None


.. py:class:: WeightStrategy(*elements: Iterable[WeightElement] | Mapping[str, Any] | Iterable[Any], missing: Optional[float] = 0.0, priority: Optional[ElementPriority | str] = ElementPriority.MAX)

   Bases: :py:obj:`collections.abc.Container`\ [\ :py:obj:`WeightElement`\ ]


   Represents one strategy for computing weights.


   .. py:method:: __contains__(item: object) -> bool

      Determine if a weight element is in this strategy.


   .. py:method:: __eq__(other: object) -> bool

      Determine if this weight strategy is equal to another.


   .. py:method:: __repr__() -> str

      Get a string representation.


   .. py:attribute:: elements
      :type: tuple[WeightElement, Ellipsis]


   .. py:property:: expr
      :type: polars.Expr

      Get an expression for computing weights.


   .. py:attribute:: missing
      :type: Optional[float]


   .. py:attribute:: priority
      :type: ElementPriority


.. py:data:: DEFAULT_WEIGHT_STRATEGY
   :type: WeightStrategy

   Default strategy for computing weights.