agglovar.pairwise.overlap

Generate pairwise overlaps between two variant tables.

For a pair of variant tables (A and B), find all pairs of variants between A and B that meet a set of overlap parameters. Each pair of variants appears at most once, and a distinct variant in either table may have multiple matches.

The join is extremely flexible and customizable. A default set of parameters should be adequate for almost all tasks, and each stage of the join algorithm can be customized for more complex cases.

Input variant tables

Variants are input as Polars tables (polars.DataFrame or polars.LazyFrame) conforming to schema agglovar.schema.VARIANT.

Warning

To avoid extremely high CPU and memory usage, input tables should be sorted by “chrom”, “pos”, then “end”. Because joins are chunked by chromosome, the chromosome sort order will not affect performance if records for each chromosome are grouped and sorted. If numeric chromosome names are sorted numerically (not default or advised), the join is not impacted, although the join table will be ordered by chromosome lexicographically (i.e. “1” < “10” < “2”…).

Warning

Input tables are processed lazily, but must collect results periodically to avoid memory exhaustion. If lazy tables have been transformed since they were read into memory, those transformations will be repeated with each processed chunk. Consider collecting results if they fit into memory, and if not, write/sink to a temporary parquet file and join the re-scanned temporary files.

Note

For best performance, read sorted input lazily (i.e. “scan_parquet()”) or join tables in memory if they are small enough (i.e. “collected” tables). Pushdown operations on parquet files will avoid re-reading while files, and only required columns will be extracted. When reading from non-parquet files, read tables into memory and join them. Joining unsorted tables or tables scanned from non-parquet files will work, but will incur a significant performance penalty on large tables.

Note

No assumptions are made about the order of columns, they are referenced strictly by name. While “chrom”, “pos”, and “end” are typically the leading columns, it is not required.

Input tables are df_a and df_b where “a” is the left (first) table and “b” is the right (second) table. Tables are treated the same by built-in join rules, and reversing their order should produce equivalent results. If standard join parameters or custom parameters unexpectedly produce violate this condition, please report it using the GitHub issue page.

Joins take advantage of lazy evaluation and Polars’s query optimization. Large parquet tables can be read with polars.scan_parquet() and joined efficiently.

Warning

Reserved columns names start with “_” and may be replaced by the join process. Current reserved column names are in RESERVED_COLS, which may change with any update. If these are present in an input table, a warning is generated, and the column is replaced. To avoid unexpected behavior, or drop or rename columns with a leading “_” before joining (using a polars.LazyFrame will avoid duplicating the table).

Note

An input variant table can be checked for errors by calling PairwiseOverlap.prepare_tables() and catching ValueError. To check a single table, call this method with df_a and df_b set to the same object.

Output join tables

The resulting join table describes each pairwise overlaps, but does not contain whole variant records. Columns referring to a distinct input table have suffix “_a” or “_b” (e.g. “index_a” is the row index in df_a).

These tables can be filtered or merged with other pairwise join tables, then joined with the original variant tables by the row index (“index_a” and “index_b”) or by variant IDs (“id_a” and “id_b”) if the IDs are guaranteed to be unique in each callset.

Join columns:
  1. index_a: Row index in df_a.

  2. index_b: Row index in df_b

  3. id_a: Variant ID in df_a.

  4. id_b: Variant ID in df_b.

  5. ro: Reciprocal overlap if variants overlap (0.0 if no overlap).

  6. size_ro: Size reciprocal overlap (maximum RO if variants were shifted to maximally overlap).

  7. offset_dist: The maximum of the start position distance and end position distance.

  8. offset_prop: Offset / variant length. Variant length is the minimum length of the two variants in the record.

  9. match_prop: Alignment match score divided by the maximum alignment match score

Row indexes start at 1 and increment by 1 for each row in the input table regardless of if the row is part of a join or not. This is consistent with the behavior of polars.DataFrame.with_row_index() and can be used for joining with the original variant table.

The ID columns may also be used for joining with the input variant tables if they are unique in each callset, although this cannot be guaranteed by Agglovar. If the “id” column is not in the input tables, these values are null.

Additional join columns may be defined, see Customizing the join process below for details.

Join parameters

A set of join parameters are defined in the PairwiseOverlap class. These parameters should cover most joins, although they can be replaced or augmented with custom expressions (see Customizing the join process below for details).

Warning

Without join parameters, a cross-join is performed without warning (all combinations of variants in both tables). This may results in a very large join table and very poor performance.

Minimum reciprocal overlap (ro_min)

ro_min: The minimum reciprocal-overlap (RO) between variants.

RO is defined as the number of bases overlapping in reference coordinates divided by the maximum length of the two variants. Acceptable values are in the range [0.0, 1.0] where “0.0” means any match, and “1.0” forces the start position and variant size to match exactly.

While this can be computed using “pos” and “end” for most variant types, the end position is defined as pos + varlen to support insertions. Setting force_end_ro to True in PairwiseOverlap will force the existing end position to be used, but it will break insertion overlaps. Unless “pos + varlen != end” for some valid reason, force_end_ro should never be used.

Size reciprocal overlap (size_ro_min)

size_ro_min: Minimum size-RO. Size-RO is defined as the minimum “varlen” divided by the maximum “varlen” of two variants. Size-RO is similar to RO if one variant was shifted to maximally overlap the other (i.e. shift to the same start position). It does not reqire that the variant overlap in reference space.

Warning

Using Size-RO without criteria limiting the distance of variants, such as “ro” or “offset_max” will result in a large number of joins and may dramatically impact performance.

Maximum offset (offset_max)

offset_max: Maximum offset allowed (minimum of start or end position distance).

The offset between two variants is defined as the minimum of the start or end position distance between the two variants.

offset_distance = max(
    abs(var_a.pos - var_b.pos),
    abs(var_a.end - var_b.end)
)

Maximum offset proportion (offset_prop_max)

offset_prop_max: Maximum offset proportion allowed.

The offset proportion is defined as the offset distance divided by the mimum length of the two variants being compared.

Match reference base (match_ref)

match_ref: Match reference base (“ref” column). Only defined for SNVs.

Setting match_ref forces the reference base for SNVs to match. If SNV matches allow for an offset distance, setting this parameter is advised to avoid nonsensical matches (i.e. an A->C SNV should not match a G->T SNV). If offset_max is 0, this parameter should not have an effect unless the “ref” column is malformed.

Match alternate base (match_alt)

match_alt: Match alternate base (“alt” column). Only defined for SNVs.

Setting match_alt forces alternate bases for SNVs to match and is advised for all SNV matching.

Minimum sequence match proportion (match_prop_min)

match_prop_min: Minimum sequence match proportion.

When variant overlaps use only size and position, it is difficult to control false-positive and false-negative matches while permitting normal genetic variation. Setting match_prop_min will force a match threshold based on variant sequence similarity. Using permissive parameters on size and position to reduce false negatives with a strict match proportion to reduce false positives is one powerful join strategy.

Match proportion uses an alignment scores between two variant sequences. A score model (match_score_model parameter of PairwiseOverlap) defines how to score alignments based on matches, mismatches, and gaps. The match proportion is defined as the alignment score between two sequence divided by the maximum alignment score that could be obtained if every base matched.

match_prop = alignment_score / (match_score * min(var_a.varlen, var_b.varlen))

Where match_score is the value added for each matching base in the alignment (ScoreModel.match()).

Using a match proportion avoids using direct alignment scores, which are not intuitive and are significantly affected by sequence lengths how alignments are scored.

When variants are shifted in one sample relative to another, the variant sequence is “rotated” making direct sequence comparisons difficult. However, Agglovar corrects for this sequence representation. See Small polymorphisms are a source of ancestral bias in structural variant breakpoint placement for more details about why this occurs.

Both Agglovar and Truvari have a sequence match criteria, but they are different and both have merits. While Agglovar compares only variant sequences even if they are shifted, Truvari includes the shift in the sequence match (i.e. the compared sequences include the variant and flanking reference bases).

Reasonable join parameters

While there are no “optimal” join parameters, these are parameters we have found to work well.

SVs (INS, DEL, INV, DUP >= 50 bp):

Parameter

Value

ro_min

0.5

match_prop_min

0.8

indels (INS, DEL < 50 bp):

Parameter

Value

size_ro_min

0.8

offset_max

200

match_prop_min

0.8

SNVs:SVs (INS, DEL, INV, DUP >= 50 bp):

Parameter

Value

offset_max

0

match_ref

True

Join process

The initial join stage is performed using the following logic:

(
    df_a
    .join_where(
        df_b,
        *join_predicates
    )
    .select(*join_cols)
    .filter(*join_filters)
    .sort(['index_a', 'index_b'])
)
This expression contains several key components:
  • join_predicates: A list of predicates to applied during the join.

  • join_cols: A list of columns to select from the joined table.

  • join_filters: A list of filters to apply to the join.

Join predicates (join_predicates)are expressions on df_a and df_b that determine which rows are joined. At this stage, column names will have “_a” or “_b” suffixes to indicate their origin.

Join columns (join_cols) is a list of expressions that generate the final join table. These expressions may be a single column, such as id_a, or an expression that generates a new column, such as “ro” (derived from existing position and end columns).

Join filters (join_filters) are expressions that filter the joined table, and they operate on columns created by Join columns.

More on the join strategy

With lazy evaluation and Polars query optimization, expressions may be used at multiple stages. For example, if a minimum “ro” (reciprocal overlap) is set, then the expression that calculates “ro” is used as a join predicate to limit the size of joins and again as a join column.

Alternatively, the expression generating “ro” could be used once as a join column, then the “ro” result could be used as a filter at the filter stage, however, applying the filter at the predicate stage may eliminate pairs of variants before they are processed through remaining stages.

This strategy also allows additional flexibility in the join process, for example, a filter can be enforced at the predicate stage and never appear as a column in the final table.

Note that currently, there is not a “drop” stage following the filter stage, so filters that do not use join columns must be applied as join predicates.

Customizing the join process

Most join operations can be defined simply by setting parameters (see Join parameters).

Join filters operate on the join columns and provide further filtering. These expressions are set by the join parameters passed to the PairwiseOverlap constructor. Additional expressions can also be added. These may affect join predicates, columns, or filters.

This is a powerful feature, and it should be used with caution. Use join parameters whenever possible, and augment them with additional expressions only when necessary. It is possible to use both, and the join parameters will come first.

When a PairwiseOverlap instance is created, additional expressions can be added to it until either PairwiseOverlap.lock() is called or a join is performed (which locks the object from further modifications).

Three methods can insert expressions:

  • PairwiseOverlap.append_join_predicates()

  • PairwiseOverlap.append_join_cols()

  • PairwiseOverlap.append_join_filters()

Each of these takes a single expression or an iterable object of expressions. Note that the output join table columns are affected by appended columns. This can be used to add additional statistics to the output table or to retain columns from the original tables (consider joining the output table with the original tables to retrieve columns in this case).

Expected columns (PairwiseOverlap.expected_cols) is updated automatically with each new expression and cannot be modified directly.

Join Class Inspection

A PairwiseOverlap instance has properties for inspecting the join process. These properties are set by join parameters and customized expressions.

Join inspection properties:
  • join_predicates: A list of join predicates expressions.

  • join_cols: A list of join column expressions.

  • join_filters: A list of filter expressions.

  • expected_cols: A list of column names expected in df_a and df_b.

The values in PairwiseOverlap.expected_cols() are automatically derived from expressions in PairwiseOverlap.join_predicates() and PairwiseOverlap.join_cols(). These columns must be either present in df_a and df_b, or they must be auto-generated columns (see AUTOGEN_COLS).

Note

These properties return copies of the instance’s internal lists. Modifying them will not affect the joins. See Customizing the join process below for information on altering the join process for complex use cases.

Attributes

AUTOGEN_COLS

Columns automatically generated by the join process for input tables if they are not present.

DEFAULT_CHUNK_SIZE

Default number of df_a variants per chunk. Chunking controls peak memory and determines the batch

DEFAULT_JOIN_COLS

List of the default join columns.

EXPR_OFFSET_DIST

Expression for computing offset distance.

EXPR_OFFSET_PROP

Expression for computing the offset proportion.

EXPR_OVERLAP_RO

Expression for overlap reciprocal overlap (RO).

EXPR_SZRO

Expression for computing reciprocal size overlap (SZRO).

INVARIANT_JOIN_COLS

Columns that appear in all joins regardless of parameters.

JOIN_COL_EXPR

Known join columns.

RESERVED_COLS

Reserved columns are added automatically to input tables.

__all__

Classes

PairwiseOverlap

Pairwise overlap class.

PairwiseOverlapStage

One join strategy implemented by overlap.

Package Contents

class agglovar.pairwise.overlap.PairwiseOverlap(stages: collections.abc.Iterable[agglovar.pairwise.overlap._stage.PairwiseOverlapStage], join_cols: collections.abc.Iterable[str | polars.Expr] | None = None, drop_default_join_cols: bool = False, match_score_model: agglovar.seqmatch.MatchScoreModel | None = None, weight_strategy: agglovar.pairwise.weights.WeightStrategy = DEFAULT_WEIGHT_STRATEGY, force_end_ro: bool = False, chunk_size: int | None = None, n_threads: int | None = None)

Bases: agglovar.pairwise.base.PairwiseJoin

Pairwise overlap class.

Join by overlapping variants by position.

Variables:
  • match_score_model – (Advanced) Configured model for scoring similarity between pairs of sequences. If None and match_prop_min is set, then a default aligner will be used.

  • force_end_ro – (Advanced) By default, reciprocal overlap is calculated with the end position set to the start position plus the variant length. For all variants except insertions, this will typically match the end value in the source DataFrame. If True, the end position in the DataFrame is also used for reciprocal overlap without changes. Typically, this option should not be used and will break reciprocal overlap for insertions.

  • chunk_size – (Advanced) Chunk df_a into partitions of this size, and for each chunk, subset df_b to include only variants that may overlap with variants in the chunk. If None, each chromosome is a single chunk, which will lead to a combinatorial explosion unless offset_max is greater than 0.

classmethod from_definition(overlap_def: collections.abc.Mapping[str, Any] | collections.abc.Iterable[collections.abc.Mapping[str, Any]])

Build a PairwiseOverlap instance from a stage definition mapping or sequence of mappings.

join_iter(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) collections.abc.Iterator[polars.LazyFrame]

Find all pairs of variants in two sources that meet a set of criteria.

Parameters:
  • df_a – Source dataframe.

  • df_b – Target dataframe.

  • retain_index – If True, do not drop an existing “_index” column in callset tables if they exist.

  • temp_dir – How to materialise the prepared tables before the chunked loop. False (default) collects both into memory; True writes them to the system temp directory as parquet files; a str/Path writes them to that directory. Temp files are always removed on exit.

Yields:

A LazyFrame for each chunk.

chunk_size: int | None
compute_seg_ro: bool
compute_weight: bool
equi_join_exprs: tuple[polars.Expr, Ellipsis]
expected_cols: frozenset[str]
force_end_ro: bool
property has_equi_join: bool

True if all stages have an equivalent equi-join expression.

property has_match: bool

True if any stage performs sequence matching.

property has_seg_ro: bool

True if any stage computes segment reciprocal overlap.

property is_equi_offset: bool

True if all stages have an equivalent equi-join expression for position.

join_col_exprs: tuple[polars.Expr, Ellipsis]
join_cols: tuple[polars.Expr, Ellipsis]
property match_prop_expr

Polars expression that computes the per-pair match proportion.

match_score_model: agglovar.seqmatch.MatchScoreModel
n_threads: int | None
property required_cols: set[str]

The minimum set of columns that must be present in input tables.

This is set based on parameters needed to perform the join. For example, if sequence matching is required, then “seq” will be in this list, and if “seq” does not exist in both df_a and df_b, then an error is raised.

property reserved_cols: set[str]

Columns reserved for internal use during the join.

stages: tuple[agglovar.pairwise.overlap._stage.PairwiseOverlapStage, Ellipsis]
property weight_expr

Polars expression for the configured weight strategy.

class agglovar.pairwise.overlap.PairwiseOverlapStage(*, ro_min: float | None = None, size_ro_min: float | None = None, offset_max: int | None = None, offset_prop_max: float | None = None, seg_ro_min: float | None = None, match_ref: bool = False, match_alt: bool = False, match_vartype: bool = False, match_filter_pass: bool = False, match_prop_min: float | None = None, join_predicates: collections.abc.Iterable[polars.Expr] | None = None, join_filters: collections.abc.Iterable[polars.Expr] | None = None)

One join strategy implemented by overlap.

Variables:
  • ro_min – Minimum reciprocal overlap for allowed matches. If 0.0, then any overlap matches.

  • size_ro_min – Reciprocal length proportion of allowed matches. If match_prop_min is set and the value of this parameter is None or is less than match_prop_min, then it is set to match_prop_min since this value represents the lower-bound of allowed match proportions.

  • offset_max – Maximum offset allowed (minimum of start or end position distance).

  • offset_prop_max – Maximum size-offset (offset / varlen) allowed.

  • seg_ro_min – Minimum reciprocal overlap by segments for merging complex variants. Compares where segments are aligned for each complex variant and computes a reciprocal-overlap for aligned segments between two variants. Unaligned bases overlap by default (i.e. minimum total unaligned bases is in the numerator of the overlap calculation).

  • match_ref – “REF” column must match between two variants.

  • match_alt – “ALT” column must match between two variants.

  • match_vartype – “VARTYPE” column must match between two variants.

  • match_filter_pass – If true, then matched variants must either pass filters (“filter” column emtpy for both variants) or must fail filters (“filter” column non-empty for both variants). This does not compare the contents of the “filter” column, just the presence or absence of a filter.

  • match_prop_min – Minimum matched base proportion in alignment or None to not match.

__repr__() str

Return a string representation of the stage configuration.

chunk_range: tuple[tuple[str, str, tuple[polars.Expr, Ellipsis]], Ellipsis]
property has_match: bool

True if this stage requires sequence-match scoring.

property has_seg_ro: bool

True if this stage uses segment reciprocal-overlap matching.

join_filters: tuple[polars.Expr, Ellipsis]
join_predicates: tuple[polars.Expr, Ellipsis]
match_alt: bool
match_filter_pass: bool
match_prop_min: float | None
match_ref: bool
match_vartype: bool
offset_max: int | None
offset_prop_max: float | None
pre_select_filters: tuple[polars.Expr, Ellipsis]
ro_min: float | None
seg_ro_min: float | None
size_ro_min: float | None
agglovar.pairwise.overlap.AUTOGEN_COLS: frozenset[str]

Columns automatically generated by the join process for input tables if they are not present.

agglovar.pairwise.overlap.DEFAULT_CHUNK_SIZE: int = 1500

Default number of df_a variants per chunk. Chunking controls peak memory and determines the batch size presented to the sequence-alignment thread pool — larger values amortize thread-pool overhead better but use more memory. Tune alongside n_threads based on available RAM and CPU count.

agglovar.pairwise.overlap.DEFAULT_JOIN_COLS: list[str]

List of the default join columns.

agglovar.pairwise.overlap.EXPR_OFFSET_DIST

Expression for computing offset distance.

agglovar.pairwise.overlap.EXPR_OFFSET_PROP

Expression for computing the offset proportion.

agglovar.pairwise.overlap.EXPR_OVERLAP_RO

Expression for overlap reciprocal overlap (RO).

agglovar.pairwise.overlap.EXPR_SZRO

Expression for computing reciprocal size overlap (SZRO).

agglovar.pairwise.overlap.INVARIANT_JOIN_COLS: list[str] = ['index_a', 'index_b']

Columns that appear in all joins regardless of parameters.

agglovar.pairwise.overlap.JOIN_COL_EXPR: dict[str, polars.Expr | collections.abc.Callable[[Any], polars.Expr]]

Known join columns.

agglovar.pairwise.overlap.RESERVED_COLS: frozenset[str]

Reserved columns are added automatically to input tables.

agglovar.pairwise.overlap.__all__