agglovar.pairwise.overlap
Generate pairwise overlaps between two variant tables.
For a pair of variant tables (A and B), find all pairs of variants between A and B that meet a set of overlap parameters. Each pair of variants appears at most once, and a distinct variant in either table may have multiple matches.
The join is extremely flexible and customizable. A default set of parameters should be adequate for almost all tasks, and each stage of the join algorithm can be customized for more complex cases.
Input variant tables
Variants are input as Polars tables (polars.DataFrame or polars.LazyFrame)
conforming to schema agglovar.schema.VARIANT.
Warning
To avoid extremely high CPU and memory usage, input tables should be sorted by “chrom”, “pos”, then “end”. Because joins are chunked by chromosome, the chromosome sort order will not affect performance if records for each chromosome are grouped and sorted. If numeric chromosome names are sorted numerically (not default or advised), the join is not impacted, although the join table will be ordered by chromosome lexicographically (i.e. “1” < “10” < “2”…).
Warning
Input tables are processed lazily, but must collect results periodically to avoid memory exhaustion. If lazy tables have been transformed since they were read into memory, those transformations will be repeated with each processed chunk. Consider collecting results if they fit into memory, and if not, write/sink to a temporary parquet file and join the re-scanned temporary files.
Note
For best performance, read sorted input lazily (i.e. “scan_parquet()”) or join tables in memory if they are small enough (i.e. “collected” tables). Pushdown operations on parquet files will avoid re-reading while files, and only required columns will be extracted. When reading from non-parquet files, read tables into memory and join them. Joining unsorted tables or tables scanned from non-parquet files will work, but will incur a significant performance penalty on large tables.
Note
No assumptions are made about the order of columns, they are referenced strictly by name. While “chrom”, “pos”, and “end” are typically the leading columns, it is not required.
Input tables are df_a and df_b where “a” is the left (first) table and “b” is the right
(second) table. Tables are treated the same by built-in join rules, and reversing their order
should produce equivalent results. If standard join parameters or custom parameters
unexpectedly produce violate this condition, please report it using the
GitHub issue page.
Joins take advantage of lazy evaluation and Polars’s query optimization. Large parquet tables can
be read with polars.scan_parquet() and joined efficiently.
Warning
Reserved columns names start with “_” and may be replaced by the join process. Current reserved
column names are in RESERVED_COLS, which may change with any update. If these are
present in an input table, a warning is generated, and the column is replaced. To avoid
unexpected behavior, or drop or rename columns with a leading “_” before joining (using a
polars.LazyFrame will avoid duplicating the table).
Note
An input variant table can be checked for errors by calling
PairwiseOverlap.prepare_tables() and catching ValueError. To check a single
table, call this method with df_a and df_b set to the same object.
Output join tables
The resulting join table describes each pairwise overlaps, but does not contain whole variant
records. Columns referring to a distinct input table have suffix “_a” or “_b” (e.g. “index_a” is
the row index in df_a).
These tables can be filtered or merged with other pairwise join tables, then joined with the original variant tables by the row index (“index_a” and “index_b”) or by variant IDs (“id_a” and “id_b”) if the IDs are guaranteed to be unique in each callset.
- Join columns:
index_a: Row index in df_a.
index_b: Row index in df_b
id_a: Variant ID in df_a.
id_b: Variant ID in df_b.
ro: Reciprocal overlap if variants overlap (0.0 if no overlap).
size_ro: Size reciprocal overlap (maximum RO if variants were shifted to maximally overlap).
offset_dist: The maximum of the start position distance and end position distance.
offset_prop: Offset / variant length. Variant length is the minimum length of the two variants in the record.
match_prop: Alignment match score divided by the maximum alignment match score
Row indexes start at 1 and increment by 1 for each row in the input table regardless of if the row
is part of a join or not. This is consistent with the behavior of
polars.DataFrame.with_row_index() and can be used for joining with the original variant
table.
The ID columns may also be used for joining with the input variant tables if they are unique in each callset, although this cannot be guaranteed by Agglovar. If the “id” column is not in the input tables, these values are null.
Additional join columns may be defined, see Customizing the join process below for details.
Join parameters
A set of join parameters are defined in the PairwiseOverlap class. These parameters
should cover most joins, although they can be replaced or augmented with custom expressions
(see Customizing the join process below for details).
Warning
Without join parameters, a cross-join is performed without warning (all combinations of variants in both tables). This may results in a very large join table and very poor performance.
Minimum reciprocal overlap (ro_min)
ro_min: The minimum reciprocal-overlap (RO) between variants.
RO is defined as the number of bases overlapping in reference coordinates divided by the maximum length of the two variants. Acceptable values are in the range [0.0, 1.0] where “0.0” means any match, and “1.0” forces the start position and variant size to match exactly.
While this can be computed using “pos” and “end” for most variant types, the end position is
defined as pos + varlen to support insertions. Setting force_end_ro to True in
PairwiseOverlap will force the existing end position to be used, but it will break
insertion overlaps. Unless “pos + varlen != end” for some valid reason, force_end_ro should
never be used.
Size reciprocal overlap (size_ro_min)
size_ro_min: Minimum size-RO. Size-RO is defined as the minimum “varlen” divided by the maximum “varlen” of two variants. Size-RO is similar to RO if one variant was shifted to maximally overlap the other (i.e. shift to the same start position). It does not reqire that the variant overlap in reference space.
Warning
Using Size-RO without criteria limiting the distance of variants, such as “ro” or “offset_max” will result in a large number of joins and may dramatically impact performance.
Maximum offset (offset_max)
offset_max: Maximum offset allowed (minimum of start or end position distance).
The offset between two variants is defined as the minimum of the start or end position distance between the two variants.
offset_distance = max(
abs(var_a.pos - var_b.pos),
abs(var_a.end - var_b.end)
)
Maximum offset proportion (offset_prop_max)
offset_prop_max: Maximum offset proportion allowed.
The offset proportion is defined as the offset distance divided by the mimum length of the two variants being compared.
Match reference base (match_ref)
match_ref: Match reference base (“ref” column). Only defined for SNVs.
Setting match_ref forces the reference base for SNVs to match. If SNV matches allow for an offset distance, setting this parameter is advised to avoid nonsensical matches (i.e. an A->C SNV should not match a G->T SNV). If offset_max is 0, this parameter should not have an effect unless the “ref” column is malformed.
Match alternate base (match_alt)
match_alt: Match alternate base (“alt” column). Only defined for SNVs.
Setting match_alt forces alternate bases for SNVs to match and is advised for all SNV matching.
Minimum sequence match proportion (match_prop_min)
match_prop_min: Minimum sequence match proportion.
When variant overlaps use only size and position, it is difficult to control false-positive and false-negative matches while permitting normal genetic variation. Setting match_prop_min will force a match threshold based on variant sequence similarity. Using permissive parameters on size and position to reduce false negatives with a strict match proportion to reduce false positives is one powerful join strategy.
Match proportion uses an alignment scores between two variant sequences. A score model
(match_score_model parameter of PairwiseOverlap) defines how to score alignments
based on matches, mismatches, and gaps. The match proportion is defined as the alignment score
between two sequence divided by the maximum alignment score that could be obtained if every base
matched.
match_prop = alignment_score / (match_score * min(var_a.varlen, var_b.varlen))
Where match_score is the value added for each matching base in the alignment
(ScoreModel.match()).
Using a match proportion avoids using direct alignment scores, which are not intuitive and are significantly affected by sequence lengths how alignments are scored.
When variants are shifted in one sample relative to another, the variant sequence is “rotated” making direct sequence comparisons difficult. However, Agglovar corrects for this sequence representation. See Small polymorphisms are a source of ancestral bias in structural variant breakpoint placement for more details about why this occurs.
Both Agglovar and Truvari have a sequence match criteria, but they are different and both have merits. While Agglovar compares only variant sequences even if they are shifted, Truvari includes the shift in the sequence match (i.e. the compared sequences include the variant and flanking reference bases).
Reasonable join parameters
While there are no “optimal” join parameters, these are parameters we have found to work well.
SVs (INS, DEL, INV, DUP >= 50 bp):
Parameter |
Value |
|---|---|
ro_min |
0.5 |
match_prop_min |
0.8 |
indels (INS, DEL < 50 bp):
Parameter |
Value |
|---|---|
size_ro_min |
0.8 |
offset_max |
200 |
match_prop_min |
0.8 |
SNVs:SVs (INS, DEL, INV, DUP >= 50 bp):
Parameter |
Value |
|---|---|
offset_max |
0 |
match_ref |
True |
Join process
The initial join stage is performed using the following logic:
(
df_a
.join_where(
df_b,
*join_predicates
)
.select(*join_cols)
.filter(*join_filters)
.sort(['index_a', 'index_b'])
)
- This expression contains several key components:
join_predicates: A list of predicates to applied during the join.
join_cols: A list of columns to select from the joined table.
join_filters: A list of filters to apply to the join.
Join predicates (join_predicates)are expressions on df_a and df_b that determine
which rows are joined. At this stage, column names will have “_a” or “_b” suffixes to indicate
their origin.
Join columns (join_cols) is a list of expressions that generate the final join table.
These expressions may be a single column, such as id_a, or an expression that generates a new
column, such as “ro” (derived from existing position and end columns).
Join filters (join_filters) are expressions that filter the joined table, and they operate
on columns created by Join columns.
More on the join strategy
With lazy evaluation and Polars query optimization, expressions may be used at multiple stages. For example, if a minimum “ro” (reciprocal overlap) is set, then the expression that calculates “ro” is used as a join predicate to limit the size of joins and again as a join column.
Alternatively, the expression generating “ro” could be used once as a join column, then the “ro” result could be used as a filter at the filter stage, however, applying the filter at the predicate stage may eliminate pairs of variants before they are processed through remaining stages.
This strategy also allows additional flexibility in the join process, for example, a filter can be enforced at the predicate stage and never appear as a column in the final table.
Note that currently, there is not a “drop” stage following the filter stage, so filters that do not use join columns must be applied as join predicates.
Customizing the join process
Most join operations can be defined simply by setting parameters (see Join parameters).
Join filters operate on the join columns and provide further filtering. These expressions are set
by the join parameters passed to the PairwiseOverlap constructor. Additional expressions
can also be added. These may affect join predicates, columns, or filters.
This is a powerful feature, and it should be used with caution. Use join parameters whenever possible, and augment them with additional expressions only when necessary. It is possible to use both, and the join parameters will come first.
When a PairwiseOverlap instance is created, additional expressions can be added to it
until either PairwiseOverlap.lock() is called or a join is performed (which locks the
object from further modifications).
Three methods can insert expressions:
PairwiseOverlap.append_join_predicates()PairwiseOverlap.append_join_cols()PairwiseOverlap.append_join_filters()
Each of these takes a single expression or an iterable object of expressions. Note that the output join table columns are affected by appended columns. This can be used to add additional statistics to the output table or to retain columns from the original tables (consider joining the output table with the original tables to retrieve columns in this case).
Expected columns (PairwiseOverlap.expected_cols) is updated automatically with each new
expression and cannot be modified directly.
Join Class Inspection
A PairwiseOverlap instance has properties for inspecting the join process. These
properties are set by join parameters and customized expressions.
- Join inspection properties:
join_predicates: A list of join predicates expressions.join_cols: A list of join column expressions.join_filters: A list of filter expressions.expected_cols: A list of column names expected indf_aanddf_b.
The values in PairwiseOverlap.expected_cols() are automatically derived from expressions in
PairwiseOverlap.join_predicates() and PairwiseOverlap.join_cols(). These columns
must be either present in df_a and df_b, or they must be auto-generated columns
(see AUTOGEN_COLS).
Note
These properties return copies of the instance’s internal lists. Modifying them will not affect the joins. See Customizing the join process below for information on altering the join process for complex use cases.
Attributes
Columns automatically generated by the join process for input tables if they are not present. |
|
Default number of df_a variants per chunk. Chunking controls peak memory and determines the batch |
|
List of the default join columns. |
|
Expression for computing offset distance. |
|
Expression for computing the offset proportion. |
|
Expression for overlap reciprocal overlap (RO). |
|
Expression for computing reciprocal size overlap (SZRO). |
|
Columns that appear in all joins regardless of parameters. |
|
Known join columns. |
|
Reserved columns are added automatically to input tables. |
|
Classes
Pairwise overlap class. |
|
One join strategy implemented by overlap. |
Package Contents
- class agglovar.pairwise.overlap.PairwiseOverlap(stages: collections.abc.Iterable[agglovar.pairwise.overlap._stage.PairwiseOverlapStage], join_cols: collections.abc.Iterable[str | polars.Expr] | None = None, drop_default_join_cols: bool = False, match_score_model: agglovar.seqmatch.MatchScoreModel | None = None, weight_strategy: agglovar.pairwise.weights.WeightStrategy = DEFAULT_WEIGHT_STRATEGY, force_end_ro: bool = False, chunk_size: int | None = None, n_threads: int | None = None)
Bases:
agglovar.pairwise.base.PairwiseJoinPairwise overlap class.
Join by overlapping variants by position.
- Variables:
match_score_model – (Advanced) Configured model for scoring similarity between pairs of sequences. If None and match_prop_min is set, then a default aligner will be used.
force_end_ro – (Advanced) By default, reciprocal overlap is calculated with the end position set to the start position plus the variant length. For all variants except insertions, this will typically match the end value in the source DataFrame. If True, the end position in the DataFrame is also used for reciprocal overlap without changes. Typically, this option should not be used and will break reciprocal overlap for insertions.
chunk_size – (Advanced) Chunk df_a into partitions of this size, and for each chunk, subset df_b to include only variants that may overlap with variants in the chunk. If None, each chromosome is a single chunk, which will lead to a combinatorial explosion unless offset_max is greater than 0.
- classmethod from_definition(overlap_def: collections.abc.Mapping[str, Any] | collections.abc.Iterable[collections.abc.Mapping[str, Any]])
Build a
PairwiseOverlapinstance from a stage definition mapping or sequence of mappings.
- join_iter(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) collections.abc.Iterator[polars.LazyFrame]
Find all pairs of variants in two sources that meet a set of criteria.
- Parameters:
df_a – Source dataframe.
df_b – Target dataframe.
retain_index – If True, do not drop an existing “_index” column in callset tables if they exist.
temp_dir – How to materialise the prepared tables before the chunked loop.
False(default) collects both into memory;Truewrites them to the system temp directory as parquet files; astr/Pathwrites them to that directory. Temp files are always removed on exit.
- Yields:
A LazyFrame for each chunk.
- property is_equi_offset: bool
True if all stages have an equivalent equi-join expression for position.
- property match_prop_expr
Polars expression that computes the per-pair match proportion.
- match_score_model: agglovar.seqmatch.MatchScoreModel
- property required_cols: set[str]
The minimum set of columns that must be present in input tables.
This is set based on parameters needed to perform the join. For example, if sequence matching is required, then “seq” will be in this list, and if “seq” does not exist in both df_a and df_b, then an error is raised.
- property weight_expr
Polars expression for the configured weight strategy.
- class agglovar.pairwise.overlap.PairwiseOverlapStage(*, ro_min: float | None = None, size_ro_min: float | None = None, offset_max: int | None = None, offset_prop_max: float | None = None, seg_ro_min: float | None = None, match_ref: bool = False, match_alt: bool = False, match_vartype: bool = False, match_filter_pass: bool = False, match_prop_min: float | None = None, join_predicates: collections.abc.Iterable[polars.Expr] | None = None, join_filters: collections.abc.Iterable[polars.Expr] | None = None)
One join strategy implemented by overlap.
- Variables:
ro_min – Minimum reciprocal overlap for allowed matches. If 0.0, then any overlap matches.
size_ro_min – Reciprocal length proportion of allowed matches. If match_prop_min is set and the value of this parameter is None or is less than match_prop_min, then it is set to match_prop_min since this value represents the lower-bound of allowed match proportions.
offset_max – Maximum offset allowed (minimum of start or end position distance).
offset_prop_max – Maximum size-offset (offset / varlen) allowed.
seg_ro_min – Minimum reciprocal overlap by segments for merging complex variants. Compares where segments are aligned for each complex variant and computes a reciprocal-overlap for aligned segments between two variants. Unaligned bases overlap by default (i.e. minimum total unaligned bases is in the numerator of the overlap calculation).
match_ref – “REF” column must match between two variants.
match_alt – “ALT” column must match between two variants.
match_vartype – “VARTYPE” column must match between two variants.
match_filter_pass – If true, then matched variants must either pass filters (“filter” column emtpy for both variants) or must fail filters (“filter” column non-empty for both variants). This does not compare the contents of the “filter” column, just the presence or absence of a filter.
match_prop_min – Minimum matched base proportion in alignment or None to not match.
- agglovar.pairwise.overlap.AUTOGEN_COLS: frozenset[str]
Columns automatically generated by the join process for input tables if they are not present.
- agglovar.pairwise.overlap.DEFAULT_CHUNK_SIZE: int = 1500
Default number of df_a variants per chunk. Chunking controls peak memory and determines the batch size presented to the sequence-alignment thread pool — larger values amortize thread-pool overhead better but use more memory. Tune alongside
n_threadsbased on available RAM and CPU count.
- agglovar.pairwise.overlap.EXPR_OFFSET_DIST
Expression for computing offset distance.
- agglovar.pairwise.overlap.EXPR_OFFSET_PROP
Expression for computing the offset proportion.
- agglovar.pairwise.overlap.EXPR_OVERLAP_RO
Expression for overlap reciprocal overlap (RO).
- agglovar.pairwise.overlap.EXPR_SZRO
Expression for computing reciprocal size overlap (SZRO).
- agglovar.pairwise.overlap.INVARIANT_JOIN_COLS: list[str] = ['index_a', 'index_b']
Columns that appear in all joins regardless of parameters.
- agglovar.pairwise.overlap.JOIN_COL_EXPR: dict[str, polars.Expr | collections.abc.Callable[[Any], polars.Expr]]
Known join columns.
- agglovar.pairwise.overlap.RESERVED_COLS: frozenset[str]
Reserved columns are added automatically to input tables.
- agglovar.pairwise.overlap.__all__