agglovar.pairwise.base

Base class for pairwise intersect strategies.

Defines an interface and implements common functionality for pairwise intersect strategies.

Classes

PairwiseJoin

Base class for pairwise intersection classes.

Module Contents

class agglovar.pairwise.base.PairwiseJoin(weight_strategy: agglovar.pairwise.weights.WeightStrategy = DEFAULT_WEIGHT_STRATEGY)

Bases: abc.ABC

Base class for pairwise intersection classes.

check_required_cols(df: polars.LazyFrame | polars.DataFrame | collections.abc.Iterable[str], raise_exception: bool = False) → set[str]

Check if a table has the expected columns.

Parameters:
  • df – Table to check.

  • raise_exception – If True, raise an exception if any expected columns are missing.

Returns:

A set of missing columns.

Raises:

ValueError – If any expected columns are missing and raise_exception is True.
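The check described above amounts to a set difference between the expected columns and the columns present. The following is a minimal sketch of that idea, not the library's implementation: the function name missing_required_cols, the constant REQUIRED_COLS, and the column names are all hypothetical, and only the Iterable[str] input form is covered.

```python
from __future__ import annotations

from collections.abc import Iterable

# Hypothetical stand-in for the value of `required_cols`;
# the real set is defined by each PairwiseJoin subclass.
REQUIRED_COLS = {"chrom", "pos", "end"}


def missing_required_cols(columns: Iterable[str], raise_exception: bool = False) -> set[str]:
    """Return the required column names that are absent from `columns`."""
    missing = REQUIRED_COLS - set(columns)
    if missing and raise_exception:
        raise ValueError(f"Missing required columns: {', '.join(sorted(missing))}")
    return missing


print(missing_required_cols(["chrom", "pos"]))  # {'end'}
```

With raise_exception left at False the caller can inspect the returned set and decide how to report the problem; with True the same condition surfaces as a ValueError.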

check_reserved_cols(df: polars.LazyFrame | polars.DataFrame | collections.abc.Iterable[str], raise_exception: bool = False) → set[str]

Check if a table has reserved columns.

Parameters:
  • df – Table to check.

  • raise_exception – If True, raise an exception if any reserved columns are found.

Returns:

A set of reserved columns found in the table.

Raises:

ValueError – If any reserved columns are found and raise_exception is True.
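The reserved-column check is the mirror image of the required-column check: an intersection rather than a difference. The sketch below assumes invented reserved names and a hypothetical helper found_reserved_cols; it shows one way a caller might use the non-raising mode to rename clashing columns before a join.

```python
from __future__ import annotations

from collections.abc import Iterable

# Invented names for illustration; the real set comes from `reserved_cols`.
RESERVED_COLS = {"_tmp_id", "_tmp_weight"}


def found_reserved_cols(columns: Iterable[str], raise_exception: bool = False) -> set[str]:
    """Return the reserved column names that appear in `columns`."""
    found = RESERVED_COLS & set(columns)
    if found and raise_exception:
        raise ValueError(f"Reserved columns found: {', '.join(sorted(found))}")
    return found


columns = ["chrom", "pos", "_tmp_id"]
clashes = found_reserved_cols(columns)

# Prefix any clashing column so the input no longer collides with internal names.
renamed = [f"user{name}" if name in clashes else name for name in columns]
print(renamed)  # ['chrom', 'pos', 'user_tmp_id']
```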

join(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) → polars.LazyFrame

Find all pairs of variants in two sources that meet a set of criteria.

This is a convenience function that calls join_iter() and concatenates the results.

Parameters:
  • df_a – Table A.

  • df_b – Table B.

  • retain_index – If True, keep an existing “_index” column in the callset tables instead of dropping it.

  • temp_dir – See join_iter().

Returns:

A join table: the concatenated chunks from join_iter(), returned as a LazyFrame.
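The relationship between join() and join_iter() is a template-method pattern: subclasses supply the chunked iterator, and the base class concatenates whatever it yields. The sketch below illustrates only that pattern, using plain lists in place of LazyFrames. The classes ChunkedJoin and EqualJoin are hypothetical, and how the real method concatenates its chunks is not stated on this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Iterator
from itertools import chain


class ChunkedJoin(ABC):
    """Hypothetical stand-in for PairwiseJoin that works on lists of integers."""

    @abstractmethod
    def join_iter(self, rows_a: list[int], rows_b: list[int]) -> Iterator[list[tuple[int, int]]]:
        """Yield one list of matching pairs per chunk."""

    def join(self, rows_a: list[int], rows_b: list[int]) -> list[tuple[int, int]]:
        """Concatenate every chunk produced by join_iter()."""
        return list(chain.from_iterable(self.join_iter(rows_a, rows_b)))


class EqualJoin(ChunkedJoin):
    """Pair up equal values, processing rows_a two rows at a time."""

    def join_iter(self, rows_a, rows_b):
        for start in range(0, len(rows_a), 2):
            chunk = rows_a[start:start + 2]
            yield [(a, b) for a in chunk for b in rows_b if a == b]


pairs = EqualJoin().join([1, 2, 3, 4, 5], [2, 5, 9])
print(pairs)  # [(2, 2), (5, 5)]
```

Callers that cannot hold the whole result at once can iterate over join_iter() directly and handle each chunk as it arrives.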

abstractmethod join_iter(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) → collections.abc.Iterator[polars.LazyFrame]

Find all pairs of variants in two sources that meet a set of criteria.

Parameters:
  • df_a – Source dataframe.

  • df_b – Target dataframe.

  • retain_index – If True, keep an existing “_index” column in the callset tables instead of dropping it.

  • temp_dir – How to materialise the prepared tables before the chunked loop. False (default) collects both into memory; True writes them to the system temp directory as parquet files; a str/Path writes them to that directory. Temp files are always removed on exit.

Yields:

A LazyFrame for each chunk.
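The three temp_dir modes can be pictured as a context manager that yields either nothing (keep tables in memory) or a scratch directory that is removed on exit. This is a sketch under stated assumptions, not the library's code: resolve_temp_dir is a hypothetical helper, and creating a scratch subdirectory under a user-supplied path is one possible reading of "writes them to that directory".

```python
from __future__ import annotations

import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def resolve_temp_dir(temp_dir: bool | str | Path = False) -> Iterator[Path | None]:
    """Yield a scratch directory, or None when tables should stay in memory."""
    if temp_dir is False:
        yield None  # In-memory mode: nothing to create or clean up.
        return
    # True -> system temp location; str/Path -> scratch space under that directory.
    parent = None if temp_dir is True else Path(temp_dir)
    with tempfile.TemporaryDirectory(dir=parent) as scratch:
        yield Path(scratch)  # Deleted when the with-block exits, even on error.


with resolve_temp_dir(True) as scratch:
    placeholder = scratch / "table_a.parquet"
    placeholder.write_bytes(b"")  # A real implementation would write parquet data here.
    print(placeholder.exists())  # True

print(placeholder.exists())  # False
```

Tying cleanup to the context manager rather than to an explicit delete call is what makes the "always removed on exit" guarantee hold when the chunked loop raises.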

abstract property required_cols: set[str]

The minimum set of columns that must be present in input tables.
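Because required_cols is abstract, a concrete join class has to override it before it can be instantiated. The sketch below shows the standard Python mechanism with hypothetical stand-in classes (JoinBase, IntervalJoin) and invented column names.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class JoinBase(ABC):
    """Hypothetical stand-in for PairwiseJoin."""

    @property
    @abstractmethod
    def required_cols(self) -> set[str]:
        """Columns every input table must provide."""


class IntervalJoin(JoinBase):
    """Concrete subclass that declares its own required columns."""

    @property
    def required_cols(self) -> set[str]:
        return {"chrom", "pos", "end"}  # Invented column names.


try:
    JoinBase()  # Abstract property not overridden, so instantiation fails.
except TypeError as err:
    print("cannot instantiate:", type(err).__name__)  # cannot instantiate: TypeError

print(sorted(IntervalJoin().required_cols))  # ['chrom', 'end', 'pos']
```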

property reserved_cols: set[str]

A set of columns that are reserved for internal use and must not be present in input tables.

property weight_strategy: agglovar.pairwise.weights.WeightStrategy

Weight strategy to use for this join.