agglovar.pairwise.base
Base class for pairwise intersect strategies.
Defines an interface and implements common functionality for pairwise intersect strategies.
Classes
PairwiseJoin | Base class for pairwise intersection classes.
Module Contents
- class agglovar.pairwise.base.PairwiseJoin(weight_strategy: agglovar.pairwise.weights.WeightStrategy = DEFAULT_WEIGHT_STRATEGY)
Bases: abc.ABC
Base class for pairwise intersection classes.
- check_required_cols(df: polars.LazyFrame | polars.DataFrame | collections.abc.Iterable[str], raise_exception: bool = False) → set[str]
Check if a table has the expected columns.
- Parameters:
df – Table to check.
raise_exception – If True, raise an exception if any expected columns are missing.
- Returns:
A set of missing columns.
- Raises:
ValueError – If any expected columns are missing and raise_exception is True.
- check_reserved_cols(df: polars.LazyFrame | polars.DataFrame | collections.abc.Iterable[str], raise_exception: bool = False) → set[str]
Check if a table has reserved columns.
- Parameters:
df – Table to check.
raise_exception – If True, raise an exception if any reserved columns are found.
- Returns:
A set of reserved columns found in the table.
- Raises:
ValueError – If any reserved columns are found and raise_exception is True.
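The two column checks above mirror each other: one reports required columns that are absent, the other reports reserved columns that are present. The following is a minimal sketch of that behaviour using plain set operations on column names. It is not the library's implementation; the function names and both column sets are hypothetical placeholders chosen for illustration.

```python
from collections.abc import Iterable

# Hypothetical column sets. The real values come from the required_cols and
# reserved_cols properties of a concrete PairwiseJoin subclass.
REQUIRED_COLS = {"chrom", "pos", "end"}
RESERVED_COLS = {"_reserved_example"}


def missing_required(columns: Iterable[str], raise_exception: bool = False) -> set:
    """Return the required columns that are absent from ``columns``."""
    missing = REQUIRED_COLS - set(columns)
    if missing and raise_exception:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return missing


def found_reserved(columns: Iterable[str], raise_exception: bool = False) -> set:
    """Return the reserved columns that are present in ``columns``."""
    found = RESERVED_COLS & set(columns)
    if found and raise_exception:
        raise ValueError(f"Reserved columns found: {sorted(found)}")
    return found
```

With raise_exception left at False, a caller can inspect the returned set and build its own error message; with True, the check doubles as an input guard.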
- join(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) → polars.LazyFrame
Find all pairs of variants in two sources that meet a set of criteria.
This is a convenience function that calls join_iter() and concatenates the results.
- Parameters:
df_a – Table A.
df_b – Table B.
retain_index – If True, keep an existing “_index” column in the callset tables instead of dropping it.
temp_dir – See join_iter().
- Returns:
A join table.
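The relationship between join() and join_iter() can be sketched with a toy abstract base class: subclasses implement the chunked iterator, and the base class supplies a concrete join() that concatenates the chunks. This is an illustration of the pattern only. Lists of ID pairs stand in for LazyFrames, and the class names and the exact-match rule are hypothetical.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator
from itertools import chain


class ToyPairwiseJoin(ABC):
    """Toy stand-in: lists of (id_a, id_b) pairs replace LazyFrames."""

    @abstractmethod
    def join_iter(self, ids_a: list, ids_b: list) -> Iterator:
        """Yield one list of matched pairs per chunk."""

    def join(self, ids_a: list, ids_b: list) -> list:
        """Concatenate every chunk from join_iter() into one result."""
        return list(chain.from_iterable(self.join_iter(ids_a, ids_b)))


class ExactMatchJoin(ToyPairwiseJoin):
    """Hypothetical strategy: pair equal values, two items of ids_a per chunk."""

    def join_iter(self, ids_a, ids_b):
        targets = set(ids_b)
        for start in range(0, len(ids_a), 2):
            chunk = ids_a[start:start + 2]
            yield [(a, a) for a in chunk if a in targets]


pairs = ExactMatchJoin().join([1, 2, 3, 4, 5], [2, 4, 5, 9])
```

Callers that can process results incrementally would iterate join_iter() directly; join() suits cases where a single combined table is more convenient.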
- abstractmethod join_iter(df_a: polars.DataFrame | polars.LazyFrame, df_b: polars.DataFrame | polars.LazyFrame, retain_index: bool = False, temp_dir: bool | str | pathlib.Path = False) → collections.abc.Iterator[polars.LazyFrame]
Find all pairs of variants in two sources that meet a set of criteria.
- Parameters:
df_a – Source dataframe.
df_b – Target dataframe.
retain_index – If True, keep an existing “_index” column in the callset tables instead of dropping it.
temp_dir – How to materialise the prepared tables before the chunked loop. False (default) collects both into memory; True writes them to the system temp directory as Parquet files; a str/Path writes them to that directory. Temp files are always removed on exit.
- Yields:
A LazyFrame for each chunk.
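The three forms of temp_dir described above can be sketched as a small context manager that resolves the argument to a staging location and cleans up afterwards. This is a hedged illustration using only the standard library, not the library's code. The name staging_dir is hypothetical, and the sketch creates a scratch subdirectory inside the chosen location, which is an assumption made to keep cleanup simple.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Union


@contextmanager
def staging_dir(temp_dir: Union[bool, str, Path] = False):
    """Yield a directory for staged files, or None to keep data in memory."""
    if temp_dir is False:
        # Default: nothing is written to disk.
        yield None
        return
    # True selects the system temp directory (dir=None); a str or Path
    # selects that directory as the parent of the scratch subdirectory.
    parent = None if temp_dir is True else Path(temp_dir)
    with tempfile.TemporaryDirectory(dir=parent) as tmp:
        # The scratch directory and its contents are removed when the
        # with-block exits, including on error.
        yield Path(tmp)
```

A caller would wrap the chunked loop in `with staging_dir(temp_dir) as path:` and write the prepared tables under `path` only when it is not None.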
- abstract property required_cols: set[str]
The minimum set of columns that must be present in input tables.
- property reserved_cols: set[str]
A set of columns that are reserved for internal use and must not be present in input tables.
- property weight_strategy: agglovar.pairwise.weights.WeightStrategy
Weight strategy to use for this join.
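The three properties can be pictured with a toy class in which required_cols is abstract and the other two are concrete. The sketch below makes several assumptions that the reference does not confirm: that weight_strategy is stored by the constructor and exposed read-only, that a string can stand in for a WeightStrategy object, and that the column names shown are representative. All names are hypothetical.

```python
from abc import ABC, abstractmethod


class ToyJoinBase(ABC):
    """Toy base class; a string stands in for a WeightStrategy object."""

    def __init__(self, weight_strategy: str = "default"):
        self._weight_strategy = weight_strategy

    @property
    def weight_strategy(self) -> str:
        """Strategy supplied at construction (read-only in this sketch)."""
        return self._weight_strategy

    @property
    @abstractmethod
    def required_cols(self) -> set:
        """Each concrete strategy must declare the columns it needs."""

    @property
    def reserved_cols(self) -> set:
        """Hypothetical internal column names that inputs must not use."""
        return {"_reserved_example"}


class ToyJoin(ToyJoinBase):
    """Hypothetical concrete strategy with made-up required columns."""

    @property
    def required_cols(self) -> set:
        return {"chrom", "pos", "end"}
```

Because required_cols is abstract, the base class cannot be instantiated; only subclasses that define it can.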