agglovar.bed.join

Base join operations for intersects.

Functions

pairwise_join(→ polars.LazyFrame)

Join two tables.

pairwise_join_iter(→ Iterator[polars.LazyFrame])

Join two tables, yielding one LazyFrame per chunk.

Module Contents

agglovar.bed.join.pairwise_join(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, col_names_b: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, temp_dir: bool | str | pathlib.Path = False) polars.LazyFrame

Join two tables.

Thin wrapper around pairwise_join_iter() that concatenates all yielded chunks into a single table.

Returns a table with columns:

  • index_a: Index in table a.

  • index_b: Index in table b.

  • chrom: Chromosome matched.

  • pos: Start position of intersection.

  • end: End position of intersection.

  • distance: Distance between the two intervals. Negative values indicate overlapping intervals.

Note that if distance is greater than 0, “pos” and “end” will have been modified to include the padding.
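The sign convention of the distance column can be illustrated with a minimal sketch. It assumes half-open [pos, end) coordinates, that a negative value equals the number of overlapping bases, and that the distance argument is an inclusive upper bound; none of these details are confirmed by this page, and interval_distance and within_distance are hypothetical helpers, not part of agglovar.

```python
def interval_distance(pos_a: int, end_a: int, pos_b: int, end_b: int) -> int:
    """Signed gap between two intervals (hypothetical helper).

    Assumption: intervals are half-open [pos, end). A positive result is the
    gap between them; a negative result is the size of their overlap.
    """
    return max(pos_a, pos_b) - min(end_a, end_b)


def within_distance(pos_a: int, end_a: int, pos_b: int, end_b: int,
                    distance: int = 0) -> bool:
    """Return True if the signed gap does not exceed `distance`.

    Assumption: the bound is inclusive. A negative `distance` then only
    accepts pairs that overlap by at least that many bases.
    """
    return interval_distance(pos_a, end_a, pos_b, end_b) <= distance
```

Under these assumptions, a pair overlapping by 50 bases has distance -50 and passes any threshold of -50 or greater, while a pair separated by 100 bases only passes when distance is at least 100.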

Parameters:
  • df_a – Table a.

  • df_b – Table b.

  • distance – Maximum distance between two records. May be negative to require overlap.

  • chunk_size – Split table a into chunks of this size, per chromosome, to bound the IEJoin working set.

  • col_names_a – Columns to select from df_a. If None, use object defaults.

  • col_names_b – Columns to select from df_b. If None, use object defaults.

  • temp_dir – How to materialise the prepared tables before the chunked loop. False (default) collects both into memory; True writes them to the system temp directory as parquet files; a str/Path writes them to that directory. Temp files are always removed on exit.

Returns:

A LazyFrame with the joined tables.
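The three forms of temp_dir described above can be pictured with a small standard-library sketch. This is not the agglovar implementation: staging_dir is a hypothetical helper, and creating a uniquely named scratch subdirectory (rather than writing directly into the given directory) is an assumption made so that cleanup stays simple.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional, Union


@contextmanager
def staging_dir(temp_dir: Union[bool, str, Path] = False) -> Iterator[Optional[Path]]:
    """Yield a scratch directory, or None when tables should stay in memory.

    Hypothetical helper mirroring the documented temp_dir semantics:
      False     -> no directory (collect in memory)
      True      -> scratch directory under the system temp directory
      str/Path  -> scratch directory under the given directory
    """
    if temp_dir is False:
        yield None
        return

    # dir=None tells tempfile to use the system temp directory.
    parent = None if temp_dir is True else str(temp_dir)

    # TemporaryDirectory removes the directory and its contents on exit,
    # including when the body raises.
    with tempfile.TemporaryDirectory(dir=parent) as scratch:
        yield Path(scratch)
```

A caller would write the prepared tables inside the yielded directory and run the chunked loop within the with block, so the files are gone once the block exits.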

agglovar.bed.join.pairwise_join_iter(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, col_names_b: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, temp_dir: bool | str | pathlib.Path = False) Iterator[polars.LazyFrame]

Join two tables, yielding one LazyFrame per chunk.

Returns chunks with the same columns as pairwise_join(). At least one chunk is always yielded; an empty schema-only frame is yielded when no chunk would otherwise have been produced (so callers can safely call pl.concat on the result).
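The "at least one chunk" guarantee can be sketched without Polars. In the sketch below, tables are plain dicts of column lists, concat is a stand-in that rejects an empty input, and empty_table and iter_chunks are hypothetical names; only the column names come from this page.

```python
from typing import Dict, Iterable, Iterator, List

COLUMNS = ("index_a", "index_b", "chrom", "pos", "end", "distance")
Table = Dict[str, List]


def empty_table() -> Table:
    """Schema-only table: every output column present, no rows."""
    return {col: [] for col in COLUMNS}


def iter_chunks(chunks: Iterable[Table]) -> Iterator[Table]:
    """Pass chunks through, yielding one empty table if there were none."""
    yielded = False
    for chunk in chunks:
        yielded = True
        yield chunk
    if not yielded:
        yield empty_table()


def concat(chunks: Iterable[Table]) -> Table:
    """Stand-in concatenation that refuses an empty input."""
    chunks = list(chunks)
    if not chunks:
        raise ValueError("cannot concatenate an empty list of tables")
    out = empty_table()
    for chunk in chunks:
        for col in COLUMNS:
            out[col].extend(chunk[col])
    return out
```

Because iter_chunks never produces an empty sequence, concat(iter_chunks(...)) always returns a table with the expected columns, even when nothing matched.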

Each chunk is the result of a per-chromosome join_where (Polars IEJoin) over a bounded slice of A and the chrom-matched B records pre-filtered to that slice’s range.
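The chunking strategy described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: records are (index, chrom, pos, end) tuples, chunk_size is taken to count records of A, and iter_chunk_pairs is a hypothetical name. The actual join inside each chunk (Polars join_where in the library) is left out; the sketch only shows how A is sliced and how B is pre-filtered.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, str, int, int]  # (index, chrom, pos, end)


def iter_chunk_pairs(
    records_a: Iterable[Record],
    records_b: Iterable[Record],
    chunk_size: int,
    distance: int = 0,
) -> Iterator[Tuple[List[Record], List[Record]]]:
    """Yield (slice of A, candidate B records) pairs, one per chunk."""
    by_chrom_a: Dict[str, List[Record]] = defaultdict(list)
    by_chrom_b: Dict[str, List[Record]] = defaultdict(list)
    for rec in records_a:
        by_chrom_a[rec[1]].append(rec)
    for rec in records_b:
        by_chrom_b[rec[1]].append(rec)

    for chrom, recs_a in by_chrom_a.items():
        # Sort by start so each slice covers a compact coordinate range.
        recs_a.sort(key=lambda r: r[2])
        for start in range(0, len(recs_a), chunk_size):
            slice_a = recs_a[start:start + chunk_size]
            # Coordinate range of the slice, widened by `distance`.
            lo = min(r[2] for r in slice_a) - distance
            hi = max(r[3] for r in slice_a) + distance
            # Keep only same-chromosome B records that can reach the range.
            slice_b = [
                r for r in by_chrom_b.get(chrom, [])
                if r[3] >= lo and r[2] <= hi
            ]
            yield slice_a, slice_b
```

Each yielded pair bounds the work of one join: at most chunk_size records from A, and only the B records close enough to that slice to possibly match.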

Parameters:
  • df_a – Table a.

  • df_b – Table b.

  • distance – Maximum distance between two records. May be negative to require overlap.

  • chunk_size – Split table a into chunks of this size, per chromosome, to bound the IEJoin working set.

  • col_names_a – Columns to select from df_a. If None, use object defaults.

  • col_names_b – Columns to select from df_b. If None, use object defaults.

  • temp_dir – How to materialise the prepared tables before the chunked loop. See pairwise_join().

Returns:

An iterator of LazyFrames.