agglovar.bed.join
Base join operations for intersects.
Functions
pairwise_join(): Join two tables.
pairwise_join_iter(): Join two tables, yielding one LazyFrame per chunk.
Module Contents
- agglovar.bed.join.pairwise_join(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, col_names_b: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, temp_dir: bool | str | pathlib.Path = False) polars.LazyFrame
Join two tables.
Thin wrapper around pairwise_join_iter() that concatenates all yielded chunks into a single table.
Returns a table with columns:
index_a: Index in table a.
index_b: Index in table b.
chrom: Chromosome matched.
pos: Start position of intersection.
end: End position of intersection.
distance: Distance between the two intervals with negative values representing overlapping intervals.
Note that if padding is greater than 0, the “pos” and “end” will have been modified to include padding.
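The signed distance column can be illustrated with a small pure-Python sketch. It assumes half-open BED-style intervals and one plausible convention, `max(pos) - min(end)`, which is negative by the overlap length when two intervals overlap. The `Interval` type, the helper names, and the inclusive `<=` comparison are assumptions made for illustration, not the library's confirmed formula.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Half-open interval [pos, end), as in BED (an assumption of this sketch)."""

    chrom: str
    pos: int
    end: int


def signed_distance(a: Interval, b: Interval) -> int:
    """Gap between two intervals; a negative value is the overlap length, negated."""
    if a.chrom != b.chrom:
        raise ValueError("intervals must be on the same chromosome")
    return max(a.pos, b.pos) - min(a.end, b.end)


def within(a: Interval, b: Interval, distance: int = 0) -> bool:
    """True when the pair passes a maximum-distance filter.

    The inclusive comparison is an assumption; a negative distance
    requires at least that many overlapping bases.
    """
    return a.chrom == b.chrom and signed_distance(a, b) <= distance
```

Under this convention, a negative `distance` argument acts as a minimum-overlap requirement, which matches the description "may be negative to force overlapping".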
- Parameters:
df_a – Table a.
df_b – Table b.
distance – Maximum distance between two records. May be negative to force overlapping.
chunk_size – Chunk A by this size per chromosome to bound the IEJoin working set.
col_names_a – Columns to select from df_a. If None, the object defaults are used.
col_names_b – Columns to select from df_b. If None, the object defaults are used.
temp_dir – How to materialise the prepared tables before the chunked loop.
False (default) collects both into memory; True writes them to the system temp directory as parquet files; a str/Path writes them to that directory. Temp files are always removed on exit.
- Returns:
A LazyFrame with the joined tables.
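The three temp_dir modes described above can be summarised in a short sketch. `resolve_temp_dir` is a hypothetical helper written only to illustrate the documented behaviour; it is not part of agglovar, and the library's internal handling may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def resolve_temp_dir(temp_dir: Union[bool, str, Path] = False) -> Optional[Path]:
    """Hypothetical helper: map a temp_dir argument to a target directory.

    Returns None when the tables should simply be collected into memory.
    """
    if temp_dir is False:
        return None  # default: keep both prepared tables in memory
    if temp_dir is True:
        return Path(tempfile.gettempdir())  # system temp directory
    return Path(temp_dir)  # caller-supplied directory
```

The booleans are checked by identity before the path case, so a str or Path value is never mistaken for a flag.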
- agglovar.bed.join.pairwise_join_iter(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, col_names_b: agglovar.bed.col.CoordCol | Iterable[str] | str | None = None, temp_dir: bool | str | pathlib.Path = False) Iterator[polars.LazyFrame]
Join two tables, yielding one LazyFrame per chunk.
Returns chunks with the same columns as pairwise_join(). At least one chunk is always yielded; an empty schema-only frame is yielded when no chunk would otherwise have been produced (so callers can safely call pl.concat on the result).
Each chunk is the result of a per-chromosome join_where (Polars IEJoin) over a bounded slice of A and the chrom-matched B records pre-filtered to that slice’s range.
- Parameters:
df_a – Table a.
df_b – Table b.
distance – Maximum distance between two records. May be negative to force overlapping.
chunk_size – Chunk A by this size per chromosome to bound the IEJoin working set.
col_names_a – Columns to select from df_a. If None, the object defaults are used.
col_names_b – Columns to select from df_b. If None, the object defaults are used.
temp_dir – How to materialise the prepared tables before the chunked loop. See pairwise_join().
- Returns:
An iterator of LazyFrames.
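The chunking strategy described for this function can be sketched in plain Python, with lists of tuples standing in for LazyFrames and a nested loop standing in for the Polars IEJoin. The record layout, the `join_chunks` name, the treatment of chunk_size as a record count, and the `max(pos) - min(end)` distance convention are all assumptions made for illustration; this is not the library's implementation.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[str, int, int]      # (chrom, pos, end); hypothetical layout
Match = Tuple[int, int, str, int]  # (index_a, index_b, chrom, distance)


def join_chunks(
    a: Sequence[Record],
    b: Sequence[Record],
    distance: int = 0,
    chunk_size: int = 2,
) -> Iterator[List[Match]]:
    """Yield one list of matches per bounded slice of A, per chromosome."""
    yielded = False
    for chrom in sorted({rec[0] for rec in a}):
        # Indices of A on this chromosome, ordered by start position.
        rows_a = sorted(
            (i for i, rec in enumerate(a) if rec[0] == chrom),
            key=lambda i: a[i][1],
        )
        rows_b = [j for j, rec in enumerate(b) if rec[0] == chrom]
        for start in range(0, len(rows_a), chunk_size):
            chunk = rows_a[start:start + chunk_size]
            # Pre-filter B to records that can reach this slice's range.
            lo = min(a[i][1] for i in chunk) - distance
            hi = max(a[i][2] for i in chunk) + distance
            near_b = [j for j in rows_b if b[j][2] >= lo and b[j][1] <= hi]
            matches: List[Match] = []
            for i in chunk:
                for j in near_b:
                    gap = max(a[i][1], b[j][1]) - min(a[i][2], b[j][2])
                    if gap <= distance:
                        matches.append((i, j, chrom, gap))
            yielded = True
            yield matches
    if not yielded:
        # Stands in for the empty schema-only frame, so callers can
        # always concatenate the result.
        yield []
```

Because at least one chunk is always yielded, a caller can flatten or concatenate the output without special-casing empty input, which is the same guarantee the documentation gives for calling pl.concat on the yielded frames.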