agglovar.util.lazy
Shared materialization helpers for LazyFrame-based join pipelines.
Chunked joins re-evaluate their input LazyFrame graphs every time the
inner loop calls .collect() on a derived expression. When the source is a
scan_parquet (or any frame where with_row_index blocks predicate
pushdown), this means a full source scan per chunk. The fix is to materialise
each input table exactly once before the chunk loop. materialize_pair is
the single source of truth for that policy across agglovar.pairwise
and agglovar.bed.
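The cost difference described above can be illustrated with a small, dependency-free toy model. This is only a sketch: CountingSource and both helper functions are hypothetical names, not part of agglovar or polars, and a call to read() merely stands in for one full evaluation of a lazy source.

```python
class CountingSource:
    """Hypothetical stand-in for a lazy scan; each read() models one full source scan."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self._rows)


def sum_chunks_rescanning(source, n_rows, chunk_size):
    """Evaluate the source inside the loop: one scan per chunk."""
    totals = []
    for start in range(0, n_rows, chunk_size):
        rows = source.read()  # models .collect() on a derived lazy expression
        totals.append(sum(rows[start:start + chunk_size]))
    return totals


def sum_chunks_materialised(source, n_rows, chunk_size):
    """Evaluate the source once, then slice the in-memory copy per chunk."""
    rows = source.read()  # single scan before the chunk loop
    return [
        sum(rows[start:start + chunk_size])
        for start in range(0, n_rows, chunk_size)
    ]


slow = CountingSource(range(10))
fast = CountingSource(range(10))
slow_totals = sum_chunks_rescanning(slow, 10, 4)
fast_totals = sum_chunks_materialised(fast, 10, 4)

# Same results, but the rescanning variant touched the source once per chunk.
print(slow_totals == fast_totals, slow.scans, fast.scans)
```

Both variants produce the same per-chunk results; only the number of source evaluations differs, and that number grows with the chunk count in the rescanning variant.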
The temp_dir argument has the same meaning everywhere it appears:
- False (default): collect both tables into memory and yield as LazyFrame (df.collect().lazy()). Lowest latency; uses RAM proportional to the input size.
- True: sink both tables to temporary parquet files in the system temp directory and yield scan_parquet over them. Files are unlinked when the context exits (even on error).
- str or pathlib.Path: same as True but in the given directory.
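The policy above could be sketched with the standard library alone. This is an assumption-laden illustration, not the module's actual implementation: resolve_temp_dir and temp_pair_paths are hypothetical helpers, and the sketch only reserves and cleans up the two file paths without writing any parquet data.

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Tuple, Union


def resolve_temp_dir(temp_dir: Union[bool, str, Path]) -> Optional[Path]:
    """Map a temp_dir argument to a directory, or None for the in-memory policy."""
    if not temp_dir:
        return None  # False: collect into memory, no files involved
    if temp_dir is True:
        return Path(tempfile.gettempdir())  # True: system temp directory
    return Path(temp_dir)  # str or Path: caller-chosen directory


@contextlib.contextmanager
def temp_pair_paths(
    directory: Path, prefix: str = "agglovar_lazy_"
) -> Iterator[Tuple[Path, Path]]:
    """Reserve two temp file paths and unlink them on exit, even on error."""
    paths = []
    try:
        for _ in range(2):
            fd, name = tempfile.mkstemp(
                prefix=prefix, suffix=".parquet", dir=str(directory)
            )
            os.close(fd)  # only the path is needed; a writer reopens it
            paths.append(Path(name))
        yield paths[0], paths[1]
    finally:
        for path in paths:
            path.unlink(missing_ok=True)


# Usage: the files exist inside the block and are gone afterwards, even on error.
with tempfile.TemporaryDirectory() as scratch:
    try:
        with temp_pair_paths(Path(scratch)) as (path_a, path_b):
            existed_inside = path_a.exists() and path_b.exists()
            raise RuntimeError("simulated failure in the chunk loop")
    except RuntimeError:
        pass
    removed_after = not path_a.exists() and not path_b.exists()
```

Putting the unlink calls in a finally clause is what gives the "even on error" guarantee: cleanup runs whether the body of the with block returns normally or raises.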
Functions
materialize_pair | Yield (df_a, df_b) materialised once according to temp_dir.
Module Contents
- agglovar.util.lazy.materialize_pair(df_a: polars.LazyFrame, df_b: polars.LazyFrame, temp_dir: bool | str | pathlib.Path = False, prefix: str = 'agglovar_lazy_') → Iterator[tuple[polars.LazyFrame, polars.LazyFrame]]
Yield (df_a, df_b) materialised once according to temp_dir.
- Parameters:
df_a – First lazy table.
df_b – Second lazy table.
temp_dir – Materialisation policy. See module docstring.
prefix – Filename prefix used for temp parquet files (only when
temp_dir is truthy).
- Yields:
A pair of LazyFrame instances backed by the materialised data.