agglovar.util.lazy

Shared materialization helpers for LazyFrame-based join pipelines.

Chunked joins re-evaluate their input LazyFrame graphs every time the inner loop calls .collect() on a derived expression. When the source is a scan_parquet (or any frame where with_row_index blocks predicate pushdown), every chunk triggers a full scan of the source. The fix is to materialise each input table exactly once before the chunk loop. materialize_pair is the single source of truth for that policy across agglovar.pairwise and agglovar.bed.

The temp_dir argument has the same meaning everywhere it appears:

  • False (default): collect both tables into memory and yield them as LazyFrame objects (df.collect().lazy()). Lowest latency; uses RAM proportional to the input size.

  • True: sink both tables to temporary parquet files in the system temp directory and yield scan_parquet frames over those files. The files are unlinked when the context exits, even on error.

  • str or pathlib.Path: same as True, but the files are written to the given directory.

Functions

materialize_pair(→ Iterator[tuple[polars.LazyFrame, ...)

Yield (df_a, df_b) materialised once according to temp_dir.

Module Contents

agglovar.util.lazy.materialize_pair(df_a: polars.LazyFrame, df_b: polars.LazyFrame, temp_dir: bool | str | pathlib.Path = False, prefix: str = 'agglovar_lazy_') → Iterator[tuple[polars.LazyFrame, polars.LazyFrame]]

Yield (df_a, df_b) materialised once according to temp_dir.

Parameters:
  • df_a – First lazy table.

  • df_b – Second lazy table.

  • temp_dir – Materialisation policy. See module docstring.

  • prefix – Filename prefix used for temp parquet files (only when temp_dir is truthy).

Yields:

A pair of LazyFrame instances backed by the materialised data.