agglovar.util.lazy
==================

.. py:module:: agglovar.util.lazy

.. autoapi-nested-parse::

   Shared materialization helpers for LazyFrame-based join pipelines.

   Chunked joins re-evaluate their input ``LazyFrame`` graphs every time the inner
   loop calls ``.collect()`` on a derived expression. When the source is a
   ``scan_parquet`` (or any frame with ``with_row_index`` blocking predicate
   pushdown) this means a full source scan per chunk. The fix is to materialise
   each input table exactly once before the chunk loop.

   ``materialize_pair`` is the single-source-of-truth for that policy across
   :mod:`agglovar.pairwise` and :mod:`agglovar.bed`.

   The ``temp_dir`` argument has the same meaning everywhere it appears:

   * ``False`` (default): collect both tables into memory and yield as
     ``LazyFrame`` (``df.collect().lazy()``). Lowest latency; uses RAM
     proportional to the input size.
   * ``True``: sink both tables to temporary parquet files in the system temp
     directory and yield ``scan_parquet`` over them. Files are unlinked when the
     context exits (even on error).
   * ``str`` or ``pathlib.Path``: same as ``True`` but in the given directory.


Functions
---------

.. autoapisummary::

   agglovar.util.lazy.materialize_pair


Module Contents
---------------

.. py:function:: materialize_pair(df_a: polars.LazyFrame, df_b: polars.LazyFrame, temp_dir: bool | str | pathlib.Path = False, prefix: str = 'agglovar_lazy_') -> Iterator[tuple[polars.LazyFrame, polars.LazyFrame]]

   Yield ``(df_a, df_b)`` materialised once according to ``temp_dir``.

   :param df_a: First lazy table.
   :param df_b: Second lazy table.
   :param temp_dir: Materialisation policy. See module docstring.
   :param prefix: Filename prefix used for temp parquet files (only when ``temp_dir`` is truthy).
   :yields: A pair of ``LazyFrame`` instances backed by the materialised data.
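
The ``temp_dir`` policy described above can be sketched as a small context
manager. This is an illustration only, not the library's implementation:
``materialize_pair_sketch`` is a hypothetical name, and the choices of
``tempfile.mkstemp`` for file creation and ``contextlib.contextmanager`` for the
wrapper are assumptions. It relies only on the public polars calls
``LazyFrame.collect``, ``LazyFrame.sink_parquet`` and ``polars.scan_parquet``.

.. code-block:: python

   from __future__ import annotations

   import contextlib
   import os
   import pathlib
   import tempfile
   from typing import Iterator

   import polars as pl


   @contextlib.contextmanager
   def materialize_pair_sketch(
       df_a: pl.LazyFrame,
       df_b: pl.LazyFrame,
       temp_dir: bool | str | pathlib.Path = False,
       prefix: str = "agglovar_lazy_",
   ) -> Iterator[tuple[pl.LazyFrame, pl.LazyFrame]]:
       """Hypothetical sketch of the documented policy (not agglovar's code)."""
       if temp_dir is False:
           # In-memory policy: evaluate each graph once, then expose lazy views
           # over the collected data.
           yield df_a.collect().lazy(), df_b.collect().lazy()
           return

       # True selects the system temp directory; a str or Path names a directory.
       directory = None if temp_dir is True else str(temp_dir)
       paths: list[str] = []
       try:
           for df in (df_a, df_b):
               fd, path = tempfile.mkstemp(
                   prefix=prefix, suffix=".parquet", dir=directory
               )
               os.close(fd)  # only the path is needed; polars reopens the file
               paths.append(path)
               df.sink_parquet(path)  # evaluate the graph once, write to disk
           yield pl.scan_parquet(paths[0]), pl.scan_parquet(paths[1])
       finally:
           # Unlink the temp files even if the body of the ``with`` block raised.
           for path in paths:
               with contextlib.suppress(FileNotFoundError):
                   os.unlink(path)


   if __name__ == "__main__":
       left = pl.LazyFrame({"id": [1, 2, 3]}).with_row_index("idx")
       right = pl.LazyFrame({"id": [2, 3, 4]})

       # Materialise once, then run every chunk against the cached tables.
       with materialize_pair_sketch(left, right, temp_dir=True) as (a, b):
           for start in range(0, 3, 2):
               chunk = a.filter(pl.col("idx").is_between(start, start + 1))
               print(chunk.join(b, on="id").collect())

Each ``.collect()`` inside the ``with`` block reads the materialised copies, so
the original sources are evaluated only once regardless of the number of chunks.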