agglovar.bed.join
=================

.. py:module:: agglovar.bed.join

.. autoapi-nested-parse::

   Base join operations for intersects.


Functions
---------

.. autoapisummary::

   agglovar.bed.join.pairwise_join
   agglovar.bed.join.pairwise_join_iter


Module Contents
---------------

.. py:function:: pairwise_join(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: Optional[agglovar.bed.col.CoordCol | Iterable[str] | str] = None, col_names_b: Optional[agglovar.bed.col.CoordCol | Iterable[str] | str] = None, temp_dir: bool | str | pathlib.Path = False) -> polars.LazyFrame

   Join two tables.

   Thin wrapper around :func:`pairwise_join_iter` that concatenates all yielded
   chunks into a single table.

   Returns a table with columns:

   * index_a: Index in table a.
   * index_b: Index in table b.
   * chrom: Chromosome matched.
   * pos: Start position of intersection.
   * end: End position of intersection.
   * distance: Distance between the two intervals with negative values
     representing overlapping intervals.

   Note that if padding is greater than 0, the "pos" and "end" will have been
   modified to include padding.

   :param df_a: Table a.
   :param df_b: Table b.
   :param distance: Maximum distance between two records. May be negative to
       force overlapping.
   :param chunk_size: Chunk A by this size per chromosome to bound the IEJoin
       working set.
   :param col_names_a: Columns to select from `df_a` if not None, otherwise,
       use object defaults.
   :param col_names_b: Columns to select from `df_b` if not None, otherwise,
       use object defaults.
   :param temp_dir: How to materialise the prepared tables before the chunked
       loop. ``False`` (default) collects both into memory; ``True`` writes
       them to the system temp directory as parquet files; a ``str``/``Path``
       writes them to that directory. Temp files are always removed on exit.

   :return: A LazyFrame with the joined tables.


.. py:function:: pairwise_join_iter(df_a: polars.LazyFrame | polars.DataFrame, df_b: polars.LazyFrame | polars.DataFrame, distance: int = 0, chunk_size: int = CHUNK_SIZE, col_names_a: Optional[agglovar.bed.col.CoordCol | Iterable[str] | str] = None, col_names_b: Optional[agglovar.bed.col.CoordCol | Iterable[str] | str] = None, temp_dir: bool | str | pathlib.Path = False) -> Iterator[polars.LazyFrame]

   Join two tables, yielding one LazyFrame per chunk.

   Returns chunks with the same columns as :func:`pairwise_join`. At least one
   chunk is always yielded; an empty schema-only frame is yielded when no chunk
   would otherwise have been produced (so callers can safely call ``pl.concat``
   on the result).

   Each chunk is the result of a per-chromosome ``join_where`` (Polars IEJoin)
   over a bounded slice of A and the chrom-matched B records pre-filtered to
   that slice's range.

   :param df_a: Table a.
   :param df_b: Table b.
   :param distance: Maximum distance between two records. May be negative to
       force overlapping.
   :param chunk_size: Chunk A by this size per chromosome to bound the IEJoin
       working set.
   :param col_names_a: Columns to select from `df_a` if not None, otherwise,
       use object defaults.
   :param col_names_b: Columns to select from `df_b` if not None, otherwise,
       use object defaults.
   :param temp_dir: How to materialise the prepared tables before the chunked
       loop. See :func:`pairwise_join`.

   :return: An iterator of LazyFrames.
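
The join semantics described for these functions can be illustrated with a
small standard-library sketch. It is not the agglovar implementation, which
uses Polars ``join_where`` per chromosome. Everything in it is hypothetical:
the ``Interval`` type, ``interval_distance``, ``naive_pairwise_join_iter``,
the half-open BED-style coordinates, and the distance formula (assumed to be
the gap between intervals, negative by the overlap length when they overlap).
It also chunks table A globally rather than per chromosome and omits the
``pos`` and ``end`` output columns.

.. code-block:: python

   from typing import Iterator, NamedTuple


   class Interval(NamedTuple):
       """One record; half-open BED-style coordinates are an assumption."""

       chrom: str
       pos: int
       end: int


   def interval_distance(a: Interval, b: Interval) -> int:
       """Gap between two intervals on the same chromosome.

       Assumed definition: negative when the intervals overlap (magnitude is
       the overlap length), zero when they abut, positive when separated.
       """
       return max(a.pos, b.pos) - min(a.end, b.end)


   def naive_pairwise_join_iter(
       table_a: list[Interval],
       table_b: list[Interval],
       distance: int = 0,
       chunk_size: int = 2,
   ) -> Iterator[list[dict]]:
       """Yield one list of match records per chunk of table A (toy version)."""
       yielded = False
       for start in range(0, len(table_a), chunk_size):
           chunk = []
           # enumerate(..., start) keeps index_a relative to the full table.
           for index_a, a in enumerate(table_a[start:start + chunk_size], start):
               for index_b, b in enumerate(table_b):
                   if a.chrom != b.chrom:
                       continue
                   d = interval_distance(a, b)
                   if d <= distance:
                       chunk.append({
                           "index_a": index_a,
                           "index_b": index_b,
                           "chrom": a.chrom,
                           "distance": d,
                       })
           yielded = True
           yield chunk
       if not yielded:
           # Mirror the documented guarantee: at least one chunk is yielded.
           yield []


   table_a = [
       Interval("chr1", 100, 200),
       Interval("chr1", 500, 600),
       Interval("chr2", 0, 50),
   ]
   table_b = [
       Interval("chr1", 150, 250),
       Interval("chr1", 610, 700),
       Interval("chr2", 100, 120),
   ]

   # Flatten the chunks, analogous to concatenating the yielded frames.
   records = [
       rec
       for chunk in naive_pairwise_join_iter(table_a, table_b, distance=10)
       for rec in chunk
   ]

With ``distance=10`` the sketch pairs the first records of each table (they
overlap by 50 bases, so the distance is -50) and the second records (a gap of
exactly 10). Under the assumed formula, a negative ``distance`` such as ``-60``
would reject both pairs, since it requires at least 60 bases of overlap.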