agglovar.vcf
============

.. py:module:: agglovar.vcf

.. autoapi-nested-parse::

   Routines for converting between VCF format and variant tables.


Attributes
----------

.. autoapisummary::

   agglovar.vcf.VCF_SAMPLE_FIXED_SCHEMA
   agglovar.vcf.VCF_SOURCE
   agglovar.vcf.VCF_TO_POLARS_TYPE
   agglovar.vcf.VCF_VERSION


Classes
-------

.. autoapisummary::

   agglovar.vcf.AltHeader
   agglovar.vcf.ContigHeader
   agglovar.vcf.FilterHeader
   agglovar.vcf.FormatHeader
   agglovar.vcf.InfoHeader
   agglovar.vcf.VcfBatch
   agglovar.vcf.VcfHeader


Functions
---------

.. autoapisummary::

   agglovar.vcf.iter_vcf
   agglovar.vcf.polars_to_vcf_type
   agglovar.vcf.read_vcf_header
   agglovar.vcf.write_vcf


Package Contents
----------------

.. py:class:: AltHeader

   Bases: :py:obj:`_VcfRecord`

   A single ``##ALT`` header record.

   :param id: Symbolic allele identifier without angle brackets (e.g. ``'INS'``, ``'DEL'``).
   :param description: Human-readable description.


   .. py:method:: __repr__() -> str

      Return a debug representation of this ALT header.


   .. py:method:: __str__() -> str

      Return the VCF-formatted ``##ALT`` header line.


   .. py:attribute:: description
      :type: str


.. py:class:: ContigHeader

   Bases: :py:obj:`_VcfRecord`

   A single ``##contig`` header record.

   :param id: Contig name.
   :param length: Sequence length in bases, or ``None`` if not specified.
   :param url: URL referencing the sequence, or ``None``.
   :param md5: MD5 checksum of the sequence, or ``None``.
   :param assembly: Assembly identifier, or ``None``.


   .. py:method:: __post_init__() -> None

      Validate fields after dataclass initialization.


   .. py:method:: __repr__() -> str

      Return a debug representation of this contig header.


   .. py:method:: __str__() -> str

      Return the VCF-formatted ``##contig`` header line.


   .. py:attribute:: assembly
      :type: Optional[str]
      :value: None


   .. py:attribute:: length
      :type: Optional[int]
      :value: None


   .. py:attribute:: md5
      :type: Optional[str]
      :value: None


   .. py:attribute:: url
      :type: Optional[str]
      :value: None


.. py:class:: FilterHeader

   Bases: :py:obj:`_VcfRecord`

   A single ``##FILTER`` header record.

   :param id: Filter identifier (e.g. ``'PASS'``, ``'LowQual'``).
   :param description: Human-readable description.


   .. py:method:: __repr__() -> str

      Return a debug representation of this FILTER header.


   .. py:method:: __str__() -> str

      Return the VCF-formatted ``##FILTER`` header line.


   .. py:attribute:: description
      :type: str


.. py:class:: FormatHeader

   Bases: :py:obj:`_VcfRecord`

   A single ``##FORMAT`` header record.

   :param id: Field identifier (e.g. ``'GT'``, ``'DP'``).
   :param number: Number of values (same encoding as :class:`InfoHeader`).
   :param type: VCF type string (same encoding as :class:`InfoHeader`).
   :param description: Human-readable description.


   .. py:method:: __post_init__() -> None

      Validate ``number`` and ``type`` fields after dataclass initialization.


   .. py:method:: __repr__() -> str

      Return a debug representation of this FORMAT header.


   .. py:method:: __str__() -> str

      Return the VCF-formatted ``##FORMAT`` header line.


   .. py:attribute:: description
      :type: str


   .. py:attribute:: number
      :type: int | str


   .. py:attribute:: type
      :type: str


.. py:class:: InfoHeader

   Bases: :py:obj:`_VcfRecord`

   A single ``##INFO`` header record.

   :param id: Field identifier.
   :param number: Number of values: a non-negative ``int``, or one of ``'A'``, ``'R'``, ``'G'``, or ``'.'``
       for per-allele, per-allele-including-ref, per-genotype, and variable, respectively.
   :param type: VCF type string: one of ``'Integer'``, ``'Float'``, ``'Flag'``, ``'Character'``, ``'String'``.
   :param description: Human-readable description.
   :param source: Optional source annotation (VCF 4.2 §1.4.1).
   :param version: Optional version annotation (VCF 4.2 §1.4.1).


   .. py:method:: __post_init__() -> None

      Validate ``number`` and ``type`` fields after dataclass initialization.


   .. py:method:: __repr__() -> str

      Return a debug representation of this INFO header.


   .. py:method:: __str__() -> str

      Return the VCF-formatted ``##INFO`` header line.


   .. py:attribute:: description
      :type: str


   .. py:attribute:: number
      :type: int | str


   .. py:attribute:: source
      :type: Optional[str]
      :value: None


   .. py:attribute:: type
      :type: str


   .. py:attribute:: version
      :type: Optional[str]
      :value: None


.. py:class:: VcfBatch(header: agglovar.vcf._vcf_header.VcfHeader, snv: polars.LazyFrame, insdel: polars.LazyFrame, inv: polars.LazyFrame, dup: polars.LazyFrame, sub: polars.LazyFrame, cpx: polars.LazyFrame, ignored: polars.LazyFrame, sample_table: polars.LazyFrame, _counts: dict[str, int])

   A batch of classified variants from :func:`iter_vcf`.

   Per-type tables are :class:`polars.LazyFrame` objects that can be filtered or projected
   before collecting with ``.collect()``. Within each batch the rows are sorted by ``chrom``,
   ``pos``, ``end``, and ``id``, but sort order is not guaranteed across batches from the
   same file.

   Each base table contains the columns defined in :data:`agglovar.schema.STANDARD_FIELDS`
   for that type, followed by the raw VCF columns ``vcf_pos``, ``vcf_id``, ``vcf_ref``,
   ``vcf_alt``, ``vcf_qual``, ``vcf_rec``, and ``vcf_allele``, then any ``vcf_info_*``
   columns declared in the VCF header.

   :attr header: VCF header parsed from the source file.
   :attr snv: SNV variants (vartype ``SNV``).
   :attr insdel: Insertion and deletion variants (vartype ``INS`` or ``DEL``).
   :attr inv: Inversion variants (vartype ``INV``).
   :attr dup: Duplication variants (vartype ``DUP``).
   :attr sub: Multi-nucleotide substitutions (vartype ``SUB``); ``varlen`` is ``len(ref) + len(alt)``.
   :attr cpx: Complex variants (vartype ``CPX``).
   :attr ignored: Records that could not be routed to a type table: BND breakends, TRA,
       unrecognized symbolic types, and classification errors. The ``vcf_ignored`` column
       contains the reason; ``vartype`` is preserved for known types (``BND``, ``TRA``) and
       is ``null`` otherwise.
   :attr sample_table: Long-format genotype data with one row per ``(vcf_rec, vcf_sample)``
       pair, for records that produced at least one base-table row.


   .. py:method:: __repr__() -> str

      Return a debug summary including per-type record counts.


   .. py:attribute:: __slots__
      :value: ('header', 'snv', 'insdel', 'inv', 'dup', 'sub', 'cpx', 'ignored', 'sample_table', '_counts')


   .. py:attribute:: cpx


   .. py:attribute:: dup


   .. py:attribute:: header


   .. py:attribute:: ignored


   .. py:attribute:: insdel


   .. py:attribute:: inv


   .. py:attribute:: sample_table


   .. py:attribute:: snv


   .. py:attribute:: sub


.. py:class:: VcfHeader(vcf_version: str = VCF_VERSION, file_date: Optional[str] = None, source: Optional[str] = VCF_SOURCE, reference: Optional[str] = None, warn_invalid: bool = True)

   Mutable, validated VCF file header.

   All mutations validate their input immediately. Invalid field values (bad Number or
   Type strings) are stored with fallback values rather than raising, so broken VCF files
   can be read without crashing. Each stored record tracks its own errors via the
   ``errors`` tuple and ``is_valid`` property. The header-level :attr:`is_valid` and
   :attr:`validation_errors` aggregate across all records, making it easy to gate writes
   on validity.

   The default constructor produces a minimal valid header with no fields::

       hdr = VcfHeader()
       hdr.add_contig('chr1', length=248_956_422)
       hdr.add_info('SVTYPE', number=1, type='String', description='Variant type')
       hdr.add_sample('NA12878')
       print(hdr)               # VCF header text
       hdr.is_valid             # True / False
       hdr.validation_errors    # list of error strings

   :param vcf_version: VCF spec version (default ``'4.2'``).
   :param file_date: Date string ``YYYYMMDD``. Defaults to today when ``None``.
   :param source: Value for ``##source``. Pass an empty string or ``None`` to suppress.
   :param reference: Value for ``##reference``. Pass ``None`` to suppress.
   :param warn_invalid: When ``True`` (default), invalid field values emit a
       :class:`UserWarning`. Set to ``False`` to suppress these warnings (e.g. when
       bulk-reading known-broken files).


   .. py:method:: __repr__() -> str

      Return a compact summary of the header contents.

      Shows the VCF version and the counts or IDs of key fields::

          VcfHeader(version='4.2', contigs=25, info=['SVTYPE', 'SVLEN', 'END'],
                    filters=['PASS'], format=['GT', 'GQ'], samples=['NA12878'])


   .. py:method:: __str__() -> str

      Return the complete VCF header as a string.

      Each line ends with ``\n``, including the column-header line, so the result can be
      written directly to a file with ``file.write(str(hdr))``.


   .. py:method:: add_alt(id: str, *, description: str, extra: Optional[dict[str, str]] = None) -> None

      Add an ``##ALT`` record.

      :raises ValueError: If an ALT with this *id* already exists.


   .. py:method:: add_contig(id: str, *, length: Optional[int] = None, url: Optional[str] = None, md5: Optional[str] = None, assembly: Optional[str] = None, extra: Optional[dict[str, str]] = None) -> None

      Add a ``##contig`` record.

      :raises ValueError: If a contig with this *id* already exists.


   .. py:method:: add_filter(id: str, *, description: str, extra: Optional[dict[str, str]] = None) -> None

      Add a ``##FILTER`` record.

      :raises ValueError: If a FILTER with this *id* already exists.


   .. py:method:: add_format(id: str, *, number: int | str, type: str, description: str, extra: Optional[dict[str, str]] = None) -> None

      Add a ``##FORMAT`` record.

      Invalid *number* or *type* values are stored with fallback values (``'.'`` and
      ``'String'`` respectively) and a :class:`UserWarning` is emitted unless
      ``warn_invalid`` is ``False``.

      :raises ValueError: If a FORMAT field with this *id* already exists.


   .. py:method:: add_info(id: str, *, number: int | str, type: str, description: str, source: Optional[str] = None, version: Optional[str] = None, extra: Optional[dict[str, str]] = None) -> None

      Add an ``##INFO`` record.

      Invalid *number* or *type* values are stored with fallback values (``'.'`` and
      ``'String'`` respectively) and a :class:`UserWarning` is emitted unless
      ``warn_invalid`` is ``False``.

      :raises ValueError: If an INFO field with this *id* already exists.


   .. py:method:: add_meta(key: str, value: str) -> None

      Append an arbitrary ``##key=value`` metadata line.

      Structured fields (INFO, FORMAT, FILTER, contig, ALT) should be added via their
      dedicated methods. Duplicate keys are allowed because multiple lines with the same
      key are legal in VCF; they are written in insertion order.


   .. py:method:: add_sample(name: str) -> None

      Append a sample name.

      :raises ValueError: If *name* duplicates an existing sample or conflicts with a VCF
          fixed column name (``#CHROM``, ``POS``, etc.).


   .. py:method:: remove_alt(id: str) -> None

      Remove an ALT record by *id*.

      :raises KeyError: If *id* is not present.


   .. py:method:: remove_contig(id: str) -> None

      Remove a contig record by *id*.

      :raises KeyError: If *id* is not present.


   .. py:method:: remove_filter(id: str) -> None

      Remove a FILTER record by *id*.

      :raises KeyError: If *id* is not present.


   .. py:method:: remove_format(id: str) -> None

      Remove a FORMAT record by *id*.

      :raises KeyError: If *id* is not present.


   .. py:method:: remove_info(id: str) -> None

      Remove an INFO record by *id*.

      :raises KeyError: If *id* is not present.


   .. py:method:: remove_sample(name: str) -> None

      Remove a sample by name.

      :raises ValueError: If *name* is not present.


   .. py:property:: alts
      :type: list[AltHeader]

      Ordered list of ALT records (read-only view).


   .. py:property:: contigs
      :type: list[ContigHeader]

      Ordered list of contig records (read-only view).


   .. py:property:: file_date
      :type: Optional[str]

      File date string (``YYYYMMDD``) or ``None``.


   .. py:property:: filters
      :type: list[FilterHeader]

      Ordered list of FILTER records (read-only view).


   .. py:property:: formats
      :type: list[FormatHeader]

      Ordered list of FORMAT field records (read-only view).


   .. py:property:: info
      :type: list[InfoHeader]

      Ordered list of INFO field records (read-only view).


   .. py:property:: is_valid
      :type: bool

      ``True`` if every stored record passes its own validation.


   .. py:property:: meta
      :type: list[tuple[str, str]]

      Ordered ``(key, value)`` pairs for arbitrary ``##key=value`` metadata lines.


   .. py:property:: reference
      :type: Optional[str]

      ``##reference`` value, or ``None`` to suppress the line.


   .. py:property:: samples
      :type: list[str]

      Ordered list of sample names (read-only view).


   .. py:property:: source
      :type: Optional[str]

      ``##source`` value, or ``None`` to suppress the line.


   .. py:property:: validation_errors
      :type: list[str]

      Flat list of validation error strings for all invalid records.

      Each entry is prefixed with the field type and ID for easy identification,
      e.g. ``"INFO 'SVTYPE': invalid Number value '-1'"``.


   .. py:property:: vcf_version
      :type: str

      VCF format version string (e.g. ``'4.2'``).


   .. py:property:: warn_invalid
      :type: bool

      When ``True``, adding a field with invalid values emits a :class:`UserWarning`.


.. py:function:: iter_vcf(path: str | pathlib.Path, *, batch_size: int = 10000, chrom: Optional[str] = None, start: Optional[int] = None, end: Optional[int] = None) -> Generator[VcfBatch, None, None]

   Iterate over a VCF/BCF file, yielding :class:`VcfBatch` objects.

   Each batch contains one :class:`polars.LazyFrame` per variant type, an ``ignored``
   frame, and a ``sample_table`` frame. The caller may filter or project any frame before
   calling ``.collect()``.

   Batches contain at least *batch_size* base-table rows, but records are never split: a
   batch may exceed *batch_size* when the final record has multiple ALT alleles. The last
   batch may be smaller. At least one batch is always yielded, even for empty files.

   Rows within each batch are sorted by ``chrom``, ``pos``, ``end``, and ``id``. Sort
   order is not guaranteed across batches.

   BND breakend alleles (containing ``[`` or ``]``) and unrecognized symbolic types are
   silently routed to the ``ignored`` table rather than emitted as base-table rows.

   :param path: Path to a ``.vcf``, ``.vcf.gz``, or ``.bcf`` file.
   :param batch_size: Target number of base-table rows per batch.
   :param chrom: Restrict to records on this chromosome. Requires a tabix/CSI index for
       efficient access; falls back to a linear scan for unindexed files.
   :param start: 0-based half-open BED start bound. Requires *chrom*.
   :param end: 0-based half-open BED end bound. Requires *chrom*.

   :raises FileNotFoundError: If *path* does not exist.
   :raises ValueError: If *start*/*end* is given without *chrom*, *start* > *end*, or a
       FORMAT field name collides with a reserved column name.


.. py:function:: polars_to_vcf_type(dtype: PolarsDataType) -> str

   Return a VCF type string for a Polars data type.

   All integer types map to ``"Integer"``, all float types map to ``"Float"``,
   :class:`polars.Boolean` maps to ``"Flag"``, and string-like types
   (:class:`polars.String`, :class:`polars.Categorical`, :class:`polars.Enum`) map to
   ``"String"``.

   :param dtype: A Polars data type (class or instance).

   :returns: One of ``"Integer"``, ``"Float"``, ``"Flag"``, or ``"String"``.

   :raises TypeError: If ``dtype`` has no corresponding VCF type.


.. py:function:: read_vcf_header(vcf_file: pysam.VariantFile) -> VcfHeader

   Build a :class:`VcfHeader` from an open pysam :class:`~pysam.VariantFile`.

   Iterates ``vcf_file.header.records`` to preserve the original field order and to
   obtain raw string values (avoiding pysam's numeric conversions for ``Number``). Sample
   names are taken from ``vcf_file.header.samples``.

   :param vcf_file: An open :class:`pysam.VariantFile`.

   :returns: A :class:`VcfHeader` populated from the file's header.

   :raises ValueError: If the VCF version is missing or malformed.


.. py:function:: write_vcf(df: polars.DataFrame | polars.LazyFrame, path: str | pathlib.Path, *, ref_fasta: str | pathlib.Path | None = None, alt_format: Literal['seq', 'symbolic'] = 'seq', sample_name: str | None = None, source: str = VCF_SOURCE) -> None

   Write a per-allele Polars variant table to a VCF 4.2 file.

   Rows sharing the same ``vcf_rec`` are reassembled into a single multi-allelic VCF
   record with ALT alleles emitted in ``vcf_allele`` order. ``vcf_info_*`` columns are
   written back to INFO fields; the VCF header is rebuilt from column dtypes via
   :func:`agglovar.vcf.polars_to_vcf_type`.

   :param df: Variant table conforming to :data:`agglovar.vcf.VCF_BASE_FIXED_SCHEMA`
       (plus any ``vcf_info_*`` columns). :class:`polars.LazyFrame` inputs are collected
       internally.
   :param path: Destination path.
   :param ref_fasta: Path to an indexed reference FASTA. Required when ``alt_format`` is
       ``'symbolic'`` and REF context must be recovered from the reference.
   :param alt_format: How to emit ALT alleles:

       * ``'seq'`` (default): write the literal sequence from ``vcf_alt``.
       * ``'symbolic'``: write symbolic alleles (for example ``<INS>`` and ``<DEL>``)
         with inserted/deleted sequence in ``INFO/SEQ``.
   :param sample_name: Sample column name in the output FORMAT block.
   :param source: Value for the ``##source=`` header line.

   :raises ValueError: If required columns are missing or ``alt_format`` constraints are
       violated.
   :raises FileNotFoundError: If ``ref_fasta`` is given but does not exist.


.. py:data:: VCF_SAMPLE_FIXED_SCHEMA
   :type: dict[str, PolarsDataType]

   Fixed-column schema for the sample table returned by :func:`agglovar.vcf.iter_vcf`.

   The sample table uses long format: one row per ``(vcf_rec, sample_name)`` pair.
   Dynamic FORMAT columns (e.g. ``GT``, ``GQ``) for every FORMAT field declared in the
   VCF header are appended after these fixed columns.


.. py:data:: VCF_SOURCE
   :type: str
   :value: 'agglovar'

   Default source string written to VCF headers.


.. py:data:: VCF_TO_POLARS_TYPE
   :type: dict[str, PolarsDataType]

   Map VCF INFO/FORMAT type strings to Polars data types.


.. py:data:: VCF_VERSION
   :type: str
   :value: '4.2'

   VCF format version supported by agglovar.
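
The fallback behaviour documented for ``VcfHeader.add_info`` and ``VcfHeader.add_format`` (invalid ``number`` or ``type`` values replaced by ``'.'`` and ``'String'``, with an optional :class:`UserWarning`) can be illustrated with a small standalone sketch. This is not agglovar's implementation: ``normalize_field`` is a hypothetical helper written only for this example, and accepting digit-only strings such as ``'2'`` for ``number`` is an assumption rather than documented behaviour.

```python
import warnings

# Allowed values, taken from the InfoHeader parameter descriptions.
VALID_NUMBER_CODES = {"A", "R", "G", "."}
VALID_TYPES = {"Integer", "Float", "Flag", "Character", "String"}


def normalize_field(number, type_, *, warn_invalid=True):
    """Hypothetical helper: return (number, type, errors) with fallbacks applied."""
    errors = []

    if isinstance(number, bool):
        # bool is a subclass of int; treat it as invalid here.
        number_ok = False
    elif isinstance(number, int):
        number_ok = number >= 0
    elif isinstance(number, str):
        # Assumption: digit-only strings (e.g. '2') are accepted as counts.
        number_ok = number in VALID_NUMBER_CODES or (
            number.isascii() and number.isdigit()
        )
    else:
        number_ok = False

    if not number_ok:
        errors.append(f"invalid Number value {number!r}")
        number = "."  # documented fallback for Number

    if not (isinstance(type_, str) and type_ in VALID_TYPES):
        errors.append(f"invalid Type value {type_!r}")
        type_ = "String"  # documented fallback for Type

    if errors and warn_invalid:
        warnings.warn("; ".join(errors), UserWarning)

    return number, type_, tuple(errors)


# Both values are invalid, so both fallbacks apply; warnings are suppressed.
number, type_, errors = normalize_field(-1, "Text", warn_invalid=False)
print(number, type_, errors)
```

Returning the errors alongside the normalized values mirrors the documented idea that each record tracks its own errors, so a caller can decide whether to gate a write on an empty error tuple.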