agglovar.vcf

Routines for converting between VCF format and variant tables.

Attributes

VCF_SAMPLE_FIXED_SCHEMA

Fixed-column schema for the sample table returned by agglovar.vcf.iter_vcf().

VCF_SOURCE

Default source string written to VCF headers.

VCF_TO_POLARS_TYPE

Map VCF INFO/FORMAT type strings to Polars data types.

VCF_VERSION

VCF format version supported by agglovar.

Classes

AltHeader

A single ##ALT header record.

ContigHeader

A single ##contig header record.

FilterHeader

A single ##FILTER header record.

FormatHeader

A single ##FORMAT header record.

InfoHeader

A single ##INFO header record.

VcfBatch

A batch of classified variants from iter_vcf().

VcfHeader

Mutable, validated VCF file header.

Functions

iter_vcf(→ Generator[VcfBatch, None, None])

Iterate over a VCF/BCF file, yielding VcfBatch objects.

polars_to_vcf_type(→ str)

Return a VCF type string for a Polars data type.

read_vcf_header(→ VcfHeader)

Build a VcfHeader from an open pysam VariantFile.

write_vcf(→ None)

Write a per-allele Polars variant table to a VCF 4.2 file.

Package Contents

class agglovar.vcf.AltHeader

Bases: _VcfRecord

A single ##ALT header record.

Parameters:
  • id – Symbolic allele identifier without angle brackets (e.g. 'INS', 'DEL').

  • description – Human-readable description.

__repr__() str

Return a debug representation of this ALT header.

__str__() str

Return the VCF-formatted ##ALT header line.

description: str
class agglovar.vcf.ContigHeader

Bases: _VcfRecord

A single ##contig header record.

Parameters:
  • id – Contig name.

  • length – Sequence length in bases, or None if not specified.

  • url – URL referencing the sequence, or None.

  • md5 – MD5 checksum of the sequence, or None.

  • assembly – Assembly identifier, or None.

__post_init__() None

Validate fields after dataclass initialization.

__repr__() str

Return a debug representation of this contig header.

__str__() str

Return the VCF-formatted ##contig header line.

assembly: str | None = None
length: int | None = None
md5: str | None = None
url: str | None = None
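
As a rough illustration of how the optional fields could show up in the rendered line, the sketch below formats a ##contig record and leaves out any field that is None. The helper name format_contig_line and the key order are assumptions made for this example; they are not taken from agglovar's implementation.

```python
from __future__ import annotations


def format_contig_line(
    id: str,
    length: int | None = None,
    url: str | None = None,
    md5: str | None = None,
    assembly: str | None = None,
) -> str:
    """Render a ##contig header line (hypothetical helper, not agglovar API)."""
    # Key order is an assumption; only fields that are set are written.
    fields = [
        ("ID", id),
        ("length", length),
        ("assembly", assembly),
        ("md5", md5),
        ("URL", url),
    ]
    body = ",".join(f"{key}={value}" for key, value in fields if value is not None)
    return f"##contig=<{body}>"


print(format_contig_line("chr1", length=248_956_422))
# ##contig=<ID=chr1,length=248956422>
```
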
class agglovar.vcf.FilterHeader

Bases: _VcfRecord

A single ##FILTER header record.

Parameters:
  • id – Filter identifier (e.g. 'PASS', 'LowQual').

  • description – Human-readable description.

__repr__() str

Return a debug representation of this FILTER header.

__str__() str

Return the VCF-formatted ##FILTER header line.

description: str
class agglovar.vcf.FormatHeader

Bases: _VcfRecord

A single ##FORMAT header record.

Parameters:
  • id – Field identifier (e.g. 'GT', 'DP').

  • number – Number of values (same encoding as InfoHeader).

  • type – VCF type string (same encoding as InfoHeader).

  • description – Human-readable description.

__post_init__() None

Validate number and type fields after dataclass initialization.

__repr__() str

Return a debug representation of this FORMAT header.

__str__() str

Return the VCF-formatted ##FORMAT header line.

description: str
number: int | str
type: str
class agglovar.vcf.InfoHeader

Bases: _VcfRecord

A single ##INFO header record.

Parameters:
  • id – Field identifier.

  • number – Number of values: a non-negative int, or one of 'A' (one value per ALT allele), 'R' (one value per allele, including REF), 'G' (one value per genotype), or '.' (variable or unknown).

  • type – VCF type string: one of 'Integer', 'Float', 'Flag', 'Character', 'String'.

  • description – Human-readable description.

  • source – Optional source annotation (VCF 4.2 §1.4.1).

  • version – Optional version annotation (VCF 4.2 §1.4.1).

__post_init__() None

Validate number and type fields after dataclass initialization.

__repr__() str

Return a debug representation of this INFO header.

__str__() str

Return the VCF-formatted ##INFO header line.

description: str
number: int | str
source: str | None = None
type: str
version: str | None = None
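
The number and type rules listed above can be expressed as two small predicates. This is a minimal sketch under stated assumptions: the function names are hypothetical, and accepting digit-only strings such as '1' for Number is an assumption of this example rather than documented agglovar behaviour.

```python
VALID_TYPES = {"Integer", "Float", "Flag", "Character", "String"}
SPECIAL_NUMBERS = {"A", "R", "G", "."}


def is_valid_number(number: object) -> bool:
    """Return True for a non-negative int or one of 'A', 'R', 'G', '.'."""
    if isinstance(number, bool):
        # bool is a subclass of int; reject it explicitly.
        return False
    if isinstance(number, int):
        return number >= 0
    if isinstance(number, str):
        # Assumption: digit-only strings (e.g. '1') are accepted as counts.
        return number in SPECIAL_NUMBERS or (number.isascii() and number.isdigit())
    return False


def is_valid_type(type_str: object) -> bool:
    """Return True for one of the five VCF type strings."""
    return isinstance(type_str, str) and type_str in VALID_TYPES


print(is_valid_number("R"), is_valid_number(-1), is_valid_type("Flag"))
```
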
class agglovar.vcf.VcfBatch(header: agglovar.vcf._vcf_header.VcfHeader, snv: polars.LazyFrame, insdel: polars.LazyFrame, inv: polars.LazyFrame, dup: polars.LazyFrame, sub: polars.LazyFrame, cpx: polars.LazyFrame, ignored: polars.LazyFrame, sample_table: polars.LazyFrame, _counts: dict[str, int])

A batch of classified variants from iter_vcf().

Per-type tables are polars.LazyFrame objects that can be filtered or projected before collecting with .collect(). Within each batch the rows are sorted by chrom, pos, end, and id, but sort order is not guaranteed across batches from the same file.

Each base table contains the columns defined in agglovar.schema.STANDARD_FIELDS for that type, followed by the raw VCF columns vcf_pos, vcf_id, vcf_ref, vcf_alt, vcf_qual, vcf_rec, and vcf_allele, then any vcf_info_* columns declared in the VCF header.

Attributes:
  • header – VCF header parsed from the source file.

  • snv – SNV variants (vartype SNV).

  • insdel – Insertion and deletion variants (vartype INS or DEL).

  • inv – Inversion variants (vartype INV).

  • dup – Duplication variants (vartype DUP).

  • sub – Multi-nucleotide substitutions (vartype SUB); varlen is len(ref) + len(alt).

  • cpx – Complex variants (vartype CPX).

  • ignored – Records that could not be routed to a type table: BND breakends, TRA, unrecognized symbolic types, and classification errors. The vcf_ignored column contains the reason; vartype is preserved for known types (BND, TRA) and null otherwise.

  • sample_table – Long-format genotype data: one row per (vcf_rec, vcf_sample) pair, for records that produced at least one base-table row.
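
To make the long format of sample_table concrete, the sketch below flattens nested per-record, per-sample genotype data into one row per (vcf_rec, vcf_sample) pair. It uses plain dictionaries rather than Polars so it runs on its own; the input shape, the integer record index, and the helper name to_long_format are assumptions for illustration only.

```python
from __future__ import annotations


def to_long_format(
    genotypes: dict[int, dict[str, dict[str, object]]],
) -> list[dict[str, object]]:
    """Flatten {vcf_rec: {sample: {FORMAT field: value}}} into long-format rows."""
    rows = []
    for vcf_rec, per_sample in genotypes.items():
        for vcf_sample, fields in per_sample.items():
            # One output row per (record, sample) pair; FORMAT fields follow.
            rows.append({"vcf_rec": vcf_rec, "vcf_sample": vcf_sample, **fields})
    return rows


wide = {
    0: {"NA12878": {"GT": "0/1", "GQ": 40}, "NA12891": {"GT": "0/0", "GQ": 35}},
    1: {"NA12878": {"GT": "1/1", "GQ": 50}, "NA12891": {"GT": "0/1", "GQ": 20}},
}
rows = to_long_format(wide)
for row in rows:
    print(row)
```

Two records with two samples each give four rows, which is the shape the genotype table takes before any dynamic FORMAT columns are considered.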

__repr__() str

Return a debug summary including per-type record counts.

__slots__ = ('header', 'snv', 'insdel', 'inv', 'dup', 'sub', 'cpx', 'ignored', 'sample_table', '_counts')
cpx
dup
header
ignored
insdel
inv
sample_table
snv
sub
class agglovar.vcf.VcfHeader(vcf_version: str = VCF_VERSION, file_date: str | None = None, source: str | None = VCF_SOURCE, reference: str | None = None, warn_invalid: bool = True)

Mutable, validated VCF file header.

All mutations validate their input immediately. Invalid field values (bad Number or Type strings) are stored with fallback values rather than raising, so broken VCF files can be read without crashing. Each stored record tracks its own errors via the errors tuple and is_valid property. The header-level is_valid and validation_errors aggregate across all records, making it easy to gate writes on validity.

The default constructor produces a minimal valid header with no fields:

hdr = VcfHeader()
hdr.add_contig('chr1', length=248_956_422)
hdr.add_info('SVTYPE', number=1, type='String', description='Variant type')
hdr.add_sample('NA12878')
print(hdr)              # VCF header text
hdr.is_valid            # True / False
hdr.validation_errors   # list of error strings

Parameters:
  • vcf_version – VCF spec version (default '4.2').

  • file_date – Date string YYYYMMDD. Defaults to today when None.

  • source – Value for ##source. Pass an empty string or None to suppress.

  • reference – Value for ##reference. Pass None to suppress.

  • warn_invalid – When True (default), invalid field values emit a UserWarning. Set to False to suppress these warnings (e.g. when bulk-reading known-broken files).

__repr__() str

Return a compact summary of the header contents.

Shows the VCF version, the number of contigs, and the IDs of the INFO, FILTER, and FORMAT fields and samples:

VcfHeader(version='4.2', contigs=25, info=['SVTYPE', 'SVLEN', 'END'],
          filters=['PASS'], format=['GT', 'GQ'], samples=['NA12878'])

__str__() str

Return the complete VCF header as a string.

Each line ends with \n, including the column-header line, so the result can be written directly to a file with file.write(str(hdr)).

add_alt(id: str, *, description: str, extra: dict[str, str] | None = None) None

Add an ##ALT record.

Raises:

ValueError – If an ALT with this id already exists.

add_contig(id: str, *, length: int | None = None, url: str | None = None, md5: str | None = None, assembly: str | None = None, extra: dict[str, str] | None = None) None

Add a ##contig record.

Raises:

ValueError – If a contig with this id already exists.

add_filter(id: str, *, description: str, extra: dict[str, str] | None = None) None

Add a ##FILTER record.

Raises:

ValueError – If a FILTER with this id already exists.

add_format(id: str, *, number: int | str, type: str, description: str, extra: dict[str, str] | None = None) None

Add a ##FORMAT record.

Invalid number or type values are stored with fallback values ('.' and 'String' respectively) and a UserWarning is emitted unless warn_invalid is False.

Raises:

ValueError – If a FORMAT field with this id already exists.

add_info(id: str, *, number: int | str, type: str, description: str, source: str | None = None, version: str | None = None, extra: dict[str, str] | None = None) None

Add an ##INFO record.

Invalid number or type values are stored with fallback values ('.' and 'String' respectively) and a UserWarning is emitted unless warn_invalid is False.

Raises:

ValueError – If an INFO field with this id already exists.
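
The store-with-fallback behaviour described for add_info() and add_format() might look roughly like the sketch below: invalid values are replaced by '.' and 'String', the problems are recorded, and a UserWarning is emitted unless warnings are switched off. The function coerce_field and the wording of the Type message are hypothetical; only the fallback values and the Number message format come from this page.

```python
import warnings

VALID_TYPES = ("Integer", "Float", "Flag", "Character", "String")
SPECIAL_NUMBERS = ("A", "R", "G", ".")


def coerce_field(kind, id, number, type_str, warn_invalid=True):
    """Return (number, type, errors), substituting fallbacks for invalid values."""
    errors = []
    is_count = isinstance(number, int) and not isinstance(number, bool) and number >= 0
    if not (is_count or number in SPECIAL_NUMBERS):
        errors.append(f"{kind} {id!r}: invalid Number value {str(number)!r}")
        number = "."  # fallback for Number
    if type_str not in VALID_TYPES:
        errors.append(f"{kind} {id!r}: invalid Type value {str(type_str)!r}")
        type_str = "String"  # fallback for Type
    if warn_invalid:
        for message in errors:
            warnings.warn(message, UserWarning)
    return number, type_str, tuple(errors)


# Warnings are suppressed here so the demonstration prints only the result.
print(coerce_field("INFO", "SVTYPE", -1, "String", warn_invalid=False))
```

Returning the errors alongside the stored values mirrors the idea that each record tracks its own validity, so a caller can still decide whether to write the header.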

add_meta(key: str, value: str) None

Append an arbitrary ##key=value metadata line.

Structured fields (INFO, FORMAT, FILTER, contig, ALT) should be added via their dedicated methods. Duplicate keys are allowed since multiple lines with the same key are legal in VCF (they are written in insertion order).

add_sample(name: str) None

Append a sample name.

Raises:

ValueError – If name duplicates an existing sample or conflicts with a VCF fixed column name (#CHROM, POS, etc.).
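
The two checks named above can be sketched as follows. The fixed column names are those defined by the VCF specification; the helper check_sample_name and its error messages are hypothetical and only illustrate the rule.

```python
from __future__ import annotations

# Fixed columns of the VCF column-header line.
VCF_FIXED_COLUMNS = (
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
)


def check_sample_name(name: str, existing: list[str]) -> None:
    """Raise ValueError if name is a duplicate or shadows a fixed column."""
    if name in existing:
        raise ValueError(f"duplicate sample name: {name!r}")
    if name in VCF_FIXED_COLUMNS:
        raise ValueError(f"sample name conflicts with a fixed VCF column: {name!r}")


samples: list[str] = []
check_sample_name("NA12878", samples)  # passes silently
samples.append("NA12878")
```
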

remove_alt(id: str) None

Remove an ALT record by id.

Raises:

KeyError – If id is not present.

remove_contig(id: str) None

Remove a contig record by id.

Raises:

KeyError – If id is not present.

remove_filter(id: str) None

Remove a FILTER record by id.

Raises:

KeyError – If id is not present.

remove_format(id: str) None

Remove a FORMAT record by id.

Raises:

KeyError – If id is not present.

remove_info(id: str) None

Remove an INFO record by id.

Raises:

KeyError – If id is not present.

remove_sample(name: str) None

Remove a sample by name.

Raises:

ValueError – If name is not present.

property alts: list[AltHeader]

Ordered list of ALT records (read-only view).

property contigs: list[ContigHeader]

Ordered list of contig records (read-only view).

property file_date: str | None

File date string (YYYYMMDD) or None.

property filters: list[FilterHeader]

Ordered list of FILTER records (read-only view).

property formats: list[FormatHeader]

Ordered list of FORMAT field records (read-only view).

property info: list[InfoHeader]

Ordered list of INFO field records (read-only view).

property is_valid: bool

True if every stored record passes its own validation.

property meta: list[tuple[str, str]]

Ordered (key, value) pairs for arbitrary ##key=value metadata lines.

property reference: str | None

##reference value, or None to suppress the line.

property samples: list[str]

Ordered list of sample names (read-only view).

property source: str | None

##source value, or None to suppress the line.

property validation_errors: list[str]

Flat list of validation error strings for all invalid records.

Each entry is prefixed with the field type and ID for easy identification, e.g. "INFO 'SVTYPE': invalid Number value '-1'".

property vcf_version: str

VCF format version string (e.g. '4.2').

property warn_invalid: bool

When True, adding a field with invalid values emits a UserWarning.

agglovar.vcf.iter_vcf(path: str | pathlib.Path, *, batch_size: int = 10000, chrom: str | None = None, start: int | None = None, end: int | None = None) Generator[VcfBatch, None, None]

Iterate over a VCF/BCF file, yielding VcfBatch objects.

Each batch contains one polars.LazyFrame per variant type, an ignored frame, and a sample_table frame. The caller may filter or project any frame before calling .collect().

Each batch other than the last contains at least batch_size base-table rows. Records are never split, so a batch may exceed batch_size when its final record has multiple ALT alleles. The last batch may be smaller. At least one batch is always yielded, even for empty files.

Rows within each batch are sorted by chrom, pos, end, and id. Sort order is not guaranteed across batches.

BND breakend alleles (containing [ or ]), TRA records, unrecognized symbolic types, and records that fail classification are silently routed to the ignored table rather than emitted as base-table rows.
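
The batching rule (records are never split, and at least one batch is always produced) can be sketched with a plain generator. This is an illustration under stated assumptions, not agglovar's code: each input record is modelled as a list of its ALT alleles, one base-table row per allele, and batch_records is a hypothetical name.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


def batch_records(records: Iterable[list[str]], batch_size: int) -> Iterator[list[str]]:
    """Yield batches of per-allele rows without splitting any record."""
    batch: list[str] = []
    yielded = False
    for alt_alleles in records:
        # All rows of one record are added together, so a record is never split.
        batch.extend(alt_alleles)
        if len(batch) >= batch_size:
            yield batch
            yielded = True
            batch = []
    # Emit the remainder, or a single empty batch if nothing was yielded.
    if batch or not yielded:
        yield batch


print(list(batch_records([["A"], ["C", "G"], ["T"]], batch_size=2)))
# [['A', 'C', 'G'], ['T']]
print(list(batch_records([], batch_size=2)))
# [[]]
```

The first batch holds three rows although batch_size is 2, because the second record contributes two alleles that must stay together.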

Parameters:
  • path – Path to a .vcf, .vcf.gz, or .bcf file.

  • batch_size – Target number of base-table rows per batch.

  • chrom – Restrict to records on this chromosome. A tabix/CSI index is used for efficient access when available; unindexed files fall back to a linear scan.

  • start – 0-based half-open BED start bound. Requires chrom.

  • end – 0-based half-open BED end bound. Requires chrom.

Raises:
  • FileNotFoundError – If path does not exist.

  • ValueError – If start/end is given without chrom, start > end, or a FORMAT field name collides with a reserved column name.
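
The argument checks on chrom, start, and end listed under Raises could be written along these lines. The helper check_region and its messages are hypothetical; the sketch covers only the region rules, not the FORMAT name collision check.

```python
from __future__ import annotations


def check_region(chrom: str | None, start: int | None, end: int | None) -> None:
    """Validate a (chrom, start, end) region given in 0-based half-open coordinates."""
    if chrom is None and (start is not None or end is not None):
        raise ValueError("start/end requires chrom")
    if start is not None and end is not None and start > end:
        raise ValueError(f"start ({start}) is greater than end ({end})")


check_region(None, None, None)       # whole file
check_region("chr1", None, None)     # whole chromosome
check_region("chr1", 1_000, 2_000)   # bounded region
```
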

agglovar.vcf.polars_to_vcf_type(dtype: PolarsDataType) str

Return a VCF type string for a Polars data type.

All integer types map to "Integer", all float types map to "Float", polars.Boolean maps to "Flag", and string-like types (polars.String, polars.Categorical, polars.Enum) map to "String".

Parameters:

dtype – A Polars data type (class or instance).

Returns:

One of "Integer", "Float", "Flag", or "String".

Raises:

TypeError – If dtype has no corresponding VCF type.
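
The mapping rule can be sketched without importing Polars by keying on dtype names. This is a stand-in only: the real function takes Polars data types, and the name vcf_type_for_dtype_name and the exact set of integer and float names listed here are assumptions of this example.

```python
# Stand-in sketch: Polars dtypes are represented by their names as strings.
INTEGER_NAMES = {
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
}
FLOAT_NAMES = {"Float32", "Float64"}
STRING_NAMES = {"String", "Categorical", "Enum"}


def vcf_type_for_dtype_name(name: str) -> str:
    """Return the VCF type string for a dtype name, or raise TypeError."""
    if name in INTEGER_NAMES:
        return "Integer"
    if name in FLOAT_NAMES:
        return "Float"
    if name == "Boolean":
        return "Flag"
    if name in STRING_NAMES:
        return "String"
    raise TypeError(f"no VCF type for dtype {name!r}")


print(vcf_type_for_dtype_name("UInt32"), vcf_type_for_dtype_name("Boolean"))
# Integer Flag
```
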

agglovar.vcf.read_vcf_header(vcf_file: pysam.VariantFile) VcfHeader

Build a VcfHeader from an open pysam VariantFile.

Iterates vcf_file.header.records to preserve the original field order and to obtain raw string values (avoiding pysam’s numeric conversions for Number). Sample names are taken from vcf_file.header.samples.

Parameters:

vcf_file – An open pysam.VariantFile.

Returns:

A VcfHeader populated from the file’s header.

Raises:

ValueError – If the VCF version is missing or malformed.

agglovar.vcf.write_vcf(df: polars.DataFrame | polars.LazyFrame, path: str | pathlib.Path, *, ref_fasta: str | pathlib.Path | None = None, alt_format: Literal['seq', 'symbolic'] = 'seq', sample_name: str | None = None, source: str = VCF_SOURCE) None

Write a per-allele Polars variant table to a VCF 4.2 file.

Rows sharing the same vcf_rec are reassembled into a single multi-allelic VCF record with ALT alleles emitted in vcf_allele order. vcf_info_* columns are written back to INFO fields; the VCF header is rebuilt from column dtypes via agglovar.vcf.polars_to_vcf_type().

Parameters:
  • df – Variant table conforming to agglovar.vcf.VCF_BASE_FIXED_SCHEMA (plus any vcf_info_* columns). polars.LazyFrame inputs are collected internally.

  • path – Destination path.

  • ref_fasta – Path to an indexed reference FASTA. Required when alt_format is 'symbolic' and REF context must be recovered from the reference.

  • alt_format

    How to emit ALT alleles:

    • 'seq' (default) — write the literal sequence from vcf_alt.

    • 'symbolic' — write symbolic alleles (<INS>, <DEL>, <INV>, <DUP>) with inserted/deleted sequence in INFO/SEQ.

  • sample_name – Sample column name in the output FORMAT block.

  • source – Value for the ##source= header line.

Raises:
  • ValueError – If required columns are missing or alt_format constraints are violated.

  • FileNotFoundError – If ref_fasta is given but does not exist.
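
The reassembly step, in which rows sharing a vcf_rec become one multi-allelic record with ALT alleles in vcf_allele order, can be sketched with plain dictionaries. The column names come from this page; the helper reassemble, the integer values used for vcf_rec and vcf_allele, and the assumption that all rows of a record share the same REF are illustrative only.

```python
from itertools import groupby
from operator import itemgetter


def reassemble(rows):
    """Merge per-allele rows into one record per vcf_rec (hypothetical helper)."""
    # Sort so that rows of a record are adjacent and alleles are in order.
    ordered = sorted(rows, key=itemgetter("vcf_rec", "vcf_allele"))
    records = []
    for _, group in groupby(ordered, key=itemgetter("vcf_rec")):
        alleles = list(group)
        first = alleles[0]
        records.append({
            "chrom": first["chrom"],
            "pos": first["vcf_pos"],
            "ref": first["vcf_ref"],  # assumes one shared REF per record
            "alt": ",".join(allele["vcf_alt"] for allele in alleles),
        })
    return records


rows = [
    {"chrom": "chr1", "vcf_rec": 0, "vcf_allele": 2, "vcf_pos": 100, "vcf_ref": "A", "vcf_alt": "T"},
    {"chrom": "chr1", "vcf_rec": 0, "vcf_allele": 1, "vcf_pos": 100, "vcf_ref": "A", "vcf_alt": "G"},
    {"chrom": "chr1", "vcf_rec": 1, "vcf_allele": 1, "vcf_pos": 250, "vcf_ref": "C", "vcf_alt": "CAT"},
]
records = reassemble(rows)
for record in records:
    print(record)
```

The two rows of record 0 arrive out of order and are still written as ALT G,T, since ordering follows vcf_allele rather than input position.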

agglovar.vcf.VCF_SAMPLE_FIXED_SCHEMA: dict[str, PolarsDataType]

Fixed-column schema for the sample table returned by agglovar.vcf.iter_vcf().

The sample table uses long format: one row per (vcf_rec, sample_name) pair. Dynamic FORMAT columns (e.g. GT, GQ) for every FORMAT field declared in the VCF header are appended after these fixed columns.

agglovar.vcf.VCF_SOURCE: str = 'agglovar'

Default source string written to VCF headers.

agglovar.vcf.VCF_TO_POLARS_TYPE: dict[str, PolarsDataType]

Map VCF INFO/FORMAT type strings to Polars data types.

agglovar.vcf.VCF_VERSION: str = '4.2'

VCF format version supported by agglovar.