agglovar.vcf
Routines for converting between VCF format and variant tables.
Attributes
VCF_SAMPLE_FIXED_SCHEMA – Fixed-column schema for the sample table returned by iter_vcf().
VCF_SOURCE – Default source string written to VCF headers.
Map VCF INFO/FORMAT type strings to Polars data types.
VCF_VERSION – VCF format version supported by agglovar.
Classes
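AltHeader – A single ##ALT header record.
ContigHeader – A single ##contig header record.
FilterHeader – A single ##FILTER header record.
FormatHeader – A single ##FORMAT header record.
InfoHeader – A single ##INFO header record.
VcfBatch – A batch of classified variants from iter_vcf().
VcfHeader – Mutable, validated VCF file header.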
Functions
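iter_vcf() – Iterate over a VCF/BCF file, yielding VcfBatch objects.
polars_to_vcf_type() – Return a VCF type string for a Polars data type.
read_vcf_header() – Build a VcfHeader from an open pysam VariantFile.
write_vcf() – Write a per-allele Polars variant table to a VCF 4.2 file.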
Package Contents
- class agglovar.vcf.AltHeader
Bases: _VcfRecord
A single ##ALT header record.
- Parameters:
id – Symbolic allele identifier without angle brackets (e.g. 'INS', 'DEL').
description – Human-readable description.
- class agglovar.vcf.ContigHeader
Bases: _VcfRecord
A single ##contig header record.
- Parameters:
id – Contig name.
length – Sequence length in bases, or None if not specified.
url – URL referencing the sequence, or None.
md5 – MD5 checksum of the sequence, or None.
assembly – Assembly identifier, or None.
- class agglovar.vcf.FilterHeader
Bases: _VcfRecord
A single ##FILTER header record.
- Parameters:
id – Filter identifier (e.g. 'PASS', 'LowQual').
description – Human-readable description.
- class agglovar.vcf.FormatHeader
Bases: _VcfRecord
A single ##FORMAT header record.
- Parameters:
id – Field identifier (e.g. 'GT', 'DP').
number – Number of values (same encoding as InfoHeader).
type – VCF type string (same encoding as InfoHeader).
description – Human-readable description.
- class agglovar.vcf.InfoHeader
Bases: _VcfRecord
A single ##INFO header record.
- Parameters:
id – Field identifier.
number – Number of values: a non-negative int, or one of 'A', 'R', 'G', or '.' for per-allele, per-allele-including-ref, per-genotype, and variable, respectively.
type – VCF type string: one of 'Integer', 'Float', 'Flag', 'Character', 'String'.
description – Human-readable description.
source – Optional source annotation (VCF 4.2 §1.4.1).
version – Optional version annotation (VCF 4.2 §1.4.1).
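As a rough illustration of how these parameters correspond to a header line in a VCF file, the sketch below formats an ##INFO line in the VCF 4.2 layout. The helper format_info_line is hypothetical and written only for this example; it is not part of agglovar, and agglovar's own serialization may differ in details such as quoting or escaping.

```python
def format_info_line(id, number, type, description, source=None, version=None):
    """Format an ``##INFO`` header line (hypothetical helper, not agglovar API)."""
    # ID, Number and Type are bare values; Description is a quoted string.
    parts = [
        f'ID={id}',
        f'Number={number}',
        f'Type={type}',
        f'Description="{description}"',
    ]
    # Source and Version are optional quoted strings.
    if source is not None:
        parts.append(f'Source="{source}"')
    if version is not None:
        parts.append(f'Version="{version}"')
    return '##INFO=<' + ','.join(parts) + '>'


print(format_info_line('SVTYPE', 1, 'String', 'Variant type'))
# prints: ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Variant type">
```

The same layout applies to ##FORMAT lines, which carry ID, Number, Type, and Description but no Source or Version.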
- class agglovar.vcf.VcfBatch(header: agglovar.vcf._vcf_header.VcfHeader, snv: polars.LazyFrame, insdel: polars.LazyFrame, inv: polars.LazyFrame, dup: polars.LazyFrame, sub: polars.LazyFrame, cpx: polars.LazyFrame, ignored: polars.LazyFrame, sample_table: polars.LazyFrame, _counts: dict[str, int])
A batch of classified variants from iter_vcf().
Per-type tables are polars.LazyFrame objects that can be filtered or projected before collecting with .collect(). Within each batch the rows are sorted by chrom, pos, end, and id, but sort order is not guaranteed across batches from the same file.
Each base table contains the columns defined in agglovar.schema.STANDARD_FIELDS for that type, followed by the raw VCF columns vcf_pos, vcf_id, vcf_ref, vcf_alt, vcf_qual, vcf_rec, and vcf_allele, then any vcf_info_* columns declared in the VCF header.
- Attr header:
VCF header parsed from the source file.
- Attr snv:
SNV variants (vartype SNV).
- Attr insdel:
Insertion and deletion variants (vartype INS or DEL).
- Attr inv:
Inversion variants (vartype INV).
- Attr dup:
Duplication variants (vartype DUP).
- Attr sub:
Multi-nucleotide substitutions (vartype SUB); varlen is len(ref) + len(alt).
- Attr cpx:
Complex variants (vartype CPX).
- Attr ignored:
Records that could not be routed to a type table: BND breakends, TRA, unrecognized symbolic types, and classification errors. The vcf_ignored column contains the reason; vartype is preserved for known types (BND, TRA) and null otherwise.
- Attr sample_table:
Long-format genotype data: one row per (vcf_rec, vcf_sample) pair, for records that produced at least one base-table row.
- __slots__ = ('header', 'snv', 'insdel', 'inv', 'dup', 'sub', 'cpx', 'ignored', 'sample_table', '_counts')
- cpx
- dup
- header
- ignored
- insdel
- inv
- sample_table
- snv
- sub
- class agglovar.vcf.VcfHeader(vcf_version: str = VCF_VERSION, file_date: str | None = None, source: str | None = VCF_SOURCE, reference: str | None = None, warn_invalid: bool = True)
Mutable, validated VCF file header.
All mutating methods validate their input immediately. Invalid field values (bad Number or Type strings) are stored with fallback values rather than raising, so broken VCF files can be read without crashing. Each stored record tracks its own errors via the errors tuple and is_valid property. The header-level is_valid and validation_errors aggregate across all records, making it easy to gate writes on validity.
The default constructor produces a minimal valid header with no fields:
    from agglovar.vcf import VcfHeader

    hdr = VcfHeader()
    hdr.add_contig('chr1', length=248_956_422)
    hdr.add_info('SVTYPE', number=1, type='String', description='Variant type')
    hdr.add_sample('NA12878')
    print(hdr)             # VCF header text
    hdr.is_valid           # True / False
    hdr.validation_errors  # list of error strings
- Parameters:
vcf_version – VCF spec version (default '4.2').
file_date – Date string YYYYMMDD. Defaults to today when None.
source – Value for ##source. Pass an empty string or None to suppress.
reference – Value for ##reference. Pass None to suppress.
warn_invalid – When True (default), invalid field values emit a UserWarning. Set to False to suppress these warnings (e.g. when bulk-reading known-broken files).
- __repr__() str
Return a compact summary of the header contents.
Shows the VCF version, the number of contigs, and the IDs of the other key fields:
    VcfHeader(version='4.2', contigs=25, info=['SVTYPE', 'SVLEN', 'END'], filters=['PASS'], format=['GT', 'GQ'], samples=['NA12878'])
- __str__() str
Return the complete VCF header as a string.
Each line ends with \n, including the column-header line, so the result can be written directly to a file with file.write(str(hdr)).
- add_alt(id: str, *, description: str, extra: dict[str, str] | None = None) None
Add an ##ALT record.
- Raises:
ValueError – If an ALT with this id already exists.
- add_contig(id: str, *, length: int | None = None, url: str | None = None, md5: str | None = None, assembly: str | None = None, extra: dict[str, str] | None = None) None
Add a ##contig record.
- Raises:
ValueError – If a contig with this id already exists.
- add_filter(id: str, *, description: str, extra: dict[str, str] | None = None) None
Add a ##FILTER record.
- Raises:
ValueError – If a FILTER with this id already exists.
- add_format(id: str, *, number: int | str, type: str, description: str, extra: dict[str, str] | None = None) None
Add a ##FORMAT record.
Invalid number or type values are stored with fallback values ('.' and 'String' respectively) and a UserWarning is emitted unless warn_invalid is False.
- Raises:
ValueError – If a FORMAT field with this id already exists.
- add_info(id: str, *, number: int | str, type: str, description: str, source: str | None = None, version: str | None = None, extra: dict[str, str] | None = None) None
Add an ##INFO record.
Invalid number or type values are stored with fallback values ('.' and 'String' respectively) and a UserWarning is emitted unless warn_invalid is False.
- Raises:
ValueError – If an INFO field with this id already exists.
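The fallback behaviour described for invalid number and type values can be sketched roughly as follows. This is a minimal stand-alone illustration, not agglovar's implementation: the helper check_number_and_type, the two constant sets, and the exact wording of the error strings are assumptions made for the example.

```python
import warnings

# Values accepted by the VCF 4.2 INFO/FORMAT definitions.
VALID_TYPES = {'Integer', 'Float', 'Flag', 'Character', 'String'}
VALID_NUMBER_CODES = {'A', 'R', 'G', '.'}


def check_number_and_type(number, type, warn_invalid=True):
    """Return (number, type, errors), substituting fallbacks for invalid values.

    Hypothetical helper for illustration only; not part of agglovar.
    """
    errors = []
    number_str = str(number)
    # Number must be a non-negative integer or one of the special codes.
    is_count = number_str.isascii() and number_str.isdigit()
    if not (is_count or number_str in VALID_NUMBER_CODES):
        errors.append(f"invalid Number value '{number_str}'")
        number_str = '.'          # fallback: variable number of values
    if type not in VALID_TYPES:
        errors.append(f"invalid Type value '{type}'")
        type = 'String'           # fallback: treat the field as text
    if errors and warn_invalid:
        warnings.warn('; '.join(errors), UserWarning)
    return number_str, type, tuple(errors)


# A broken definition is kept with fallback values instead of raising.
number, vcf_type, errors = check_number_and_type(-1, 'Text', warn_invalid=False)
print(number, vcf_type, errors)
```

Returning the errors alongside the stored values is what lets a caller aggregate them later, in the spirit of the validation_errors property.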
- add_meta(key: str, value: str) None
Append an arbitrary ##key=value metadata line.
Structured fields (INFO, FORMAT, FILTER, contig, ALT) should be added via their dedicated methods. Duplicate keys are allowed since multiple lines with the same key are legal in VCF (they are written in insertion order).
- add_sample(name: str) None
Append a sample name.
- Raises:
ValueError – If name duplicates an existing sample or conflicts with a VCF fixed column name (#CHROM, POS, etc.).
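The two rejection rules can be illustrated with a short stand-alone sketch. The function checked_add_sample and the VCF_FIXED_COLUMNS tuple are hypothetical names used only here; the tuple lists the fixed columns of the VCF column-header line, and the exact set agglovar treats as reserved is an assumption.

```python
# Fixed columns of the VCF column-header line (assumed reserved set).
VCF_FIXED_COLUMNS = (
    '#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT',
)


def checked_add_sample(samples, name):
    """Append name to samples, rejecting duplicates and fixed column names.

    Hypothetical helper for illustration only; not part of agglovar.
    """
    if name in samples:
        raise ValueError(f'duplicate sample name: {name!r}')
    if name in VCF_FIXED_COLUMNS:
        raise ValueError(f'sample name conflicts with a VCF fixed column: {name!r}')
    samples.append(name)


samples = []
checked_add_sample(samples, 'NA12878')
try:
    checked_add_sample(samples, 'POS')
except ValueError as exc:
    print(exc)
```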
- remove_sample(name: str) None
Remove a sample by name.
- Raises:
ValueError – If name is not present.
- property contigs: list[ContigHeader]
Ordered list of contig records (read-only view).
- property filters: list[FilterHeader]
Ordered list of FILTER records (read-only view).
- property formats: list[FormatHeader]
Ordered list of FORMAT field records (read-only view).
- property info: list[InfoHeader]
Ordered list of INFO field records (read-only view).
- property meta: list[tuple[str, str]]
Ordered (key, value) pairs for arbitrary ##key=value metadata lines.
- property validation_errors: list[str]
Flat list of validation error strings for all invalid records.
Each entry is prefixed with the field type and ID for easy identification, e.g.
"INFO 'SVTYPE': invalid Number value '-1'"
- property warn_invalid: bool
When True, adding a field with invalid values emits a UserWarning.
- agglovar.vcf.iter_vcf(path: str | pathlib.Path, *, batch_size: int = 10000, chrom: str | None = None, start: int | None = None, end: int | None = None) Generator[VcfBatch, None, None]
Iterate over a VCF/BCF file, yielding VcfBatch objects.
Each batch contains one polars.LazyFrame per variant type, an ignored frame, and a sample_table frame. The caller may filter or project any frame before calling .collect().
Batches contain at least batch_size base-table rows, but records are never split: a batch may exceed batch_size when the final record has multiple ALT alleles. The last batch may be smaller. At least one batch is always yielded, even for empty files.
Rows within each batch are sorted by chrom, pos, end, and id. Sort order is not guaranteed across batches.
BND breakend alleles (containing [ or ]) and unrecognized symbolic types are silently routed to the ignored table rather than emitted as base-table rows.
- Parameters:
path – Path to a .vcf, .vcf.gz, or .bcf file.
batch_size – Target number of base-table rows per batch.
chrom – Restrict to records on this chromosome. Requires a tabix/CSI index for efficient access; falls back to a linear scan for unindexed files.
start – 0-based half-open BED start bound. Requires chrom.
end – 0-based half-open BED end bound. Requires chrom.
- Raises:
FileNotFoundError – If path does not exist.
ValueError – If start/end is given without chrom, start > end, or a FORMAT field name collides with a reserved column name.
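The batching rule described for iter_vcf() (close a batch only at a record boundary, and always yield at least one batch) can be sketched with plain Python. This is an illustration of the rule under simplifying assumptions, not agglovar's code: batch_rows is a hypothetical function, and each row is modelled as a (record_index, allele_index) tuple in file order.

```python
from itertools import groupby


def batch_rows(rows, batch_size):
    """Yield lists of rows, closing a batch only after a complete record.

    Hypothetical helper for illustration only; not part of agglovar.
    """
    batch = []
    yielded = False
    # Consecutive rows with the same record index belong to one VCF record.
    for _, record_rows in groupby(rows, key=lambda row: row[0]):
        batch.extend(record_rows)        # keep all alleles of a record together
        if len(batch) >= batch_size:     # may overshoot for multi-allelic records
            yield batch
            yielded = True
            batch = []
    # Emit the smaller final batch, or an empty one if nothing was yielded.
    if batch or not yielded:
        yield batch


rows = [(0, 0), (1, 0), (1, 1), (2, 0)]
print(list(batch_rows(rows, batch_size=2)))
# prints: [[(0, 0), (1, 0), (1, 1)], [(2, 0)]]
```

With batch_size=2 the first batch holds three rows because record 1 has two alleles and is not split, and the final batch holds the single remaining row.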
- agglovar.vcf.polars_to_vcf_type(dtype: PolarsDataType) str
Return a VCF type string for a Polars data type.
All integer types map to "Integer", all float types map to "Float", polars.Boolean maps to "Flag", and string-like types (polars.String, polars.Categorical, polars.Enum) map to "String".
- Parameters:
dtype – A Polars data type (class or instance).
- Returns:
One of "Integer", "Float", "Flag", or "String".
- Raises:
TypeError – If dtype has no corresponding VCF type.
- agglovar.vcf.read_vcf_header(vcf_file: pysam.VariantFile) VcfHeader
Build a VcfHeader from an open pysam VariantFile.
Iterates vcf_file.header.records to preserve the original field order and to obtain raw string values (avoiding pysam’s numeric conversions for Number). Sample names are taken from vcf_file.header.samples.
- Parameters:
vcf_file – An open pysam.VariantFile.
- Returns:
A VcfHeader populated from the file’s header.
- Raises:
ValueError – If the VCF version is missing or malformed.
- agglovar.vcf.write_vcf(df: polars.DataFrame | polars.LazyFrame, path: str | pathlib.Path, *, ref_fasta: str | pathlib.Path | None = None, alt_format: Literal['seq', 'symbolic'] = 'seq', sample_name: str | None = None, source: str = VCF_SOURCE) None
Write a per-allele Polars variant table to a VCF 4.2 file.
Rows sharing the same vcf_rec are reassembled into a single multi-allelic VCF record with ALT alleles emitted in vcf_allele order. vcf_info_* columns are written back to INFO fields; the VCF header is rebuilt from column dtypes via agglovar.vcf.polars_to_vcf_type().
- Parameters:
df – Variant table conforming to agglovar.vcf.VCF_BASE_FIXED_SCHEMA (plus any vcf_info_* columns). polars.LazyFrame inputs are collected internally.
path – Destination path.
ref_fasta – Path to an indexed reference FASTA. Required when alt_format is 'symbolic' and REF context must be recovered from the reference.
alt_format – How to emit ALT alleles:
- 'seq' (default): write the literal sequence from vcf_alt.
- 'symbolic': write symbolic alleles (<INS>, <DEL>, <INV>, <DUP>) with inserted/deleted sequence in INFO/SEQ.
sample_name – Sample column name in the output FORMAT block.
source – Value for the ##source= header line.
- Raises:
ValueError – If required columns are missing or alt_format constraints are violated.
FileNotFoundError – If ref_fasta is given but does not exist.
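The reassembly of per-allele rows into multi-allelic records can be sketched with plain dictionaries. This is a simplified illustration, not how write_vcf() is implemented: group_alleles is a hypothetical function, rows are modelled as dicts rather than a Polars table, and it assumes that vcf_rec and vcf_allele are integers and that all rows of one record share chrom, vcf_pos, and vcf_ref.

```python
from collections import defaultdict


def group_alleles(rows):
    """Merge per-allele rows into one record per vcf_rec.

    Hypothetical helper for illustration only; not part of agglovar.
    """
    by_rec = defaultdict(list)
    for row in rows:
        by_rec[row['vcf_rec']].append(row)

    records = []
    for rec in sorted(by_rec):
        # ALT alleles are emitted in vcf_allele order.
        alleles = sorted(by_rec[rec], key=lambda r: r['vcf_allele'])
        first = alleles[0]
        records.append({
            'chrom': first['chrom'],
            'pos': first['vcf_pos'],
            'ref': first['vcf_ref'],
            'alt': ','.join(a['vcf_alt'] for a in alleles),
        })
    return records


rows = [
    {'chrom': 'chr1', 'vcf_rec': 1, 'vcf_allele': 1, 'vcf_pos': 200, 'vcf_ref': 'G', 'vcf_alt': 'C'},
    {'chrom': 'chr1', 'vcf_rec': 0, 'vcf_allele': 2, 'vcf_pos': 100, 'vcf_ref': 'A', 'vcf_alt': 'T'},
    {'chrom': 'chr1', 'vcf_rec': 0, 'vcf_allele': 1, 'vcf_pos': 100, 'vcf_ref': 'A', 'vcf_alt': 'AC'},
]
for record in group_alleles(rows):
    print(record)
```

Here the two rows with vcf_rec 0 collapse into one record whose ALT field is AC,T, regardless of the order in which the rows arrive.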
- agglovar.vcf.VCF_SAMPLE_FIXED_SCHEMA: dict[str, PolarsDataType]
Fixed-column schema for the sample table returned by agglovar.vcf.iter_vcf().
The sample table uses long format: one row per (vcf_rec, sample_name) pair. Dynamic FORMAT columns (e.g. GT, GQ) for every FORMAT field declared in the VCF header are appended after these fixed columns.
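To make the long format concrete, the sketch below flattens per-record genotype data into one row per record and sample, with the FORMAT fields as extra keys. It uses plain dicts instead of Polars, to_long_rows is a hypothetical function, and the sample column is called vcf_sample here as an assumption; the actual fixed column names are those defined in VCF_SAMPLE_FIXED_SCHEMA.

```python
def to_long_rows(samples, records):
    """Flatten per-record genotype data into one row per (record, sample).

    Hypothetical helper for illustration only; not part of agglovar.
    """
    rows = []
    for vcf_rec, per_sample in enumerate(records):
        for sample, fields in zip(samples, per_sample):
            # Fixed columns first, then the dynamic FORMAT fields.
            rows.append({'vcf_rec': vcf_rec, 'vcf_sample': sample, **fields})
    return rows


samples = ['NA12878', 'NA12891']
records = [
    [{'GT': '0/1', 'GQ': 50}, {'GT': '0/0', 'GQ': 40}],
    [{'GT': '1/1', 'GQ': 30}, {'GT': '0/1', 'GQ': 20}],
]
long_rows = to_long_rows(samples, records)
for row in long_rows:
    print(row)
```

Two records with two samples each produce four rows, which is the shape that makes per-sample filtering a simple row filter.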