Improved error handling for mrrc in v0.8.2

mrrc v0.8.2 is out! The release notes are pretty lengthy. There's a lot of what I hope will be good stuff in there. v0.8.2 in particular included a lot of housekeeping - simplifying CI, dealing with some API asymmetries, tidying up docs, and generally making it harder to repeat some of the mistakes I've made without overburdening CI.

But the biggest thread through the 0.8 series has been a set of improvements for error handling:

detailed info about record errors available through the exception mechanism
explicit re-use and preservation of the pymarc exception hierarchy
extension of the pymarc exception hierarchy with more-detailed types
a defined set of error types and matching codes for easy reference/lookup
full docs on the above

Unless I'm missing something, I suspect that some of this is new for MARC libraries and might be very helpful for managing bibliographic workflows that occasionally run into rogue records.

There are two new docs at Error handling and Error codes which cover the basics and details, respectively. Because this is all so new, though, I thought it might help to introduce them by asking the bot to work up an additional set of detailed practical examples of how to get more value out of the new error-handling approach.

Another set of improvements in this release has to do with dipping a toe into some light formal methods. I'll write more about that later, and in the meantime you can find more details in the docs. For now, it seems worth noting that some of the lower-level errors in the new hierarchy came out of initial uses of the new testing methods.

One fun thing to note: I've been sitting on this post for a while because when I originally went to publish v0.8, and was all excited about the new error handling approach, I was shocked to find (after not watching the diffs very closely during one long "let's just get it done" prompting session) that although the error model was fully designed and documented, the library didn't actually wire up error handling! It just wasn't in there. I spent so much time thinking about how it should feel from a dev experience perspective that I forgot to make sure that it actually works! Totally in keeping with the agentic coding experience, and I should've known better.

Once I recognized my failure - in part by having the bot draft up what's below the following examples and having it tell me "hey wait it looks like something's missing" - I had a detailed code review process generate a ton of beads/issues to clean up, so that we had a clear target to aim for in getting the whole thing working. The v0.8.1 release was largely about actually wiring up the errors throughout the library. A lot of other things popped up in the process, and I deferred those to v0.8.2. So, now I get to finalize this writeup and post it. Phew.

Check your bot output, friends.

Everything below this paragraph was written by claude/opus. Some of it is a little ... fanciful, but hopefully it's still usefully illustrative.

When records break: two libraries

Real harvests break in a handful of predictable ways — a record cut off mid-transfer, a leader whose declared length is wrong, a subfield carrying bytes that aren't valid UTF-8. Both libraries are forgiving by default: pymarc yields None for a record it can't parse and stashes the failure on reader.current_exception; mrrc's default permissive mode does much the same (yielding None, or a record with diagnostics on record.errors). The difference shows up when you ask mrrc to be strict — it raises a typed exception, reusing pymarc's class names so existing except clauses keep working, and its detailed() tells you exactly what broke and where.

A truncated record

A delivery gets cut off mid-record — a common FTP or concatenation mishap. pymarc flags it on the reader:

>>> reader = pymarc.MARCReader(open("harvest.mrc", "rb"))
>>> [r is not None for r in reader]
[True, False]
>>> reader.current_exception
TruncatedRecord()
>>> print(reader.current_exception)
Record length in leader is greater than the length of data

mrrc, with recovery_mode="strict", raises the same-named exception with the specifics filled in:

>>> reader = mrrc.MARCReader("harvest.mrc", recovery_mode="strict")
>>> try:
...     for record in reader: pass
... except mrrc.TruncatedRecord as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E005 truncated_record
TruncatedRecord at record 2
  length:          expected 57 bytes, found 6
  byte offset:     0x6C (108) in stream
  record-relative: byte 24

A bad record length

A hand-edited or naively concatenated file leaves a leader whose first five bytes aren't a valid length. pymarc again surfaces it on the reader:

>>> reader = pymarc.MARCReader(open("harvest.mrc", "rb"))
>>> next(reader) is None
True
>>> print(reader.current_exception)
Invalid record length in first 5 bytes of record

mrrc points straight at the offending bytes:

>>> reader = mrrc.MARCReader("harvest.mrc", recovery_mode="strict")
>>> try: next(reader)
... except mrrc.RecordLengthInvalid as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E001 record_length_invalid
RecordLengthInvalid at record 1
  byte offset: 0x0 (0) in stream

bytes near offset 0x0:
    0x0000:  41 42 43 44 45 6e 61 6d  20 20 32 32 30 30 30 34 |ABCDEnam  220004|
             ^^ offending byte

Invalid encoding

The most common gremlin of all: a subfield carries a byte that isn't valid UTF-8 — legacy MARC-8 data, a vendor's encoding slip. Here both libraries are lenient by default; each substitutes a replacement character and moves on (the exact substitution differs):

>>> next(pymarc.MARCReader(open("harvest.mrc", "rb")))["245"]["a"]
'Caf ̌Society'
>>> next(mrrc.MARCReader("harvest.mrc"))["245"]["a"]   # U+FFFD
'Caf� Society'

Encoding isn't a structural problem, so — unlike truncation — pymarc doesn't record it on current_exception; the substituted record just flows through. If you want it flagged, mrrc's strict_marc validation turns it into an error with the byte pinpointed:

>>> reader = mrrc.MARCReader("harvest.mrc",
...                          recovery_mode="strict",
...                          validation_level="strict_marc")
>>> try: next(reader)
... except mrrc.MrrcException as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E301 utf8_invalid
EncodingError at record 1, field 245
  001:         ocm01234567
  byte offset: 0x3D (61) in stream

bytes near offset 0x3D:
    0x002D:  30 31 32 1e 6f 63 6d 30  31 32 33 34 35 36 37 1e |012.ocm01234567.|
    0x003D:  31 30 1f 61 43 61 66 e9  20 53 6f 63 69 65 74 79 |10.aCaf. Society|
             ^^ offending byte

Across all three, the exception class names line up with pymarc's (TruncatedRecord, RecordLengthInvalid, and the rest), so a pymarc-style except ports straight over. What mrrc adds is the detailed() view — record number, byte offset, expected-vs-found, and a hex dump you can paste into a Slack thread — and the stable e.code / e.slug shown above (E005, E001, E301), which never get renumbered across releases.

Carve out the bad record without re-parsing the harvest

record_byte_offset and byte_offset together let you locate the record that broke without scanning the file again. The leader's first five bytes are the declared record length, so the rest is arithmetic:

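import mrrc

def isolate_record(path: str, err: mrrc.MrrcException) -> bytes:
    """Return the raw ISO 2709 bytes of the record that triggered err."""
    # Stream offset of the defect, minus its offset inside the record,
    # is the stream offset at which the record starts.
    record_start = err.byte_offset - err.record_byte_offset
    with open(path, "rb") as fh:
        fh.seek(record_start)
        leader = fh.read(24)
        declared_length = int(leader[:5])  # leader bytes 0-4: record length
        fh.seek(record_start)
        return fh.read(declared_length)


# err is the mrrc.MrrcException caught while reading harvest.mrc in strict mode.
with open(f"quarantine/rec-{err.record_index}-{err.code}.mrc", "wb") as out:
    out.write(isolate_record("harvest.mrc", err))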

A 2 GB monthly delivery fails at record 1247 of 50,000. You carve that record into quarantine/, mail it back to the vendor with the detailed() output as the issue body, and let the rest of the harvest ingest. The next vendor delivery's failures land in the same quarantine directory tagged with their codes — easy to triage.
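To make the offset arithmetic concrete, here is a minimal sketch that runs the same steps against an in-memory stream, with no mrrc dependency. The build_record and carve helpers and the two offsets are made up for illustration; with a real failure the offsets would come from err.byte_offset and err.record_byte_offset.

```python
import io


def build_record(payload: bytes) -> bytes:
    """Hypothetical helper: fake record = 24-byte leader + payload + 0x1D.

    Only the first five leader bytes (the declared length) are meaningful
    here; the rest of the leader is space padding, not a valid MARC leader.
    """
    total = 24 + len(payload) + 1
    leader = f"{total:05d}".encode("ascii").ljust(24, b" ")
    return leader + payload + b"\x1d"


def carve(stream, byte_offset: int, record_byte_offset: int) -> bytes:
    """Same arithmetic as isolate_record, against any seekable binary stream."""
    record_start = byte_offset - record_byte_offset
    stream.seek(record_start)
    declared_length = int(stream.read(5))  # leader bytes 0-4
    stream.seek(record_start)
    return stream.read(declared_length)


first = build_record(b"first record body")
second = build_record(b"second record body")
stream = io.BytesIO(first + second)

# Pretend a parser reported a defect 30 bytes into the second record.
byte_offset = len(first) + 30
carved = carve(stream, byte_offset, record_byte_offset=30)
print(carved == second)
```

One caveat on the approach: it trusts the declared length in the leader. When the leader length is itself the defect, or the record is truncated, the carved bytes will be wrong or short, so those cases probably deserve a fallback such as reading up to the next 0x1D terminator.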

Code-prefix policies for ingest pipelines

Stable codes (the policy is documented) let you write ingest decisions as data, not as a chain of isinstance checks:

import mrrc

# Encoding noise during a MARC-8 → UTF-8 transition: warn, accept.
# Subfield/indicator damage: vendor-side corruption, reject and report.
# Directory damage: structural — quarantine for manual review.
# Truncation: probably a transfer issue — ask for a redelivery.
POLICY: dict[str, str] = {
    "E0": "fail",        # leader/stream — record is unreadable
    "E1": "quarantine",  # directory/field header — needs eyeballs
    "E2": "reject",      # subfield/indicator — vendor data bug
    "E3": "warn",        # encoding — usually transition noise
    "E4": "fail",        # writer-side — programmer error, not data
}

def disposition(err: mrrc.MrrcException) -> str:
    return POLICY.get(err.code[:2], "fail")

Because no code is ever renumbered, this dict survives mrrc upgrades without re-validation. New codes added within an existing range inherit that range's policy automatically. As of v0.8.2 the defined codes are E001–E007, E099, E101, E105–E106, E201–E202, E301, E401–E402, and E404.
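As a quick illustration of the policy-as-data idea, the sketch below applies the same prefix lookup to a handful of stand-in error objects and tallies the outcomes. The SimpleNamespace objects are placeholders for real mrrc exceptions (only their code attribute is used), and E999 is an invented code included to show the default.

```python
from collections import Counter
from types import SimpleNamespace

POLICY = {
    "E0": "fail",
    "E1": "quarantine",
    "E2": "reject",
    "E3": "warn",
    "E4": "fail",
}


def disposition(err) -> str:
    # Look up the two-character range prefix; unknown ranges fail closed.
    return POLICY.get(err.code[:2], "fail")


# Stand-ins for caught exceptions; a real pipeline would pass the
# exception objects themselves.
errors = [SimpleNamespace(code=c) for c in ("E301", "E201", "E301", "E005", "E106")]

tally = Counter(disposition(e) for e in errors)
print(dict(tally))

# A code from a range the policy has never seen falls back to "fail".
unknown = disposition(SimpleNamespace(code="E999"))
```

Defaulting unknown prefixes to "fail" is a design choice rather than a requirement: it means a brand-new range stops the pipeline until someone decides what it should do.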

Codes as cross-team identifiers

Three places stable codes change how teams talk about MARC bugs:

Slack / chat threads. "We're seeing a spike in E301 from $vendor" is a search-engine-able phrase; "got a UTF-8 thing" is not. The slug is the human form of the same identity, so f"{e.code} ({e.slug})" reads naturally in either an alert or a ticket title.
Issue trackers. A snippet for a Jira / Linear / GitHub Issues template:

body = "\n".join([
    f"## {err.code} — {err.slug}",
    f"Record: {err.record_index} 001: {err.record_control_number} field: {err.field_tag}",
    f"Docs: {err.help_url()}",
    "",
    "```",
    err.detailed(),
    "```",
])

Cross-language teams. MarcError::detailed() in Rust and MrrcException.detailed() in Python emit byte-for-byte identical output, so a Rust pipeline's CI log and a Python ingest worker's Sentry event can be visually compared without a translator. Same codes, same hex dump format, same exact characters.

Worker pools: rich failures across processes

The structured exception round-trips through pickle with all positional attributes intact, which means the worker that parses record N can hand a full diagnostic back to the parent that aggregates results — no JSON serialization in the middle, no message parsing on the receiving end.

import concurrent.futures, collections, mrrc

def parse_one(record_bytes: bytes) -> mrrc.Record:
    # strict so a defect raises (the v0.8.2 default is permissive);
    # strict_marc so indicator-level defects (E201) are caught too.
    return next(mrrc.MARCReader(record_bytes, recovery_mode="strict",
                                validation_level="strict_marc"))

def shard(path: str) -> list[bytes]:
    """Split the harvest into chunks of exactly one record each."""
    ...  # left as an exercise; record boundaries are 0x1D bytes

by_code: collections.Counter[str] = collections.Counter()
samples: dict[str, mrrc.MrrcException] = {}

with concurrent.futures.ProcessPoolExecutor() as pool:
    for fut in concurrent.futures.as_completed(
        pool.submit(parse_one, b) for b in shard("harvest.mrc")
    ):
        try:
            rec = fut.result()
            ingest(rec)
        except mrrc.MrrcException as e:
            by_code[e.code] += 1
            samples.setdefault(e.code, e)  # one example per code

for code, n in by_code.most_common():
    s = samples[code]
    print(f"{code} {s.slug:24} {n:>5}  e.g. record {s.record_index} field {s.field_tag}")

E201 invalid_indicator           12  e.g. record 847 field 245
E301 utf8_invalid                 4  e.g. record 213 field 100
E106 invalid_field                1  e.g. record 1024 field 008

The samples dict holds live exception objects — you can call .detailed(), .to_dict(), or .help_url() on any of them later in the pipeline without re-parsing.

This works because each worker processes one record at a time, so the "strict-mode iterator stops at the first failure" issue goes away.
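The shard() helper above is left as an exercise. Since parse_one reads only the first record from whatever bytes it receives, the chunks need to hold one record each. Here is one possible sketch of that split, relying only on the 0x1D record terminator; split_records is a hypothetical name, not part of mrrc.

```python
from typing import Iterator

RECORD_TERMINATOR = b"\x1d"


def split_records(data: bytes) -> Iterator[bytes]:
    """Yield one chunk per record, cutting after each 0x1D terminator."""
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            # Trailing bytes with no terminator: pass them along unchanged
            # so the parser can report the truncation itself.
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1


chunks = list(split_records(b"aaa\x1dbb\x1dcc"))
print(len(chunks))
```

This takes the whole harvest as one bytes object, which is fine for a sketch but not for a multi-gigabyte delivery; a streaming or mmap-based variant would be the next step. It also assumes a stray 0x1D never shows up inside damaged data, which is not guaranteed for badly corrupted files.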

If you'd rather stay in a single process, MARCReader.iter_with_errors() drains the whole stream in order without losing anything. It yields (record, errors) tuples — errors carries any diagnostics recorded during a salvaged record's parse — and under permissive=True an unsalvageable record arrives as (None, [exception]) rather than being silently skipped, so the same per-code tally works against one reader:

reader = mrrc.MARCReader("harvest.mrc", permissive=True)
for record, errors in reader.iter_with_errors():
    if record is None:                 # unsalvageable — the fatal exception
        by_code[errors[0].code] += 1
        samples.setdefault(errors[0].code, errors[0])
    else:
        ingest(record)
        for e in errors:               # salvaged, but defects were recorded
            by_code[e.code] += 1

(recovery_mode="lenient" is the other single-pass option — it salvages partial records rather than surfacing the exceptions.)

When you drain a whole stream in lenient / permissive, max_errors=N is the circuit-breaker for "this delivery is too broken to bother finishing": once more than N recovered defects accumulate, the next read raises mrrc.FatalReaderError (E099) instead of plodding on. Pass max_errors=0 (or omit it) to disable the cap.

Schema-versioned diagnostics for long-lived dashboards

to_dict() includes schema_version: 1 at the top of the document. That's not a no-op — it's the affordance that lets you ship error records into Elastic / Datadog / a JSONB column today and still parse them cleanly when a future mrrc release evolves the shape. A six-month retention dashboard counting code == "E201" per vendor per week stays correct across mrrc upgrades because (a) codes never get renumbered and (b) the schema version tells you when you'd need to migrate.

The bytes fields render under _hex-suffixed keys (found_hex, bytes_near_hex) so the dict is JSON-serializable as-is, no custom encoder, no base64. Drop it straight into a structured-logging handler:

logger.error("marc_parse_error", extra={"mrrc": err.to_dict()})

The full shape is documented, including the _cause flattening and the include_traceback=True flag.
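On the consuming side, the schema version can act as a gate before anything reads the rest of the document. The sketch below is one way that might look. The sample dict is hand-written using only key names mentioned in this post, and its values (including the found_hex format) are guesses; the real shape is whatever the mrrc docs say. load_diagnostic is a hypothetical helper.

```python
import json

# Versions this consumer knows how to read; extend when a migration lands.
SUPPORTED_SCHEMA_VERSIONS = {1}


def load_diagnostic(line: str) -> dict:
    """Parse one stored diagnostic, refusing shapes we don't understand."""
    doc = json.loads(line)
    version = doc.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported diagnostic schema_version: {version!r}")
    return doc


# Hand-written stand-in for err.to_dict(); values are illustrative only.
sample = {
    "schema_version": 1,
    "code": "E301",
    "slug": "utf8_invalid",
    "found_hex": "e9",
}

line = json.dumps(sample)  # what a log shipper or JSONB column would store
doc = load_diagnostic(line)
print(doc["code"], doc["slug"])
```

Raising on an unknown version is deliberately strict. A dashboard might instead route those documents to a "needs migration" bucket so that counts by code keep flowing.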

Reader-side inspection, pymarc-style

pymarc exposes MARCReader.current_exception / current_chunk for inspecting the last failure on the reader itself. As of v0.8.2, mrrc exposes the same two accessors, so pymarc code that reads them ports over directly. Under permissive=True, when a read yields None for an unsalvageable record, current_exception holds the swallowed exception and current_chunk holds that record's raw bytes:

reader = mrrc.MARCReader("harvest.mrc", permissive=True)
for record in reader:
    if record is None:
        err = reader.current_exception   # the swallowed MrrcException
        raw = reader.current_chunk       # bytes of the failed record
        quarantine(raw, err)
        continue
    ingest(record)

(Recoverable defects don't yield None under permissive. They come back as a record with the diagnostics attached on record.errors; the None path is for records that can't be salvaged at all.) The per-exception positional metadata covers the same ground when you catch in strict mode, and the worker-pool / per-shard pattern above scales to streams that a single reader struggles with (for example, one process draining a 100M-record file). The reader-side accessors are there, though, when a pymarc workflow leans on them. The pymarc compatibility note in the docs covers the remaining differences.
