mrrc v0.8.2 is out! The release notes are pretty lengthy. There's a lot of what I hope will be good stuff in there. v0.8.2 in particular included a lot of housekeeping - simplifying CI, dealing with some API asymmetries, tidying up docs, and generally making it harder to repeat some of the mistakes I've made without overburdening CI.
But the biggest thread through the 0.8 series has been a set of improvements for error handling:
- detailed info about record errors available through the exception mechanism
- explicit re-use and preservation of the pymarc exception hierarchy
- extension of the pymarc exception hierarchy with more-detailed types
- a defined set of error types and matching codes for easy reference/lookup
- full docs on the above
Unless I'm missing something, I suspect that some of this is new for MARC libraries and might be very helpful for managing bibliographic workflows that occasionally run into rogue records.
There are two new docs at Error handling and Error codes which cover the basics and details, respectively. Because this is all so new, though, I thought it might help to introduce them by asking the bot to work up an additional set of detailed practical examples of how to get more value out of the new error-handling approach.
Another set of improvements in this release has to do with dipping a toe into some light formal methods. I'll write more about that later, and in the meantime you can find more details in the docs. But for now, it seems worth noting that some of the lower-level errors in the new hierarchy came out of initial uses of the new testing methods.
One fun thing to note: I've been sitting on this post for a while because when I originally went to publish v0.8, and was all excited about the new error handling approach, I was shocked to find (after not watching the diffs very closely during one long "let's just get it done" prompting session) that although the error model was fully designed and documented, the library didn't actually wire up error handling! It just wasn't in there. I spent so much time thinking about how it should feel from a dev experience perspective that I forgot to make sure that it actually works! Totally in keeping with the agentic coding experience, and I should've known better.
Once I recognized my failure - in part by having the bot draft the examples below and having it tell me "hey wait it looks like something's missing" - I had a detailed code review process generate a ton of beads/issues to clean up, so that we had a clear target to aim for in getting the whole thing working. The v0.8.1 release was largely about actually wiring up the errors throughout the library. A lot of other things popped up in the process, and I deferred those to v0.8.2. So, now I get to finalize this writeup and post it. Phew.
Check your bot output, friends.
Everything below this paragraph was written by claude/opus. Some of it is a little ... fanciful, but hopefully it's still usefully illustrative.
When records break: two libraries
Real harvests break in a handful of predictable ways — a record cut off
mid-transfer, a leader whose declared length is wrong, a subfield carrying
bytes that aren't valid UTF-8. Both libraries are forgiving by default:
pymarc yields None for a record it can't parse and stashes the failure on
reader.current_exception; mrrc's default permissive mode does much the
same (yielding None, or a record with diagnostics on record.errors).
The difference shows up when you ask mrrc to be strict — it raises a typed
exception, reusing pymarc's class names so existing except clauses keep
working, and its detailed() tells you exactly what broke and where.
A truncated record
A delivery gets cut off mid-record — a common FTP or concatenation mishap. pymarc flags it on the reader:
>>> reader = pymarc.MARCReader(open("harvest.mrc", "rb"))
>>> [r is not None for r in reader]
[True, False]
>>> reader.current_exception
TruncatedRecord()
>>> print(reader.current_exception)
Record length in leader is greater than the length of data
mrrc, with recovery_mode="strict", raises the same-named exception with
the specifics filled in:
>>> reader = mrrc.MARCReader("harvest.mrc", recovery_mode="strict")
>>> try:
...     for record in reader: pass
... except mrrc.TruncatedRecord as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E005 truncated_record
TruncatedRecord at record 2
length: expected 57 bytes, found 6
byte offset: 0x6C (108) in stream
record-relative: byte 24
A bad record length
A hand-edited or naively concatenated file leaves a leader whose first five bytes aren't a valid length. pymarc again surfaces it on the reader:
>>> reader = pymarc.MARCReader(open("harvest.mrc", "rb"))
>>> next(reader) is None
True
>>> print(reader.current_exception)
Invalid record length in first 5 bytes of record
mrrc points straight at the offending bytes:
>>> reader = mrrc.MARCReader("harvest.mrc", recovery_mode="strict")
>>> try: next(reader)
... except mrrc.RecordLengthInvalid as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E001 record_length_invalid
RecordLengthInvalid at record 1
byte offset: 0x0 (0) in stream
bytes near offset 0x0:
0x0000: 41 42 43 44 45 6e 61 6d 20 20 32 32 30 30 30 34 |ABCDEnam 220004|
^^ offending byte
Invalid encoding
The most common gremlin of all: a subfield carries a byte that isn't valid UTF-8 — legacy MARC-8 data, a vendor's encoding slip. Here both libraries are lenient by default; each substitutes a replacement character and moves on (the exact substitution differs):
>>> next(pymarc.MARCReader(open("harvest.mrc", "rb")))["245"]["a"]
'Caf ̌Society'
>>> next(mrrc.MARCReader("harvest.mrc"))["245"]["a"] # U+FFFD
'Caf� Society'
Encoding isn't a structural problem, so — unlike truncation — pymarc
doesn't record it on current_exception; the substituted record just flows
through. If you want it flagged, mrrc's strict_marc validation turns it
into an error with the byte pinpointed:
>>> reader = mrrc.MARCReader("harvest.mrc",
... recovery_mode="strict",
... validation_level="strict_marc")
>>> try: next(reader)
... except mrrc.MrrcException as e:
...     print(e.code, e.slug)
...     print(e.detailed())
E301 utf8_invalid
EncodingError at record 1, field 245
001: ocm01234567
byte offset: 0x3D (61) in stream
bytes near offset 0x3D:
0x002D: 30 31 32 1e 6f 63 6d 30 31 32 33 34 35 36 37 1e |012.ocm01234567.|
0x003D: 31 30 1f 61 43 61 66 e9 20 53 6f 63 69 65 74 79 |10.aCaf. Society|
^^ offending byte
Across all three, the exception class names line up with pymarc's
(TruncatedRecord, RecordLengthInvalid, and the rest), so a pymarc-style
except ports straight over. What mrrc adds is the detailed() view —
record number, byte offset, expected-vs-found, and a hex dump you can paste
into a Slack thread — and the stable e.code / e.slug shown above
(E005, E001, E301), which never get renumbered across releases.
Carve out the bad record without re-parsing the harvest
record_byte_offset and byte_offset together let you locate the
record that broke without scanning the file again. The leader's first
five bytes are the declared record length, so the rest is arithmetic:
import mrrc

def isolate_record(path: str, err: mrrc.MrrcException) -> bytes:
    """Return the raw ISO 2709 bytes of the record that triggered err."""
    # Absolute offset of the defect minus its offset within the record
    # gives the position where the record starts in the file.
    record_start = err.byte_offset - err.record_byte_offset
    with open(path, "rb") as fh:
        fh.seek(record_start)
        leader = fh.read(24)
        declared_length = int(leader[:5])  # leader bytes 0-4: record length
        fh.seek(record_start)
        return fh.read(declared_length)

with open(f"quarantine/rec-{err.record_index}-{err.code}.mrc", "wb") as out:
    out.write(isolate_record("harvest.mrc", err))
A 2 GB monthly delivery fails at record 1247 of 50,000. You carve
that record into quarantine/, mail it back to the vendor with the
detailed() output as the issue body, and let the rest of the harvest
ingest. The next vendor delivery's failures land in the same
quarantine directory tagged with their codes — easy to triage.
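To see the offset arithmetic work without a real harvest, here is a minimal sketch that uses only the standard library. Everything in it is a stand-in: DemoError is a hypothetical class carrying just the two offsets, and fake_record builds a blob with a 24-byte leader whose first five bytes are the declared length, which is enough for the arithmetic but is not a valid MARC record.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class DemoError:
    """Hypothetical stand-in carrying the two offsets used above."""
    byte_offset: int         # absolute position of the defect in the stream
    record_byte_offset: int  # position of the defect within its own record


def fake_record(payload: bytes) -> bytes:
    """Build a leader-plus-payload blob; only the length digits are meaningful."""
    body = payload + b"\x1d"                      # 0x1D is the record terminator
    length = 24 + len(body)
    leader = f"{length:05d}".encode("ascii").ljust(24, b" ")
    return leader + body


def carve(path: str, err: DemoError) -> bytes:
    """Same arithmetic as isolate_record, against the stand-in error."""
    record_start = err.byte_offset - err.record_byte_offset
    with open(path, "rb") as fh:
        fh.seek(record_start)
        declared_length = int(fh.read(5))
        fh.seek(record_start)
        return fh.read(declared_length)


first, second = fake_record(b"first"), fake_record(b"second record")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "harvest.mrc")
    with open(path, "wb") as fh:
        fh.write(first + second)
    # Pretend the defect sits 30 bytes into the second record.
    err = DemoError(byte_offset=len(first) + 30, record_byte_offset=30)
    carved = carve(path, err)
```

Note that this approach trusts the leader: if the first five bytes are not digits (the E001 case above), int() raises ValueError and you would need another way to find the record's end, such as scanning for the 0x1D terminator.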
Code-prefix policies for ingest pipelines
Stable codes (the policy is documented)
let you write ingest decisions as data, not as a chain of isinstance
checks:
import mrrc
# Encoding noise during a MARC-8 → UTF-8 transition: warn, accept.
# Subfield/indicator damage: vendor-side corruption, reject and report.
# Directory damage: structural — quarantine for manual review.
# Truncation: probably a transfer issue — ask for a redelivery.
POLICY: dict[str, str] = {
"E0": "fail", # leader/stream — record is unreadable
"E1": "quarantine", # directory/field header — needs eyeballs
"E2": "reject", # subfield/indicator — vendor data bug
"E3": "warn", # encoding — usually transition noise
"E4": "fail", # writer-side — programmer error, not data
}
def disposition(err: mrrc.MrrcException) -> str:
    return POLICY.get(err.code[:2], "fail")
Because no code is ever renumbered, this dict survives mrrc upgrades without re-validation. The codes defined as of v0.8.2 are E001–E007, E099, E101, E105–E106, E201–E202, E301, E401–E402, and E404; any new code added within an existing range inherits that range's policy automatically, because the lookup keys on the two-character prefix.
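As a quick check of the prefix lookup, the sketch below tallies dispositions for a list of bare code strings, so it runs without mrrc installed. The function name disposition_for is illustrative, and E302 and E900 are made-up codes that do not exist in v0.8.2; they are there only to show a new code inheriting its range's policy and an unknown range falling back to "fail".

```python
from collections import Counter

POLICY = {
    "E0": "fail",
    "E1": "quarantine",
    "E2": "reject",
    "E3": "warn",
    "E4": "fail",
}


def disposition_for(code: str) -> str:
    """Look up the policy by the code's two-character prefix."""
    return POLICY.get(code[:2], "fail")


# E302 and E900 are hypothetical codes used only for illustration.
seen = ["E301", "E301", "E201", "E005", "E106", "E302", "E900"]
tally = Counter(disposition_for(code) for code in seen)
print(dict(tally))
```

The default of "fail" for an unrecognized prefix is the conservative choice: a code range the policy has never seen stops the pipeline rather than slipping through as a warning.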
Codes as cross-team identifiers
Three places stable codes change how teams talk about MARC bugs:
- Slack / chat threads. "We're seeing a spike in E301 from $vendor"
is a search-engine-able phrase; "got a UTF-8 thing" is not. The
slug is the human form of the same identity, so f"{e.code} ({e.slug})" reads naturally in either an alert or a ticket title.
- Issue trackers. A snippet for a Jira / Linear / GitHub Issues template:
body = "\n".join([
f"## {err.code} — {err.slug}",
f"Record: {err.record_index} 001: {err.record_control_number} field: {err.field_tag}",
f"Docs: {err.help_url()}",
"",
"```",
err.detailed(),
"```",
])
- Cross-language teams. MarcError::detailed() in Rust and MrrcException.detailed() in Python emit byte-for-byte identical output, so a Rust pipeline's CI log and a Python ingest worker's Sentry event can be visually compared without a translator. Same codes, same hex dump format, same exact characters.
Worker pools: rich failures across processes
The structured exception round-trips through pickle with all positional attributes intact, which means the worker that parses record N can hand a full diagnostic back to the parent that aggregates results — no JSON serialization in the middle, no message parsing on the receiving end.
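For readers who have not pickled exceptions before, here is a minimal sketch of what "round-trips with attributes intact" means. DemoMarcError is a hypothetical stand-in, not mrrc's class, and says nothing about how mrrc implements this; the attribute names simply mirror the ones used elsewhere in this post.

```python
import pickle


class DemoMarcError(Exception):
    """Hypothetical stand-in for a structured parse error (not mrrc's class)."""

    def __init__(self, message, code=None, record_index=None, byte_offset=None):
        # Only `message` lands in Exception.args. Pickle rebuilds the object
        # from args, then restores the remaining attributes from __dict__.
        super().__init__(message)
        self.code = code
        self.record_index = record_index
        self.byte_offset = byte_offset


original = DemoMarcError("invalid UTF-8", code="E301", record_index=213, byte_offset=61)
clone = pickle.loads(pickle.dumps(original))
print(clone.code, clone.record_index, clone.byte_offset)
```

This is the same serialization path that concurrent.futures uses to carry a worker's exception back to the parent process, which is why the parent can read e.code and e.record_index directly.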
import concurrent.futures, collections, mrrc
def parse_one(record_bytes: bytes) -> mrrc.Record:
    # strict so a defect raises (the v0.8.2 default is permissive);
    # strict_marc so indicator-level defects (E201) are caught too.
    return next(mrrc.MARCReader(record_bytes, recovery_mode="strict",
                                validation_level="strict_marc"))

def shard(path: str, n: int) -> list[bytes]:
    """Split harvest into n chunks of one-or-more records each."""
    ...  # left as an exercise; record boundaries are 0x1D bytes

by_code: collections.Counter[str] = collections.Counter()
samples: dict[str, mrrc.MrrcException] = {}

with concurrent.futures.ProcessPoolExecutor() as pool:
    for fut in concurrent.futures.as_completed(
        pool.submit(parse_one, b) for b in shard("harvest.mrc", n=8)
    ):
        try:
            rec = fut.result()
            ingest(rec)
        except mrrc.MrrcException as e:
            by_code[e.code] += 1
            samples.setdefault(e.code, e)  # one example per code

for code, n in by_code.most_common():
    s = samples[code]
    print(f"{code} {s.slug:24} {n:>5} e.g. record {s.record_index} field {s.field_tag}")
E201 invalid_indicator 12 e.g. record 847 field 245
E301 utf8_invalid 4 e.g. record 213 field 100
E106 invalid_field 1 e.g. record 1024 field 008
The samples dict holds live exception objects — you can call
.detailed(), .to_dict(), or .help_url() on any of them later in
the pipeline without re-parsing.
This works because each worker processes one record at a time, so the "strict-mode iterator stops at the first failure" issue goes away.
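The splitting step that the example leaves as an exercise can be sketched with the standard library alone. This is one possible approach, not mrrc's: the helper name iter_raw_records is made up, it works on bytes already in memory, and it relies purely on the 0x1D record terminator, so a record whose terminator is missing will run into its neighbor.

```python
from typing import Iterator

RECORD_TERMINATOR = b"\x1d"


def iter_raw_records(data: bytes) -> Iterator[bytes]:
    """Yield each record's bytes, terminator included."""
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            # Trailing bytes with no terminator: likely a truncated record.
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1


# Toy stream: two terminated blobs and one cut off mid-record.
stream = b"00012aaaaaa\x1d00009bbb\x1d00020trunc"
chunks = list(iter_raw_records(stream))
print(len(chunks), chunks[-1])
```

Feeding each yielded chunk to its own parse call is what gives the one-record-per-task behavior; for a multi-gigabyte file you would want a streaming variant that reads in blocks instead of loading everything at once.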
If you'd rather stay in a single process, MARCReader.iter_with_errors()
drains the whole stream in order without losing anything. It yields
(record, errors) tuples — errors carries any diagnostics recorded
during a salvaged record's parse — and under permissive=True an
unsalvageable record arrives as (None, [exception]) rather than being
silently skipped, so the same per-code tally works against one reader:
reader = mrrc.MARCReader("harvest.mrc", permissive=True)
for record, errors in reader.iter_with_errors():
    if record is None:  # unsalvageable: the fatal exception
        by_code[errors[0].code] += 1
        samples.setdefault(errors[0].code, errors[0])
    else:
        ingest(record)
        for e in errors:  # salvaged, but defects were recorded
            by_code[e.code] += 1
(recovery_mode="lenient"
is the other single-pass option — it salvages partial records rather
than surfacing the exceptions.)
When you drain a whole stream in lenient / permissive, max_errors=N
is the circuit-breaker for "this delivery is too broken to bother
finishing": once more than N recovered defects accumulate, the next read
raises mrrc.FatalReaderError (E099) instead of plodding on. Pass
max_errors=0 (or omit it) to disable the cap.
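The circuit-breaker behavior described above can be modeled in a few lines. This is only a sketch of one reading of that paragraph, not mrrc's implementation: TooManyErrors stands in for mrrc.FatalReaderError, the function name capped is made up, and a None item is used here as a simplified marker for "one defect was recorded".

```python
from typing import Iterable, Iterator, Optional


class TooManyErrors(RuntimeError):
    """Hypothetical stand-in for mrrc.FatalReaderError (E099)."""


def capped(results: Iterable[Optional[str]], max_errors: int = 0) -> Iterator[Optional[str]]:
    """Pass results through; None marks a defect. max_errors=0 disables the cap."""
    errors = 0
    for item in results:
        # Checked before each read, so the read after the cap is exceeded fails.
        if max_errors and errors > max_errors:
            raise TooManyErrors(f"{errors} defects exceeds max_errors={max_errors}")
        if item is None:
            errors += 1
        yield item


seen = []
tripped = False
try:
    for item in capped(["a", None, None, None, "b"], max_errors=2):
        seen.append(item)
except TooManyErrors:
    tripped = True

uncapped = list(capped([None] * 5))  # cap disabled: everything flows through
```

With max_errors=2, the third defect is still delivered and the read after it raises, which matches the "once more than N" wording.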
Schema-versioned diagnostics for long-lived dashboards
to_dict() includes schema_version: 1 at the top of the document.
That's not a no-op — it's the affordance that lets you ship error
records into Elastic / Datadog / a JSONB column today and still parse
them cleanly when a future mrrc release evolves the shape. A six-month
retention dashboard counting code == "E201" per vendor per week stays
correct across mrrc upgrades because (a) codes never get renumbered
and (b) the schema version tells you when you'd need to migrate.
The bytes fields render under _hex-suffixed keys (found_hex,
bytes_near_hex) so the dict is JSON-serializable as-is, no custom
encoder, no base64. Drop it straight into a structured-logging
handler:
logger.error("marc_parse_error", extra={"mrrc": err.to_dict()})
The full shape is documented, including the _cause flattening and the include_traceback=True flag.
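On the consuming side, the schema version is what lets a dashboard or migration script refuse documents it does not understand. The sketch below is one way to do that with the standard library. The sample document is hand-written: only schema_version and the _hex key naming come from the text above, while the code and slug keys and the plain-hex encoding of found_hex are assumptions for illustration.

```python
import json

SUPPORTED_SCHEMA_VERSIONS = {1}


def load_diagnostic(raw: str) -> dict:
    """Parse a stored diagnostic, rejecting schema versions we do not know."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported diagnostic schema_version: {version!r}")
    return doc


# Hand-written sample; key names other than schema_version and found_hex
# are assumed, as is the plain hex-string encoding.
stored = json.dumps({
    "schema_version": 1,
    "code": "E301",
    "slug": "utf8_invalid",
    "found_hex": "e9",
})
doc = load_diagnostic(stored)
found = bytes.fromhex(doc["found_hex"])  # back to raw bytes when needed
print(doc["code"], found)
```

Failing loudly on an unknown version is deliberate: a silent best-effort parse is how a long-lived dashboard ends up quietly miscounting after an upgrade.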
Reader-side inspection, pymarc-style
pymarc exposes MARCReader.current_exception / current_chunk for
inspecting the last failure on the reader itself. As of v0.8.2, mrrc
exposes the same two accessors, so pymarc code that reads them ports over
directly. Under permissive=True, when a read yields None for an
unsalvageable record, current_exception holds the swallowed exception
and current_chunk holds that record's raw bytes:
reader = mrrc.MARCReader("harvest.mrc", permissive=True)
for record in reader:
    if record is None:
        err = reader.current_exception  # the swallowed MrrcException
        raw = reader.current_chunk  # bytes of the failed record
        quarantine(raw, err)
        continue
    ingest(record)
(Recoverable defects don't yield None under permissive — they come
back as a record with the diagnostics attached on record.errors; the
None path is for records that can't be salvaged at all.) The
per-exception positional metadata covers the same ground when you catch
in strict mode, and the worker-pool / per-shard pattern above scales to
streams a single reader can't (a single process draining a 100M-record
file) — but the reader-side accessors are there when a pymarc workflow
leans on them. The
pymarc compatibility note
in the docs covers the remaining differences.
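The loop above calls a quarantine() helper that is never defined. A minimal sketch of what it might look like follows, reusing the rec-{record_index}-{code}.mrc naming from the carve-out example. The root parameter and the .txt sidecar are assumptions of this sketch, and DemoError is a stand-in so the example runs without mrrc.

```python
import os
import tempfile


class DemoError(Exception):
    """Hypothetical stand-in exposing the attributes used in the file name."""

    def __init__(self, message: str, code: str, record_index: int):
        super().__init__(message)
        self.code = code
        self.record_index = record_index


def quarantine(raw: bytes, err: Exception, root: str = "quarantine") -> str:
    """Write the raw bytes plus a text sidecar; return the .mrc path."""
    os.makedirs(root, exist_ok=True)
    stem = os.path.join(root, f"rec-{err.record_index}-{err.code}")
    with open(stem + ".mrc", "wb") as out:
        out.write(raw)
    with open(stem + ".txt", "w", encoding="utf-8") as out:
        # With a real mrrc exception, err.detailed() would be the richer choice.
        out.write(str(err) + "\n")
    return stem + ".mrc"


with tempfile.TemporaryDirectory() as tmp:
    err = DemoError("record length invalid", code="E001", record_index=1)
    saved = quarantine(b"ABCDEnam  22", err, root=tmp)
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as fh:
        saved_bytes = fh.read()
    sidecar_exists = os.path.exists(saved[:-4] + ".txt")
```

Keeping the diagnostic next to the bytes means the quarantine directory is self-describing: whoever opens it later does not need the original log to know why a record landed there.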
See also
- Error handling reference: the API surface, full field reference, recovery-mode semantics, and the pymarc compatibility table.
- Error codes reference: every code as of v0.8.2 with
Context / Applies to / Populates triples and the stability policy.
- Python API reference: MARCReader constructor flags, including permissive=True and the recovery_mode strategies.