User Guide

Overview

iob2labels converts character-offset NER span annotations (the format used by tools like Prodigy, Label Studio, and Doccano) into integer label sequences aligned to any HuggingFace-compatible tokenizer. At inference time, it converts model predictions back into span annotations.

The library depends only on tokenizers (HuggingFace Rust backend) and pydantic. No torch or transformers required.

Installation

uv add iob2labels

Or with pip:

pip install iob2labels

Setting Up the Encoder

The IOB2Encoder is the main interface. It requires two arguments: the entity class names and a tokenizer.

from iob2labels import IOB2Encoder

encoder = IOB2Encoder(
    labels=["actor", "character", "plot"],
    tokenizer="bert-base-uncased",
)

The labels Parameter

Pass a list of entity class names as strings. These are the NER categories in your annotation data. The encoder generates IOB2 tags from these labels:

  • Each entity class produces 2 labels: B-{LABEL} (beginning) and I-{LABEL} (inside)
  • Plus the O (outside) class
  • Total label count is always (n * 2) + 1, where n is the number of entity classes

encoder.label_map
# {'O': 0, 'B-ACTOR': 1, 'I-ACTOR': 2, 'B-CHARACTER': 3, 'I-CHARACTER': 4, 'B-PLOT': 5, 'I-PLOT': 6}
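
The mapping above follows a regular pattern. As a minimal sketch of how such a map could be built, assuming the uppercasing and ordering seen in the output above (build_label_map is a hypothetical helper for illustration, not part of the iob2labels API):

```python
def build_label_map(labels):
    """Build an IOB2 tag-to-integer map: O first, then a B-/I- pair per class.

    Hypothetical helper; the real library may construct its map differently.
    """
    label_map = {"O": 0}
    for name in labels:
        tag = name.upper()
        label_map[f"B-{tag}"] = len(label_map)  # next free integer
        label_map[f"I-{tag}"] = len(label_map)
    return label_map


label_map = build_label_map(["actor", "character", "plot"])
print(label_map)
print(len(label_map))  # (3 * 2) + 1 = 7
```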

The tokenizer Parameter

The tokenizer argument accepts three forms:

Checkpoint string — downloads the tokenizer from HuggingFace Hub. A UserWarning is emitted for checkpoints not in the tested list.

encoder = IOB2Encoder(labels=labels, tokenizer="bert-base-uncased")

tokenizers.Tokenizer instance — used directly.

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=labels, tokenizer=tok)

transformers.PreTrainedTokenizerFast — the underlying tokenizers.Tokenizer is unwrapped automatically via the .backend_tokenizer attribute.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=labels, tokenizer=tok)

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_token | int | -100 | Label value for special tokens ([CLS], [SEP], etc.). PyTorch's CrossEntropyLoss ignores this value by default. |
| conversion_check | bool | True | Verify encoding correctness with a round-trip check after every encoding. Disable for production performance. |
| max_length | int or None | 512 | Maximum token sequence length. Entities beyond the truncation boundary are skipped. Set to None to disable truncation. |

Encoding Annotations

Single Annotation

Call the encoder directly with text and spans:

labels = encoder(
    text="Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    spans=[
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64},
    ]
)
# [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]

Each integer in the output corresponds to a token from the tokenizer:

  • -100: special tokens ([CLS], [SEP]), ignored during loss computation
  • 0: O (outside any entity)
  • 1: B-ACTOR (beginning of an actor entity)
  • 2: I-ACTOR (inside/continuation of an actor entity)
  • And so on for each entity class
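
To make the alignment concrete, here is a rough sketch of how character spans can be mapped onto token offsets. It is not the library's implementation: encode_labels is a hypothetical function, the token offsets are written by hand to match the token boundaries implied by the output above, and a token is only labeled when it lies fully inside a span.

```python
def encode_labels(offsets, special_mask, spans, label_map, ignore_token=-100):
    """Assign an IOB2 label id to each token (illustrative sketch only).

    offsets:      (start, end) character offsets, one pair per token
    special_mask: 1 for special tokens such as [CLS]/[SEP], else 0
    """
    labels = []
    for (tok_start, tok_end), is_special in zip(offsets, special_mask):
        if is_special:
            labels.append(ignore_token)
            continue
        tag = "O"
        for span in spans:
            # The token belongs to a span when it lies fully inside it.
            if span["start"] <= tok_start and tok_end <= span["end"]:
                prefix = "B" if tok_start == span["start"] else "I"
                tag = f"{prefix}-{span['label'].upper()}"
                break
        labels.append(label_map[tag])
    return labels


label_map = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2, "B-CHARACTER": 3,
             "I-CHARACTER": 4, "B-PLOT": 5, "I-PLOT": 6}
spans = [
    {"label": "actor", "start": 4, "end": 19},
    {"label": "plot", "start": 30, "end": 37},
    {"label": "character", "start": 49, "end": 64},
]
# Assumed offsets: [CLS] did dame judy den ##ch star in a british film
# about queen elizabeth ? [SEP]
offsets = [(0, 0), (0, 3), (4, 8), (9, 13), (14, 17), (17, 19), (20, 24),
           (25, 27), (28, 29), (30, 37), (38, 42), (43, 48), (49, 54),
           (55, 64), (64, 65), (0, 0)]
special_mask = [1] + [0] * 14 + [1]

encoded = encode_labels(offsets, special_mask, spans, label_map)
print(encoded)
```

With these assumed offsets the sketch reproduces the label sequence shown above. Tokenizers that absorb leading whitespace into a token would need extra handling that this sketch leaves out.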

Batch Encoding

For multiple annotations, use batch(), which relies on the Rust-backed encode_batch() for parallelized tokenization:

annotations = [
    {"text": "Did Dame Judy Dench star?", "spans": [{"label": "actor", "start": 4, "end": 19}]},
    {"text": "Matt Damon was Jason Bourne.", "spans": [{"label": "actor", "start": 0, "end": 10}]},
]

results = encoder.batch(annotations)
# [[-100, 0, 1, 2, 2, 2, 0, 0, -100], [-100, 1, 2, 0, 0, 0, 0, -100]]

Results are returned without padding. Use HuggingFace's DataCollatorForTokenClassification or your own padding logic for training.
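
If you write your own padding logic, a minimal sketch might look like the following. pad_labels is a hypothetical helper, and it pads with -100 on the assumption that padded positions should be ignored by the loss in the same way as special tokens:

```python
def pad_labels(batch, pad_value=-100):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in batch), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


batch = [[-100, 0, 1, 2, -100], [-100, 1, 0, -100]]
padded = pad_labels(batch)
print(padded)
# [[-100, 0, 1, 2, -100], [-100, 1, 0, -100, -100]]
```

The input IDs and attention masks must be padded to the same length separately; this sketch only covers the label side.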

The on_error parameter controls error handling:

  • "raise" (default): raise on the first error
  • "skip": skip failed annotations and return results for the successful ones

results = encoder.batch(annotations, on_error="skip")

Decoding Predictions

At inference time, convert model predictions (after argmax) back into character-offset span annotations.

From Raw Text

Use decode_text() when you have the raw text but not the Encoding object:

spans = encoder.decode_text(predicted_labels, text)
# [{"start": 4, "end": 19, "label": "actor"}, ...]

This tokenizes the text internally, then decodes.

From a Pre-built Encoding

Use decode() when you already have the tokenizers.Encoding object (avoids re-tokenizing):

encoding = encoder.tokenizer.encode(text)
spans = encoder.decode(predicted_labels, encoding, text)

Both methods return list[Span] — a list of typed dicts with start, end, and label fields.
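
For intuition, the core of IOB2 decoding can be sketched as a single pass over the labels and token offsets. This is an illustration under stated assumptions, not the library's code: decode_labels is hypothetical, the offsets are written by hand, a stray I- tag with no open entity is dropped, and labels are lowercased on the assumption that the original class names were lowercase.

```python
def decode_labels(label_ids, offsets, id_to_tag, ignore_token=-100):
    """Merge B-/I- runs into character-offset spans (illustrative sketch)."""
    spans = []
    current = None  # span being built, or None when outside an entity
    for label_id, (tok_start, tok_end) in zip(label_ids, offsets):
        tag = "O" if label_id == ignore_token else id_to_tag[label_id]
        prefix, _, name = tag.partition("-")
        label = name.lower()
        if prefix == "I" and current is not None and current["label"] == label:
            current["end"] = tok_end  # extend the open entity
            continue
        if current is not None:
            spans.append(current)  # close the open entity
            current = None
        if prefix == "B":
            current = {"start": tok_start, "end": tok_end, "label": label}
    if current is not None:
        spans.append(current)
    return spans


id_to_tag = {0: "O", 1: "B-ACTOR", 2: "I-ACTOR", 3: "B-CHARACTER",
             4: "I-CHARACTER", 5: "B-PLOT", 6: "I-PLOT"}
predicted = [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]
# Assumed offsets for "Did Dame Judy Dench star in a British film about
# Queen Elizabeth?" with [CLS] and [SEP] at (0, 0).
offsets = [(0, 0), (0, 3), (4, 8), (9, 13), (14, 17), (17, 19), (20, 24),
           (25, 27), (28, 29), (30, 37), (38, 42), (43, 48), (49, 54),
           (55, 64), (64, 65), (0, 0)]

decoded = decode_labels(predicted, offsets, id_to_tag)
print(decoded)
```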

SentencePiece whitespace handling

Tokenizers like ALBERT, XLNet, T5, and XLM-RoBERTa absorb leading whitespace into tokens (e.g., ▁Queen maps to chars (48, 54) instead of (49, 54)). The decoder corrects these offsets automatically using the original text, so the returned spans always have accurate character boundaries.
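
One way to picture this correction is a small function that advances the span start past any leading whitespace in the original text. The helper below is a hypothetical sketch; the library's actual correction logic may differ.

```python
def trim_leading_whitespace(text, start, end):
    """Move start forward while it points at whitespace inside the span."""
    while start < end and text[start].isspace():
        start += 1
    return start, end


text = "Did Dame Judy Dench star in a British film about Queen Elizabeth?"
fixed = trim_leading_whitespace(text, 48, 54)
print(fixed)                      # (49, 54)
print(text[fixed[0]:fixed[1]])    # Queen
```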

Working with Custom Data Formats

Annotation tools use different field names. Configure the encoder to match your data format:

# BioMed-NER dataset uses "entities" and "class" instead of "spans" and "label"
encoder = IOB2Encoder(
    labels=["organism", "chemicals"],
    tokenizer="bert-base-uncased",
    spans_field="entities",
    label_field="class",
)

Available field name overrides:

| Parameter | Default | Description |
|---|---|---|
| text_field | "text" | Key for the text string in batch annotation dicts |
| spans_field | "spans" | Key for the spans list in batch annotation dicts |
| start_field | "start" | Key for the start offset in span dicts |
| end_field | "end" | Key for the end offset in span dicts |
| label_field | "label" | Key for the entity label in span dicts |
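
Conceptually, these overrides amount to a key lookup on each record. The sketch below shows the equivalent manual normalization; normalize_annotation is a hypothetical helper and the record is made-up example data, so treat it as an illustration of the idea rather than what the encoder does internally.

```python
def normalize_annotation(annotation, text_field="text", spans_field="spans",
                         start_field="start", end_field="end",
                         label_field="label"):
    """Rewrite a record with custom keys into the default field names."""
    return {
        "text": annotation[text_field],
        "spans": [
            {"start": s[start_field], "end": s[end_field], "label": s[label_field]}
            for s in annotation[spans_field]
        ],
    }


# Made-up record using "entities" and "class" as field names.
record = {
    "text": "E. coli metabolizes lactose.",
    "entities": [
        {"class": "organism", "start": 0, "end": 7},
        {"class": "chemicals", "start": 20, "end": 27},
    ],
}
normalized = normalize_annotation(record, spans_field="entities", label_field="class")
print(normalized)
```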

Annotation Validation

Input annotations are validated upfront with clear error messages:

  • Negative offsets: start or end less than 0
  • Inverted spans: start >= end
  • Out-of-bounds spans: end exceeds text length
  • Overlapping spans: IOB2 does not support overlapping entities

encoder(text="Hello", spans=[{"label": "test", "start": 0, "end": 100}])
# ValueError: Span 0 ('test') extends past the text (end=100, text length=5).
# Ensure character offsets are within the text bounds.
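
The four checks listed above can be sketched in plain Python. validate_spans and is_valid are hypothetical helpers, and the error messages are illustrative rather than the library's exact wording:

```python
def validate_spans(text, spans):
    """Raise ValueError on the first invalid span (illustrative sketch)."""
    for i, span in enumerate(spans):
        start, end, label = span["start"], span["end"], span["label"]
        if start < 0 or end < 0:
            raise ValueError(f"Span {i} ({label!r}) has a negative offset.")
        if start >= end:
            raise ValueError(f"Span {i} ({label!r}) is inverted or empty.")
        if end > len(text):
            raise ValueError(
                f"Span {i} ({label!r}) extends past the text "
                f"(end={end}, text length={len(text)})."
            )
    # Overlap check: after sorting by start, each span must begin at or
    # after the end of the previous one.
    ordered = sorted(spans, key=lambda s: s["start"])
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["start"] < prev["end"]:
            raise ValueError(
                f"Spans {prev['label']!r} and {nxt['label']!r} overlap."
            )


def is_valid(text, spans):
    """Return True when validate_spans accepts the annotation."""
    try:
        validate_spans(text, spans)
    except ValueError:
        return False
    return True


print(is_valid("Hello", [{"label": "test", "start": 0, "end": 100}]))  # False
print(is_valid("Hello", [{"label": "test", "start": 0, "end": 5}]))    # True
```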

Conversion Checking

By default, every encoding is verified by recovering the entity text from the produced labels and comparing it to the original annotation. This catches tokenizer misalignment bugs early.

Disable it for production performance once you've verified correctness:

encoder = IOB2Encoder(
    labels=labels,
    tokenizer=tok,
    conversion_check=False,
)

Converting to Tensors

The encoder returns list[int], which can be converted to any tensor format:

import torch
x = torch.tensor(labels)

# or with numpy
import numpy as np
x = np.array(labels)

For batched training, the sequences are unpadded. Use HuggingFace's DataCollatorForTokenClassification to handle padding and label alignment:

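from transformers import AutoTokenizer, DataCollatorForTokenClassification

# The collator expects a transformers tokenizer, not a raw tokenizers.Tokenizer.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForTokenClassification(tokenizer=hf_tokenizer)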