Document Parsers

The quiz-gen package provides specialized parsers for extracting structured content from regulatory and legal documents.

EUR-Lex Parser

The EURLexParser is designed to parse European Union legal documents from the EUR-Lex database, extracting hierarchical structure, table of contents, and content chunks.

Overview

The EUR-Lex parser processes HTML documents and extracts:

  • Document Title: Main regulation/directive title with full citation
  • Table of Contents: Complete hierarchical structure (3-4 levels)
  • Content Chunks: Granular content units (title, citations, recitals, articles, annexes, concluding formulas)

All content is cleaned and formatted for optimal readability and downstream processing.

Features

  • ✅ Flexible hierarchy support (3-4 levels)
  • ✅ Automatic structure detection
  • ✅ Smart text cleaning (preserves lists and paragraphs)
  • ✅ Both URL and local file input
  • ✅ JSON export for chunks and TOC
  • ✅ Complete metadata preservation
  • ✅ Table extraction from annexes
  • ✅ Cross-reference tracking

Basic Usage

Parsing from URL

from quiz_gen import EURLexParser

# Initialize parser with EUR-Lex URL
url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32018R1139"
parser = EURLexParser(url=url)

# Parse document
chunks, toc = parser.parse()

# Display summary
print(f"Extracted {len(chunks)} chunks")
print(f"Document: {toc['title']}")

Parsing from Local File

from quiz_gen import EURLexParser

# Read local HTML file
with open('data/documents/regulation.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Initialize parser with HTML content
parser = EURLexParser(html_content=html_content)
chunks, toc = parser.parse()

Saving Results

# Save to JSON files
parser.save_chunks('output/chunks.json')
parser.save_toc('output/toc.json')

# Print formatted table of contents
parser.print_toc()
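
The saved files can be read back with the standard library. The following is a minimal sketch, assuming that the chunks file is a JSON array of objects shaped like the examples under Content Chunks (the exact on-disk layout is not confirmed here). `load_chunks` and `sample` are illustrative names, not part of quiz_gen.

```python
import json
import tempfile
from pathlib import Path


def load_chunks(filepath):
    """Read a chunks JSON file and return its records as a list of dicts."""
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of chunk objects")
    return data


# Stand-in for a file written by save_chunks(); the layout is an assumption.
sample = [
    {
        "section_type": "recital",
        "number": "1",
        "title": "Recital 1",
        "content": "A high and uniform level of civil aviation...",
        "hierarchy_path": ["REGULATION...", "Preamble", "Recital 1"],
        "metadata": {"id": "rct_1"},
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunks.json"
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    loaded = load_chunks(path)

print(loaded[0]["title"])
```

Working with plain dictionaries like this avoids re-parsing the HTML when only the extracted text is needed.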

Document Structure

EUR-Lex documents follow a hierarchical structure that the parser automatically detects:

Level 0: Document Title

  • Regulation/Directive full title
  • Date and reference information

Level 1: Major Sections

  • Preamble: Citations and recitals
  • Enacting Terms: Main regulatory content (chapters/articles)
  • Concluding Formulas: Signatures and adoption information
  • Annexes: Supplementary material (I-X)

Level 2: Structural Divisions

  • Citation: Combined citation paragraph
  • Recitals: Numbered recitals (chunked individually)
  • Chapters: Major content divisions

Level 3: Sub-Divisions

  • Sections: Optional subdivisions within chapters
  • Articles: Main content units (chunked individually)

Level 4: Nested Content (when sections exist)

  • Articles: Within sections

Content Chunks

The parser creates discrete chunks for the following content types:

Title

{
  "section_type": "title",
  "number": null,
  "title": "REGULATION (EU) 2018/1139...",
  "content": "Full regulation title and metadata",
  "hierarchy_path": ["REGULATION (EU) 2018/1139..."],
  "metadata": {"id": "tit_1"}
}

Citation

All citations combined into a single chunk:

{
  "section_type": "citation",
  "number": null,
  "title": "Citation",
  "content": "Having regard to...\n\nHaving regard to...",
  "hierarchy_path": ["REGULATION...", "Preamble", "Citation"],
  "metadata": {"id": "cit_1", "citation_ids": ["cit_1", "cit_2", ...]}
}

Recitals

Individual chunks for each recital:

{
  "section_type": "recital",
  "number": "1",
  "title": "Recital 1",
  "content": "A high and uniform level of civil aviation...",
  "hierarchy_path": ["REGULATION...", "Preamble", "Recital 1"],
  "metadata": {"id": "rct_1"}
}

Articles

Individual chunks for each article:

{
  "section_type": "article",
  "number": "1",
  "title": "Article 1 - Subject matter and objectives",
  "content": "1. The principal objective...\n\n2. This Regulation...",
  "hierarchy_path": ["REGULATION...", "CHAPTER I - PRINCIPLES", "Article 1..."],
  "metadata": {"id": "art_1", "subtitle": "Subject matter and objectives"}
}

Annexes

Individual chunks for each annex (including tables):

{
  "section_type": "annex",
  "number": "I",
  "title": "ANNEX I - Aircraft referred to...",
  "content": "Historic aircraft meeting...",
  "hierarchy_path": ["REGULATION...", "ANNEX I - Aircraft..."],
  "metadata": {"id": "anx_I", "subtitle": "Aircraft referred to..."}
}

Concluding Formulas

{
  "section_type": "concluding_formulas",
  "number": null,
  "title": "Concluding formulas",
  "content": "This Regulation shall be binding...",
  "hierarchy_path": ["REGULATION...", "Concluding formulas"],
  "metadata": {"id": "fnp_1"}
}
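
A quick way to sanity-check a parse is to count records per section type. The sketch below works on plain dictionaries in the shape shown above, with content fields omitted for brevity. `chunk_dicts` and `count_by_type` are illustrative names, and the sample records are made up for the example.

```python
from collections import Counter

# Chunk records in the dictionary form shown above (content omitted).
chunk_dicts = [
    {"section_type": "title", "number": None},
    {"section_type": "citation", "number": None},
    {"section_type": "recital", "number": "1"},
    {"section_type": "recital", "number": "2"},
    {"section_type": "article", "number": "1"},
    {"section_type": "annex", "number": "I"},
    {"section_type": "concluding_formulas", "number": None},
]


def count_by_type(records):
    """Return a Counter mapping section_type to number of records."""
    return Counter(r["section_type"] for r in records)


counts = count_by_type(chunk_dicts)

# Section types whose chunks carry no number (null in the JSON examples).
unnumbered = sorted(r["section_type"] for r in chunk_dicts if r["number"] is None)

print(counts)
print(unnumbered)
```

In the examples above, the title, citation, and concluding formulas chunks have a null number, while recitals, articles, and annexes are numbered.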

Advanced Usage

Filtering Chunks by Type

from quiz_gen import SectionType

# Get only articles
articles = [c for c in chunks if c.section_type == SectionType.ARTICLE]

# Get only recitals
recitals = [c for c in chunks if c.section_type == SectionType.RECITAL]

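# Get content from a specific chapter. Match the start of a path element:
# a plain 'CHAPTER I' substring test would also match CHAPTER II, III, IV and IX.
chapter_1 = [c for c in chunks if any(p.startswith('CHAPTER I -') for p in c.hierarchy_path)]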

Working with Hierarchy

# Print hierarchy for each chunk
for chunk in chunks:
    hierarchy = ' > '.join(chunk.hierarchy_path)
    print(f"{hierarchy}")

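# Find the parent of an article (a chapter, or a section when sections exist)
articles = [c for c in chunks if c.section_type == SectionType.ARTICLE]
article = articles[0]  # any article chunk
parent = article.hierarchy_path[-2] if len(article.hierarchy_path) > 1 else None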

Accessing Metadata

# Navigation IDs for linking to original document
for chunk in chunks:
    doc_id = chunk.metadata.get('id')
    # Use to construct URL: base_url + "#" + doc_id

# Article subtitles
articles = [c for c in chunks if c.section_type == SectionType.ARTICLE]
for article in articles:
    subtitle = article.metadata.get('subtitle', '')
    print(f"{article.title}: {subtitle}")
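
The URL construction mentioned in the comment above can be wrapped in a small helper. This is a sketch only: `build_anchor_url` is a hypothetical name, and whether a given EUR-Lex page resolves these fragment identifiers has not been verified here.

```python
def build_anchor_url(base_url, doc_id):
    """Append doc_id as a fragment identifier, replacing any existing fragment."""
    if not doc_id:
        # Chunks without an 'id' in their metadata link to the document itself.
        return base_url
    return base_url.split("#", 1)[0] + "#" + doc_id


base_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32018R1139"
print(build_anchor_url(base_url, "art_1"))
```

Stripping an existing fragment first keeps the helper safe to call on URLs that already point at another chunk.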

Custom Processing

# Extract specific articles by number
def get_article(chunks, number):
    for chunk in chunks:
        if chunk.section_type == SectionType.ARTICLE and chunk.number == number:
            return chunk
    return None

article_5 = get_article(chunks, "5")

# Search content
def search_chunks(chunks, query):
    return [c for c in chunks if query.lower() in c.content.lower()]

safety_chunks = search_chunks(chunks, "safety")
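
When many lookups are needed, scanning the list each time can be replaced by a dictionary built once. The sketch below uses `SimpleNamespace` stand-ins with string section types so that it runs without quiz_gen. With real chunks, the second argument would be a `SectionType` member. `index_by_number` is an illustrative helper, not a library function.

```python
from types import SimpleNamespace


def index_by_number(chunks, section_type):
    """Map chunk.number to chunk for one section type (first occurrence wins)."""
    index = {}
    for chunk in chunks:
        if chunk.section_type == section_type and chunk.number is not None:
            index.setdefault(chunk.number, chunk)
    return index


# Stand-in objects; real chunks would come from parser.parse().
demo_chunks = [
    SimpleNamespace(section_type="article", number="1",
                    title="Article 1 - Subject matter and objectives"),
    SimpleNamespace(section_type="article", number="2", title="Article 2 - Scope"),
    SimpleNamespace(section_type="recital", number="1", title="Recital 1"),
]

articles_by_number = index_by_number(demo_chunks, "article")
print(articles_by_number["2"].title)
```

Numbers are kept as strings, matching the chunk examples, so annex numerals such as "I" can be indexed the same way.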

Text Formatting

The parser applies intelligent text cleaning:

List Formatting

Original HTML:           →    Cleaned Output:
(a)                           (a) contribute to the policy

contribute to the policy      (b) facilitate the movement

(b)

facilitate the movement

Paragraph Formatting

Original HTML:           →    Cleaned Output:
1.                            1. The principal objective...

The principal objective       2. This Regulation aims to...

2.

This Regulation aims to
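
The effect shown in both examples is that a line containing only a marker is joined with the text that follows it. The sketch below reproduces that behaviour with a regular expression. It is not the parser's actual implementation, and `MARKER` and `merge_markers` are names invented for this illustration.

```python
import re

# A line holding only a list marker such as "(a)", "(iv)" or "1."
MARKER = re.compile(r"^(\([a-z]+\)|\d+\.)$")


def merge_markers(text):
    """Join each marker-only line with the next non-empty line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    merged = []
    pending = None  # marker waiting for its text
    for line in lines:
        if pending is not None:
            merged.append(f"{pending} {line}")
            pending = None
        elif MARKER.match(line):
            pending = line
        else:
            merged.append(line)
    if pending is not None:  # trailing marker with no text after it
        merged.append(pending)
    return "\n\n".join(merged)


raw = "(a)\n\ncontribute to the policy\n\n(b)\n\nfacilitate the movement"
print(merge_markers(raw))
```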

Table of Contents Structure

The TOC JSON structure:

{
  "title": "REGULATION (EU) 2018/1139 OF THE EUROPEAN PARLIAMENT...",
  "sections": [
    {
      "type": "preamble",
      "title": "Preamble",
      "children": [
        {"type": "citation", "title": "Citation"},
        {"type": "recital", "number": "1", "title": "Recital 1"},
        {"type": "recital", "number": "2", "title": "Recital 2"}
      ]
    },
    {
      "type": "enacting_terms",
      "title": "Enacting Terms",
      "children": [
        {
          "type": "chapter",
          "number": "I",
          "title": "CHAPTER I - PRINCIPLES",
          "children": [
            {"type": "article", "number": "1", "title": "Article 1 - Subject..."},
            {"type": "article", "number": "2", "title": "Article 2 - Scope"}
          ]
        },
        {
          "type": "chapter",
          "number": "III",
          "title": "CHAPTER III - SUBSTANTIVE REQUIREMENTS",
          "children": [
            {
              "type": "section",
              "number": "I",
              "title": "SECTION I - Airworthiness...",
              "children": [
                {"type": "article", "number": "9", "title": "Article 9..."}
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "concluding_formulas",
      "title": "Concluding formulas"
    },
    {
      "type": "annex",
      "number": "I",
      "title": "ANNEX I - Aircraft referred to..."
    }
  ]
}
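
Because every node may carry an optional "children" list, the TOC can be walked with a short recursive generator. The sketch below uses a reduced copy of the structure above. `iter_toc` is an illustrative helper, not part of the library.

```python
def iter_toc(nodes, depth=0):
    """Yield (depth, node) pairs in document order."""
    for node in nodes:
        yield depth, node
        # Leaf nodes have no "children" key, so default to an empty list.
        yield from iter_toc(node.get("children", []), depth + 1)


toc = {
    "title": "REGULATION (EU) 2018/1139 OF THE EUROPEAN PARLIAMENT...",
    "sections": [
        {"type": "preamble", "title": "Preamble", "children": [
            {"type": "citation", "title": "Citation"},
            {"type": "recital", "number": "1", "title": "Recital 1"},
        ]},
        {"type": "enacting_terms", "title": "Enacting Terms", "children": [
            {"type": "chapter", "number": "III",
             "title": "CHAPTER III - SUBSTANTIVE REQUIREMENTS", "children": [
                 {"type": "section", "number": "I",
                  "title": "SECTION I - Airworthiness...", "children": [
                      {"type": "article", "number": "9", "title": "Article 9..."},
                  ]},
             ]},
        ]},
    ],
}

# Indented outline, two spaces per level.
for depth, node in iter_toc(toc["sections"]):
    print("  " * depth + node["title"])

# Depth at which each article sits below the top-level sections.
article_depths = [d for d, n in iter_toc(toc["sections"]) if n["type"] == "article"]
```

The same generator handles both layouts: articles directly under a chapter and articles nested inside a section.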

Supported Document Types

The parser is optimized for EUR-Lex HTML documents:

Fully Supported

  • ✅ EU Regulations
  • ✅ EU Directives
  • ✅ EU Decisions
  • ✅ Aviation regulations (EASA)
  • ✅ Multi-level hierarchies (chapters, sections, articles)
  • ✅ Annexes (text and tables)

Required HTML Structure

Documents must contain:

  • <div class="eli-main-title">: Document title
  • <div class="eli-subdivision" id="cit_*">: Citations
  • <div class="eli-subdivision" id="rct_*">: Recitals
  • <div id="cpt_*">: Chapters
  • <div class="eli-subdivision" id="art_*">: Articles
  • <div class="eli-container" id="anx_*">: Annexes
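
A file can be checked for these markers before it is handed to the parser. The sketch below uses only the standard library's `html.parser`. `StructureProbe` and `missing_markers` are hypothetical names, and as a simplification the check looks at id prefixes on `<div>` elements without also verifying their classes.

```python
from html.parser import HTMLParser

# id prefixes taken from the list of required elements
ID_PREFIXES = ("cit_", "rct_", "cpt_", "art_", "anx_")


class StructureProbe(HTMLParser):
    """Record which EUR-Lex structural markers occur on <div> elements."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "eli-main-title" in classes:
            self.found.add("title")
        elem_id = attr.get("id") or ""
        for prefix in ID_PREFIXES:
            if elem_id.startswith(prefix):
                self.found.add(prefix)


def missing_markers(html):
    """Return the sorted list of expected markers not present in html."""
    probe = StructureProbe()
    probe.feed(html)
    expected = {"title", *ID_PREFIXES}
    return sorted(expected - probe.found)


sample_html = (
    '<div class="eli-main-title">T</div>'
    '<div class="eli-subdivision" id="art_1">A</div>'
)
print(missing_markers(sample_html))
```

A missing marker is best treated as a hint rather than a hard failure, since a short act may legitimately have no chapters or annexes.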

API Reference

EURLexParser

Main parser class for EUR-Lex documents.

from quiz_gen import EURLexParser

Constructor

EURLexParser(url: str = None, html_content: str = None)

Parameters:

  • url (str, optional): URL of a EUR-Lex document to fetch
  • html_content (str, optional): Raw HTML string to parse directly

One of url or html_content must be provided.

Methods

parse()
parse() -> tuple[List[RegulationChunk], Dict]

Parse document and return chunks and table of contents.

Returns:

  • Tuple of (chunks list, TOC dictionary)

Raises:

  • ValueError: If no content available to parse
  • HTTPError: If URL fetch fails

fetch()
fetch() -> str

Fetch HTML content from URL.

Returns:

  • HTML content string

save_chunks()
save_chunks(filepath: str) -> None

Save chunks to JSON file.

Parameters:

  • filepath (str): Output file path

save_toc()
save_toc(filepath: str) -> None

Save table of contents to JSON file.

Parameters:

  • filepath (str): Output file path

print_toc()
print_toc() -> None

Print formatted table of contents to console.

RegulationChunk

Data class representing a parsed content chunk.

from quiz_gen import RegulationChunk

Attributes

  • section_type (SectionType): Type of section
  • number (str | None): Section number
  • title (str): Full title including subtitle
  • content (str): Cleaned text content
  • hierarchy_path (list[str]): Titles from the document root down to the chunk itself (see the examples under Content Chunks)
  • metadata (dict): Additional structured data (id, subtitle, etc.)

Methods

to_dict()
to_dict() -> Dict

Convert chunk to dictionary for JSON serialization.

SectionType

Enumeration of document content types.

from quiz_gen import SectionType

Values

  • TITLE: Document title
  • PREAMBLE: Preamble section header
  • CITATION: Combined citation block
  • RECITAL: Individual recital
  • ENACTING_TERMS: Enacting terms section header
  • CHAPTER: Chapter
  • SECTION: Section within a chapter
  • ARTICLE: Article (main content unit)
  • CONCLUDING_FORMULAS: Concluding signatures
  • ANNEX: Annex

Performance Considerations

Memory Usage

  • Large regulations (500+ pages) may generate 500+ chunks
  • Each chunk typically 100-2000 bytes
  • Total memory: ~1-5 MB per regulation
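
As a rough cross-check of these figures, the raw text size can be computed directly from chunk contents. The sketch below is illustrative: `content_size_bytes` is a hypothetical helper, and it measures UTF-8 text only, so it gives a lower bound. Python object overhead, titles, hierarchy paths, and metadata add to the in-memory total.

```python
def content_size_bytes(contents):
    """Total UTF-8 size of the given text strings."""
    return sum(len(text.encode("utf-8")) for text in contents)


# Bounds implied by the figures above: 500 chunks of 100 to 2000 bytes each.
low, high = 500 * 100, 500 * 2000
print(f"{low / 1e6:.2f} MB to {high / 1e6:.2f} MB of raw text")

# With real data, pass (chunk.content for chunk in chunks) instead.
demo = ["A high and uniform level of civil aviation safety", "café"]
print(content_size_bytes(demo))
```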

Processing Time

  • Typical regulation: 2-5 seconds
  • Network fetch: 1-3 seconds (if using URL)
  • Parsing: 1-2 seconds
  • Large documents (10+ annexes): up to 10 seconds

Optimization Tips

from quiz_gen import EURLexParser

# Create a new parser instance per document
for html_file in html_files:
    with open(html_file, encoding="utf-8") as f:
        html = f.read()
    parser = EURLexParser(html_content=html)
    chunks, toc = parser.parse()
    # process chunks ...

Troubleshooting

Common Issues

Issue: Empty chunks or missing content

# Check if HTML has correct EUR-Lex structure
parser = EURLexParser(html_content=html)
if not parser.soup.find('div', class_='eli-main-title'):
    print("Not a valid EUR-Lex document")

Issue: Incorrect hierarchy

import re

# Verify chapter/section IDs match EUR-Lex format
chapters = parser.soup.find_all('div', id=re.compile(r'^cpt_'))
print(f"Found {len(chapters)} chapters")

Issue: Text formatting problems

Use --verbose on the CLI or inspect chunk.content directly to review how text was cleaned.

Examples

See the Examples page for complete working examples including:

  • Batch processing multiple documents
  • Building a searchable database
  • Generating comparative analyses
  • Creating study guides

Next Steps