Multi-Agent Quiz Generation System¶
Overview¶
The quiz generation system uses a multi-agent architecture powered by LangGraph to create high-quality quiz questions from regulatory documents. The system employs four specialized AI agents that work together to generate, judge, and validate questions, with configurable providers and models per agent.
Architecture¶
Agent Roles¶
The system consists of five agents, each with a specific responsibility:
- Conceptual Generator: Generates questions focused on theoretical understanding, definitions, and fundamental principles
- Practical Generator: Creates scenario-based questions testing real-world application of regulations
- Validator: Performs strict validation to ensure questions meet structural and content requirements
- Refiner: Fixes issues identified by the validator while preserving the original question's intent
- Judge: Makes the final decision on which questions to accept or reject for the end user
Supported providers: Anthropic, Google, Mistral, OpenAI, and Cohere.
Workflow Pipeline¶
┌─────────────────────┐
│ Input: Regulation │
│ Chunk (Article) │
└──────────┬──────────┘
│
v
┌──────┴──────┐
│ │
v v
┌────────────┐ ┌────────────┐
│Conceptual │ │Practical │ (Parallel)
│Generator │ │Generator │
└─────┬──────┘ └──────┬─────┘
│ │
└───────┬───────┘
│
v
┌───────────────────────┐
│ Validator │
│ - Format Check │
│ - Content Check │
│ - Quality Score │
│ (Both Q&As) │
└──────────┬────────────┘
│
v
┌───────────────────────┐
│ Refiner │
│ - Issues + Warnings │
│ - Preserve Intent │
└──────────┬────────────┘
│
v
┌───────────────────────┐
│ Judge │
│ - Accept Both │
│ - Accept One │
│ - Reject Both │
└──────────┬────────────┘
│
v
┌───────────────────────┐
│ Human Feedback │
│ - Accept │
│ - Reject │
│ - Request Improve │
└───────────────────────┘
Agent Details¶
Question Generators (Conceptual & Practical)¶
Models: Configurable provider/model per generator
Conceptual Generator focuses on theoretical knowledge: - Definitions and terminology - Fundamental principles - "What is" and "how is it defined" questions
Practical Generator focuses on real-world application: - Scenario-based questions - Application of rules - "What should you do" and "how would you apply" questions
Important Constraint: Questions must NOT reference regulation names, numbers, articles, or sections. Questions must be fully standalone to prevent confusion in multi-regulation scenarios.
Output Format:
{
"question": "Question text",
"options": {"A": "...", "B": "...", "C": "...", "D": "..."},
"correct_answer": "A",
"explanations": {"A": "Why correct", "B": "Why wrong", "C": "Why wrong", "D": "Why wrong"},
"difficulty": "easy|medium|hard",
"focus": "conceptual|practical"
}
Refiner¶
Purpose: Fix issues identified by the validator while preserving the original question's intent and focus.
Model: Configurable provider/model
Refinement Actions: - Fix options that are not plausible enough - Improve explanations that don't properly hint at why wrong answers are incorrect - Clarify unclear or ambiguous question wording - Complete missing or incomplete explanations - Adjust options that are too obviously wrong
Important: The refiner only runs on questions that failed validation. Questions that pass validation are returned unchanged.
Output Format: Same JSON structure as generators, with additional field:
{
...
"refinement_notes": "Brief description of what was fixed",
"refiner_model": "gpt-4o"
}
Validator¶
Purpose: Perform strict validation of question format and content requirements before refinement.
Model: Configurable provider/model
Validation Checks (run before judging):
Structural Requirements: 1. Has exactly 4 multiple choice options (A, B, C, D) 2. Has exactly one correct answer marked 3. Has explanation for all 4 options 4. Each explanation is 1-2 sentences maximum 5. Question text is clear and complete
Content Requirements: 6. Correct answer explanation confirms why it's right 7. Wrong answer explanations explain why they're wrong (act as hints) 8. All options are plausible (not obviously wrong) 9. Question is unambiguous 10. Based strictly on provided regulation content
Output Format:
{
"valid": true,
"issues": [],
"warnings": [],
"checks_passed": {
"has_4_options": true,
"has_correct_answer": true,
"has_all_explanations": true,
"explanations_concise": true,
"question_clear": true,
"correct_explanation": true,
"wrong_explanations_are_hints": true,
"options_plausible": true,
"question_unambiguous": true,
"regulation_based": true
},
"score": 10
}
Judge¶
Purpose: Make the final accept/reject decision on refined questions.
Model: Configurable provider/model
Decision Types:
- accept_both: Both questions are high quality and test different aspects
- accept_conceptual: Only the conceptual question is acceptable
- accept_practical: Only the practical question is acceptable
- reject_both: Neither question meets the required standards
Important: The judge does NOT refine questions. Questions are already refined by the Refiner agent before reaching the judge. The judge only decides which refined questions to accept.
Evaluation Criteria: - Validator's pass/fail and issues for each question (primary filter) - Whether refinement successfully addressed the issues - Accuracy: Does it correctly reflect the regulation? - Distinctiveness: Do the two questions test different skills? - Difficulty: Is it appropriate for certification level?
Output Format:
{
"decision": "accept_both|accept_conceptual|accept_practical|reject_both",
"reasoning": "Brief explanation of your decision, referencing validator results"
}
State Management¶
The workflow uses LangGraph's state management to track progress through the pipeline.
State Schema¶
{
# Input
"chunk": Dict, # Regulation chunk to process
"improvement_feedback": str, # Optional human feedback
# Generated Q&As
"conceptual_qa": Dict, # Output from conceptual generator
"practical_qa": Dict, # Output from practical generator
# Validation results (before refinement)
"validation_results": List[Dict], # Results for each question
"all_valid": bool, # Whether all passed validation
# Refined Q&As (after refinement)
"refined_conceptual_qa": Dict, # Refined conceptual question
"refined_practical_qa": Dict, # Refined practical question
# Judge output
"judge_decision": str, # accept_both|accept_conceptual|accept_practical|reject_both
"judge_reasoning": str, # Explanation (references validator results)
"judged_qas": Dict, # Final Q&As after judging
# Final output
"final_questions": List[Dict], # Questions ready for use
# Human feedback
"human_feedback": str, # Improvement suggestions
"human_action": str, # accept|reject|improve
# Status
"current_step": str, # Current workflow step
"errors": List[str] # Any errors encountered
}
Configuration¶
Agent Configuration¶
All agents are configured through the AgentConfig class:
from quiz_gen.agents.config import AgentConfig
config = AgentConfig(
# Provider + Model Selection per agent
conceptual_provider="openai",
practical_provider="anthropic",
validator_provider="google",
refiner_provider="openai",
judge_provider="mistral",
conceptual_model="gpt-4o",
practical_model="claude-sonnet-4-20250514",
validator_model="gemini-2.5-flash",
refiner_model="gpt-4o",
judge_model="mistral-large-latest",
# Provider-Specific Settings
anthropic_max_tokens=4096, # Required by Anthropic API (default: 4096)
# Workflow Settings
auto_accept_valid=False,
save_intermediate_results=True,
output_directory="data/quizzes",
# Validation Settings
min_validation_score=6,
strict_validation=True,
)
# Note: Temperature is not configured and provider defaults are used.
Using Cohere¶
Cohere uses its own SDK:
config = AgentConfig(
# Use Cohere for some agents (uses COHERE_API_KEY)
conceptual_provider="cohere",
conceptual_model="command-r-plus-08-2024", # Cohere model name
# Can still use other providers
validator_provider="openai",
judge_provider="anthropic", # Regular Anthropic/Claude
)
Environment Variables¶
The system automatically loads configuration from a .env file:
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
MISTRAL_API_KEY=your_mistral_key
GEMINI_API_KEY=your_google_key # Google Gemini models
COHERE_API_KEY=your_cohere_key
Usage¶
Basic Usage¶
from quiz_gen.agents.workflow import QuizGenerationWorkflow
from quiz_gen.agents.config import AgentConfig
# Initialize
config = AgentConfig()
workflow = QuizGenerationWorkflow(config)
# Process a single chunk
chunk = {
"section_type": "article",
"number": "47",
"title": "Article 47 - Delegated powers",
"content": "...",
"hierarchy_path": ["Article 47"]
}
result = workflow.run(chunk)
Batch Processing¶
# Process multiple chunks
chunks = load_chunks_from_file("data/processed/regulation.json")
results = workflow.run_batch(chunks, save_output=True)
With Improvement Feedback¶
# First attempt
result = workflow.run(chunk)
# If human requests improvements
result = workflow.run(
chunk,
improvement_feedback="Make the question more specific to drones"
)
Human-in-the-Loop¶
The workflow includes a placeholder for human feedback integration. In production, this connects to a UI where domain experts can:
- Review generated questions: See both conceptual and practical questions with all options and explanations
- Accept questions: Approve high-quality questions for the quiz database
- Reject questions: Discard questions that don't meet standards
- Request improvements: Provide specific feedback that will be incorporated in the next generation cycle
Human Feedback Loop¶
When human feedback is "improve", the workflow: 1. Captures the improvement suggestions 2. Routes back to the generation stage 3. Passes feedback to both generators 4. Generators incorporate the feedback into their prompts 5. New questions are generated, judged, and validated 6. Process repeats until accepted or rejected
Output¶
Question Output Format¶
Each validated question includes:
{
"question": "Full question text with scenario if practical",
"options": {
"A": "Option A text",
"B": "Option B text",
"C": "Option C text",
"D": "Option D text"
},
"correct_answer": "A",
"explanations": {
"A": "Explanation of why A is correct",
"B": "Hint about why B is wrong",
"C": "Hint about why C is wrong",
"D": "Hint about why D is wrong"
},
"source_reference": "Regulation X > Section Y > Article Z",
"difficulty": "medium",
"focus": "conceptual",
"generator": "conceptual",
"model": "gpt-4o"
}
Note: The source_reference field is added by the workflow (not by agents) from the chunk's hierarchy_path, formatted as elements joined with " > ".
Saved Results¶
Results are saved to JSON files in the configured output directory:
{
"chunk": {...},
"questions": [...],
"judge_decision": "accept_both",
"judge_reasoning": "...",
"validation_results": [...],
"all_valid": true,
"errors": []
}
Error Handling¶
The workflow includes comprehensive error handling:
- Generation failures: Captured and logged in the errors list, workflow continues
- Validation failures: Questions that fail validation are not accepted
- State errors: Tracked in the errors list for debugging
Errors are accumulated in the state and can be reviewed in the final output.
Note: Retry logic for API errors is configured in AgentConfig but not currently implemented in the workflow.
Performance Considerations¶
API Calls per Chunk¶
For each regulation chunk: - 2 parallel generation calls (conceptual + practical) - 2 validation calls (one for each generated question) - 2 refinement calls (one for each question, skipped if validation passes) - 1 judge call
Total: Up to 7 API calls per chunk (5 if both questions pass validation)
Cost Optimization¶
- Use parallel generation to minimize latency
- Validator prevents low-quality questions from requiring regeneration
- Judge reduces redundant questions
- Batch processing amortizes initialization overhead
Caching¶
The workflow uses LangGraph's MemorySaver checkpoint to enable:
- Resume interrupted workflows
- Review historical decisions
- Debug state transitions
Extension Points¶
The system is designed to be extensible:
- Add new agent types: Implement new generators with different focuses
- Custom validation rules: Extend the Validator with domain-specific checks
- Alternative models: Configure different LLMs for each role
- Custom workflows: Modify the LangGraph pipeline for different processes
- Storage backends: Implement custom storage for questions (database, cloud)
Best Practices¶
- Analyze question diversity: Ensure conceptual and practical questions are distinct
- Track validation scores: Identify common failure patterns
- Review judge decisions: Monitor acceptance rates to tune generator prompts
- Collect human feedback: Use feedback to improve generator prompts over time
- Version control prompts: Track prompt changes and their impact on quality
- Monitor API costs: Set budgets and rate limits appropriately
- Test with sample data: Validate workflow before processing large batches
See Also¶
- Parsers Documentation - How to parse regulation documents