System Architecture

How the Radiology Code Semantic Cleaner Works

🏗️ System Overview

The Radiology Code Semantic Cleaner is an AI-powered system that standardizes radiology exam names. Its core is a two-stage semantic matching pipeline: retrieval followed by reranking.

🎯 Primary Goal: Transform inconsistent radiology exam names into standardized NHS reference terms with high accuracy and confidence scoring.
Input: "CT CHEST W/ CONTRAST"
  ↓ 1. Preprocessing & Component Extraction
  ↓ 2. Semantic Retrieval (BioLORD/Default)
  ↓ 3. Candidate Reranking (MedCPT/OpenRouter)
  ↓ 4. Confidence Scoring & Final Selection
Output: "CT Chest with IV Contrast" (95% confidence)

🔧 Technical Architecture

Frontend (Client-Side)

  • Framework: Pure JavaScript with ES6 modules
  • UI Components: React-like components using createElement
  • State Management: Local state with localStorage persistence
  • Styling: Unified CSS design system
  • File Processing: Client-side JSON parsing and validation

Backend (Server-Side)

  • Framework: Flask (Python)
  • Deployment: Render.com with auto-scaling
  • Storage: Cloudflare R2 for configuration and results
  • Processing: Multi-threaded batch processing
  • APIs: RESTful endpoints with CORS support

AI/ML Components

🔍 Retrieval Models

BioLORD: Biomedical language representation model optimized for medical terminology

🎯 Reranking Models

MedCPT: Medical cross-encoder via HuggingFace API

OpenRouter: GPT-4, Claude, Gemini via unified API

🧠 Processing Pipeline

Stage 1: Preprocessing

  1. Text Normalization: Clean and standardize input text
  2. Component Extraction: Identify anatomy, modality, contrast, laterality
  3. Context Detection: Recognize gender, age, and clinical context
  4. Abbreviation Expansion: Convert common medical abbreviations
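
The steps above can be sketched in Python. The abbreviation table and modality list here are illustrative stand-ins, not the service's real lookup data:

```python
import re

# Hypothetical lookup tables -- the real ones live in the service configuration.
ABBREVIATIONS = {"W/": "WITH", "W/O": "WITHOUT"}
MODALITIES = {"CT", "MRI", "US", "XR"}

def preprocess(exam_name: str) -> dict:
    """Normalize the input text and extract coarse components (illustrative only)."""
    text = re.sub(r"\s+", " ", exam_name.strip().upper())
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split(" ")]
    return {
        "modality": next((t for t in tokens if t in MODALITIES), None),
        "contrast": "WITH" in tokens and "WITHOUT" not in tokens,
        "text": " ".join(tokens),
    }

print(preprocess("CT CHEST W/ CONTRAST"))
# {'modality': 'CT', 'contrast': True, 'text': 'CT CHEST WITH CONTRAST'}
```

A real implementation would also handle laterality and clinical context, but the normalize-then-tokenize shape is the same.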

Stage 2: Semantic Retrieval

  1. Embedding Generation: Convert input to vector representation
  2. Similarity Search: Find top candidates from NHS database
  3. Filtering: Apply modality and context filters
  4. Candidate Selection: Return top N most similar matches
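
A minimal sketch of the similarity search, using toy 3-dimensional vectors in place of real BioLORD embeddings:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve_top_n(query_vec, reference, n=3):
    """Rank reference entries by cosine similarity to the query embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in reference.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:n]

# Toy "embeddings"; real vectors are produced by the retrieval model.
reference = {
    "CT Chest with IV Contrast": [0.9, 0.1, 0.3],
    "MRI Head": [0.1, 0.9, 0.2],
    "CT Abdomen": [0.7, 0.2, 0.1],
}
top = retrieve_top_n([0.95, 0.05, 0.25], reference, n=2)
print(top[0][0])  # "CT Chest with IV Contrast"
```

In production the search runs over the full NHS reference database, typically with an indexed vector store rather than a brute-force scan.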

Stage 3: Intelligent Reranking

  1. Cross-Encoder Scoring: Deep semantic understanding of query-candidate pairs
  2. Component Alignment: Weight matching of extracted components
  3. Medical Logic: Apply domain-specific scoring rules
  4. Final Ranking: Combine scores with weighted fusion
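
The weighted fusion in step 4 amounts to a linear combination of the stage scores. The weights below are invented for the example; the real values come from the YAML configuration:

```python
def fuse_scores(retriever_score, reranker_score, component_bonus,
                w_retriever=0.3, w_reranker=0.6, w_components=0.1):
    """Combine per-stage scores into one ranking score (weights are illustrative)."""
    return (w_retriever * retriever_score
            + w_reranker * reranker_score
            + w_components * component_bonus)

# A candidate that both stages like, with fully matching components:
print(fuse_scores(0.90, 0.95, 1.0))  # 0.94
```

Giving the cross-encoder the largest weight reflects the pipeline design: the reranker sees the query-candidate pair jointly, so its judgment usually dominates.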

Stage 4: Post-Processing

  1. Confidence Calculation: Generate reliability scores
  2. SNOMED Mapping: Link to standardized terminology
  3. Quality Validation: Apply consistency checks
  4. Result Formatting: Structure output for consumption
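
One plausible shape for the confidence calculation in step 1 is to scale the top fused score by its margin over the runner-up, so that close calls score lower. This is a sketch, not the production formula:

```python
def confidence(candidate_scores):
    """Illustrative confidence: reward a clear winner, penalize close calls."""
    ranked = sorted(candidate_scores, reverse=True)
    if not ranked:
        return 0.0
    if len(ranked) == 1:
        return ranked[0]
    margin = min(ranked[0] - ranked[1], 0.5)  # cap the margin bonus
    return min(1.0, ranked[0] * (0.5 + margin))

print(confidence([0.94, 0.40, 0.31]))  # clear winner -> high confidence
print(confidence([0.94, 0.93, 0.31]))  # near tie -> much lower confidence
```

Whatever the exact formula, tying confidence to the gap between candidates is what lets downstream consumers route ambiguous results to human review.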

📊 Model Comparison

Reranker Model Characteristics

Model  | Type          | Strengths                                | Best Use Case
MedCPT | Cross-Encoder | Medical domain expertise, high accuracy  | Production processing, maximum accuracy
GPT-4  | LLM           | Contextual understanding, reasoning      | Complex cases, edge scenarios
Claude | LLM           | Careful analysis, detailed explanations  | Quality review, explanation generation
Gemini | LLM           | Fast processing, good balance            | High-volume processing, cost efficiency

🔄 Data Flow

Input Processing

  1. User uploads JSON file via drag-and-drop interface
  2. Client validates file format and structure
  3. Data sent to backend via batch API endpoint
  4. Server processes in chunks of 10 with progress tracking
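
The chunking in step 4 amounts to a simple slice loop:

```python
def chunked(items, size=10):
    """Split a batch into fixed-size chunks, matching the server's chunk size of 10."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

exams = [f"exam-{i}" for i in range(25)]
chunks = list(chunked(exams))
print(len(chunks))  # 3 chunks: 10 + 10 + 5
```

After each chunk completes, the server can update the progress file, which is what drives the real-time progress bar on the client.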

Storage & Caching

  • Configuration: YAML files stored in Cloudflare R2
  • Results: Large datasets stored in R2 with public URLs
  • Progress: Real-time progress files for batch tracking
  • Cache: NHS reference data cached for performance

API Endpoints

  • /health - System health check
  • /models - Available AI models and status
  • /parse_enhanced - Single exam processing
  • /parse_batch - Batch processing with progress
  • /batch_progress/{id} - Progress tracking
  • /config/current - Configuration management

⚙️ Configuration System

Dynamic Configuration

The system uses YAML configuration files stored in Cloudflare R2 for:

  • Model Weights: Retriever vs reranker score balancing
  • Confidence Thresholds: Quality control parameters
  • Component Scoring: Anatomy, modality, contrast weights
  • NHS Reference Data: Source mappings and filtering rules
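
A hypothetical fragment of such a YAML file. The key names here are invented for illustration; the actual schema is defined by the deployed service:

```yaml
# Illustrative only -- real key names come from the deployed configuration.
model_weights:
  retriever: 0.3
  reranker: 0.6
confidence_thresholds:
  accept: 0.90
  review: 0.75
component_scoring:
  anatomy: 1.5
  modality: 1.0
  contrast: 0.8
```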

Real-Time Updates

Configuration changes trigger automatic cache rebuilding, allowing for:

  • A/B testing of different model configurations
  • Performance tuning based on accuracy metrics
  • Domain-specific customization for different hospitals
  • Rapid deployment of improvements without code changes

⚠️ Important: Configuration changes affect all subsequent processing. Always test changes with the 100-exam test suite before production use.

⚡ Performance & Scalability

Optimization Strategies

  • Batch Processing: Process multiple exams concurrently
  • Chunked Processing: Break large datasets into manageable pieces
  • Progress Tracking: Real-time feedback for long-running jobs
  • Model Caching: Pre-load embeddings and models
  • Result Streaming: Write results to disk as they're processed
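
Concurrent batch processing with input order preserved might look like the following; clean_exam is a stand-in for the real per-exam pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def clean_exam(name):
    """Stand-in for the per-exam pipeline (preprocess, retrieve, rerank)."""
    return name.strip().upper()

def process_batch(exams, max_workers=4):
    """Run exams through the pipeline concurrently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_exam, exams))

print(process_batch([" ct chest ", "mri head"]))
# ['CT CHEST', 'MRI HEAD']
```

Threads suit this workload because each exam spends most of its time waiting on model API calls, not on CPU work.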

Scalability Features

  • Auto-scaling: Render.com handles traffic spikes
  • CDN Distribution: Cloudflare for global performance
  • Async Processing: Non-blocking operations where possible
  • Resource Management: Memory-efficient processing

🔐 Security & Compliance

Data Protection

  • HTTPS: All communications encrypted in transit
  • CORS: Restricted cross-origin access
  • Input Validation: Sanitization of all user inputs
  • No PHI Storage: No protected health information is stored

Healthcare Compliance

  • Audit Logging: All processing activities logged
  • Version Control: Configuration changes tracked
  • Quality Assurance: Built-in testing frameworks
  • Access Control: Configuration editing restricted

🔗 Integration Points

External APIs

  • HuggingFace: MedCPT model inference
  • OpenRouter: LLM access (GPT, Claude, Gemini)
  • Cloudflare R2: Configuration and result storage
  • NHS TRUD: Reference terminology source

Data Sources

  • NHS Reference Data: Standardized exam terminology
  • SNOMED CT: Medical concept identifiers
  • Local Mappings: Hospital-specific terminology
  • Training Data: Validated exam name pairs