CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

cluster4npu is a high-performance multi-stage inference pipeline system for Kneron NPU dongles. The project enables flexible single-stage and cascaded multi-stage AI inference workflows optimized for real-time video processing and high-throughput scenarios.

Core Architecture

  • InferencePipeline: Main orchestrator managing multi-stage workflows with automatic queue management and thread coordination
  • MultiDongle: Hardware abstraction layer for Kneron NPU devices (KL520, KL720, etc.)
  • StageConfig: Configuration system for individual pipeline stages
  • PipelineData: Data structure that flows through pipeline stages, accumulating results
  • PreProcessor/PostProcessor: Flexible data transformation components for inter-stage processing
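The role of PipelineData — a payload that flows through the stages while per-stage results accumulate — can be sketched with a minimal stdlib analogue (class and field names here are illustrative assumptions, not the cluster4npu API):

```python
from dataclasses import dataclass, field

@dataclass
class FrameData:
    """Illustrative analogue of PipelineData: the payload travels through
    every stage, and each stage appends its result under its stage_id."""
    payload: bytes
    stage_results: dict = field(default_factory=dict)

    def add_result(self, stage_id: str, result) -> None:
        # Later stages can read earlier results, e.g. detection boxes.
        self.stage_results[stage_id] = result

d = FrameData(payload=b"frame")
d.add_result("detect", [("person", 0.9)])
d.add_result("classify", "adult")
print(sorted(d.stage_results))  # ['classify', 'detect']
```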

Key Design Patterns

  • Producer-Consumer: Each stage runs in separate threads with input/output queues
  • Pipeline Architecture: Linear data flow through configurable stages with result accumulation
  • Hardware Abstraction: MultiDongle encapsulates Kneron SDK complexity
  • Callback-Based: Asynchronous result handling via configurable callbacks
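The producer-consumer pattern above can be sketched with the stdlib alone — each stage is a worker thread joined to its neighbors by bounded queues (a simplified model, not the actual InferencePipeline internals):

```python
import queue
import threading

def run_stage(stage_fn, in_q, out_q):
    """Worker loop for one stage: pull, process, push.
    A None item is used as a shutdown sentinel and is propagated downstream."""
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        out_q.put(stage_fn(item))

# Two cascaded stages, each in its own thread, linked by bounded queues.
q_in, q_mid, q_out = (queue.Queue(maxsize=8) for _ in range(3))
threads = [
    threading.Thread(target=run_stage, args=(lambda x: x * 2, q_in, q_mid)),
    threading.Thread(target=run_stage, args=(lambda x: x + 1, q_mid, q_out)),
]
for t in threads:
    t.start()
for frame in [1, 2, 3]:
    q_in.put(frame)
q_in.put(None)  # signal shutdown

results = []
while (r := q_out.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
print(results)  # [3, 5, 7]
```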

Development Commands

Environment Setup

# Setup virtual environment with uv
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

Running Examples

# Single-stage pipeline
uv run python src/cluster4npu/test.py --example single

# Two-stage cascade pipeline  
uv run python src/cluster4npu/test.py --example cascade

# Complex multi-stage pipeline
uv run python src/cluster4npu/test.py --example complex

# Basic MultiDongle usage
uv run python src/cluster4npu/Multidongle.py

# Complete UI application with full workflow
uv run python UI.py

# UI integration examples
uv run python ui_integration_example.py

# Test UI configuration system
uv run python ui_config.py

UI Application Workflow

UI.py provides a complete visual workflow:

  1. Dashboard/Home - Main entry point with recent files
  2. Pipeline Editor - Visual node-based pipeline design
  3. Stage Configuration - Dongle allocation and hardware setup
  4. Performance Estimation - FPS calculations and optimization
  5. Save & Deploy - Export configurations and cost estimation
  6. Monitoring & Management - Real-time pipeline monitoring

# Access different workflow stages directly:
# 1. Create new pipeline → Pipeline Editor
# 2. Configure Stages & Deploy → Stage Configuration
# 3. Pipeline menu → Performance Analysis → Performance Panel
# 4. Pipeline menu → Deploy Pipeline → Save & Deploy Dialog

Testing

# Run pipeline tests
uv run python test_pipeline.py

# Test MultiDongle functionality
uv run python src/cluster4npu/test.py

Hardware Requirements

  • Kneron NPU dongles: KL520, KL720, etc.
  • Firmware files: fw_scpu.bin, fw_ncpu.bin
  • Models: .nef format files
  • USB ports: Multiple ports required for multi-dongle setups

Critical Implementation Notes

Pipeline Configuration

  • Each stage requires unique stage_id and dedicated port_ids
  • Queue sizes (max_queue_size) must be balanced between memory usage and throughput
  • Stages process sequentially - output from stage N becomes input to stage N+1
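These rules can be sketched as a two-stage cascade (a non-runnable fragment reusing the StageConfig fields from Code Patterns below; model file names and port numbers are placeholders):

```python
detect = StageConfig(
    stage_id="detect",            # unique per stage
    port_ids=[28],
    scpu_fw_path="fw_scpu.bin",
    ncpu_fw_path="fw_ncpu.bin",
    model_path="detector.nef",
    upload_fw=True,
)
classify = StageConfig(
    stage_id="classify",
    port_ids=[32],                # must not overlap another stage's ports
    scpu_fw_path="fw_scpu.bin",
    ncpu_fw_path="fw_ncpu.bin",
    model_path="classifier.nef",
    upload_fw=True,
)

# List order defines the flow: detect's output becomes classify's input.
pipeline = InferencePipeline([detect, classify])
```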

Thread Safety

  • All pipeline operations are thread-safe
  • Each stage runs in isolated worker threads
  • Use callbacks for result handling, not direct queue access
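Why callbacks instead of direct queue access can be illustrated with a minimal stdlib sketch (ResultDispatcher is a hypothetical stand-in, not the cluster4npu API): a worker thread drains an internal queue and invokes the registered callback, so callers never touch the queue themselves.

```python
import queue
import threading

class ResultDispatcher:
    """Minimal sketch of callback-based result delivery: the internal queue
    stays private; consumers only register a callback."""
    def __init__(self):
        self._q = queue.Queue()
        self._cb = lambda result: None
        self._worker = threading.Thread(target=self._drain, daemon=True)

    def set_result_callback(self, cb):
        self._cb = cb

    def start(self):
        self._worker.start()

    def _drain(self):
        # None is the shutdown sentinel.
        while (item := self._q.get()) is not None:
            self._cb(item)

    def push(self, result):
        self._q.put(result)

    def stop(self):
        self._q.put(None)
        self._worker.join()

received = []
d = ResultDispatcher()
d.set_result_callback(received.append)
d.start()
for r in ("det_a", "det_b"):
    d.push(r)
d.stop()
print(received)  # ['det_a', 'det_b']
```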

Data Flow

Input → Stage1 → Stage2 → ... → StageN → Output
  ↓        ↓        ↓               ↓        ↓
Queue   Process  Process   ...  Process   Result
        +Results +Results       +Results  Callback

Hardware Management

  • Always call initialize() before start()
  • Always call stop() for clean shutdown
  • Firmware upload (upload_fw=True) only needed once per session
  • Port IDs must match actual USB connections

Error Handling

  • Pipeline continues on individual stage errors
  • Failed stages return error results rather than blocking
  • Comprehensive statistics available via get_pipeline_statistics()
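The "failed stages return error results" behavior can be sketched with a generic wrapper (a simplified model of per-stage error isolation, not the library's internals):

```python
def safe_stage(stage_fn, item):
    """Sketch of per-stage error isolation: a failing stage yields an
    error result instead of raising, so the pipeline keeps flowing."""
    try:
        return {"ok": True, "result": stage_fn(item)}
    except Exception as exc:
        return {"ok": False, "error": repr(exc)}

# The division by zero becomes an error result; later items still process.
results = [safe_stage(lambda x: 10 // x, v) for v in (5, 0, 2)]
print([r["ok"] for r in results])  # [True, False, True]
```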

UI Application Architecture

Complete Workflow Components

  • DashboardLogin: Main entry point with project management
  • PipelineEditor: Node-based visual pipeline design using NodeGraphQt
  • StageConfigurationDialog: Hardware allocation and dongle assignment
  • PerformanceEstimationPanel: Real-time performance analysis and optimization
  • SaveDeployDialog: Export configurations and deployment cost estimation
  • MonitoringDashboard: Live pipeline monitoring and cluster management

UI Integration System

  • ui_config.py: Configuration management and UI/core integration
  • ui_integration_example.py: Demonstrates conversion from UI to core tools
  • UIIntegration class: Bridges UI configurations to InferencePipeline

Key UI Features

  • Auto-dongle allocation: Smart assignment of dongles to pipeline stages
  • Performance estimation: Real-time FPS and latency calculations
  • Cost analysis: Hardware and operational cost projections
  • Export formats: Python scripts, JSON configs, YAML, Docker containers
  • Live monitoring: Real-time metrics and cluster scaling controls

Code Patterns

Basic Pipeline Setup

config = StageConfig(
    stage_id="unique_name",
    port_ids=[28, 32],
    scpu_fw_path="fw_scpu.bin", 
    ncpu_fw_path="fw_ncpu.bin",
    model_path="model.nef",
    upload_fw=True
)

pipeline = InferencePipeline([config])
pipeline.initialize()
pipeline.set_result_callback(callback_func)  # register before start so no results are missed
pipeline.start()
# ... processing ...
pipeline.stop()

Inter-Stage Processing

# Custom preprocessing for stage input
preprocessor = PreProcessor(resize_fn=custom_resize_func)

# Custom postprocessing for stage output  
postprocessor = PostProcessor(process_fn=custom_process_func)

config = StageConfig(
    # ... basic config ...
    input_preprocessor=preprocessor,
    output_postprocessor=postprocessor
)

Performance Considerations

  • Queue Sizing: Smaller queues = lower latency, larger queues = higher throughput
  • Dongle Distribution: Spread dongles across stages for optimal parallelization
  • Processing Functions: Keep preprocessors/postprocessors lightweight
  • Memory Management: Monitor queue sizes to prevent memory buildup
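The queue-sizing tradeoff can be demonstrated with a bounded stdlib queue: once full, producers block (or time out) instead of growing memory without limit, which is the backpressure that keeps memory bounded at the cost of throughput headroom.

```python
import queue

# Small maxsize → low latency and bounded memory; larger maxsize → better
# throughput under bursty load, at the cost of memory and added latency.
q = queue.Queue(maxsize=2)
q.put("frame1")
q.put("frame2")
try:
    q.put("frame3", timeout=0.05)  # would exceed the bound
    overflowed = False
except queue.Full:
    overflowed = True
print(overflowed)  # True
```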