forked from masonhuang/cluster4npu
Key improvements:
- Add timeout mechanism (2s) for result ordering to prevent slow devices from blocking pipeline
- Implement performance-biased load balancing with 2x penalty for low-GOPS devices (< 10 GOPS)
- Adjust KL520 GOPS from 3 to 2 for more accurate performance representation
- Remove KL540 references to focus on available hardware
- Add intelligent sequence skipping with timeout results for better throughput

This resolves the issue where multi-series mode had lower FPS than single KL720 due to KL520 devices creating bottlenecks in the result ordering queue.

Performance impact:
- Reduces KL520 task allocation from ~12.5% to ~5-8%
- Prevents pipeline stalls from slow inference results
- Maintains result ordering integrity with timeout fallback

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
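
Illustrative sketch (not the code from this commit): the snippet below shows one way the performance-biased load balancing and the timeout-based result ordering described above could work. The names `allocation_shares`, `OrderedResultBuffer`, and `TIMEOUT` are hypothetical. Halving the weight of devices below 10 GOPS is only one reading of "2x penalty". The 2s timeout, the 10 GOPS threshold, and the KL520 value of 2 GOPS come from the message above. The KL720 figure used in the usage note is a placeholder assumption.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

LOW_GOPS_THRESHOLD = 10.0  # from the commit message: devices below 10 GOPS are penalized
LOW_GOPS_PENALTY = 2.0     # from the commit message: 2x penalty (applied here as a divisor, an assumption)
TIMEOUT = object()         # hypothetical sentinel emitted in place of a skipped result


def allocation_shares(device_gops: Dict[str, float]) -> Dict[str, float]:
    """Return each device's fraction of tasks, biased against low-GOPS devices."""
    weights = {
        name: (gops / LOW_GOPS_PENALTY if gops < LOW_GOPS_THRESHOLD else gops)
        for name, gops in device_gops.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


class OrderedResultBuffer:
    """Emit results in sequence order, skipping a missing sequence after a timeout."""

    def __init__(self, timeout_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pending: Dict[int, Any] = {}
        self._next_seq = 0
        self._timeout_s = timeout_s
        self._clock = clock
        self._blocked_since = None  # time at which the head of the queue became blocked

    def _head_missing(self) -> bool:
        # True when later results are waiting behind a sequence that has not arrived.
        return bool(self._pending) and self._next_seq not in self._pending

    def put(self, seq: int, result: Any) -> None:
        if seq < self._next_seq:
            return  # late result for a sequence that was already emitted or skipped
        self._pending[seq] = result
        if self._blocked_since is None and self._head_missing():
            self._blocked_since = self._clock()

    def drain(self) -> List[Tuple[int, Any]]:
        """Return every result that can be emitted now, in sequence order."""
        out: List[Tuple[int, Any]] = []
        while self._pending:
            if self._next_seq in self._pending:
                out.append((self._next_seq, self._pending.pop(self._next_seq)))
            elif self._clock() - self._blocked_since >= self._timeout_s:
                out.append((self._next_seq, TIMEOUT))  # give up on the slow result
            else:
                break  # still within the timeout window for the missing sequence
            self._next_seq += 1
            # Restart the wait if the new head is also missing, otherwise clear it.
            self._blocked_since = self._clock() if self._head_missing() else None
        return out
```

For example, `allocation_shares({"KL520": 2.0, "KL720": 28.0})` assigns the KL520 a weight of 1 against 28 for the KL720 under these assumed figures. In the buffer, a result that arrives after its sequence number has been skipped is dropped, which keeps the output order intact at the cost of losing that frame's real result.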