mlx-examples/llms/decilm/tests
5adbd358b5 Add DeciLM/Nemotron-NAS architecture support for MLX
This commit introduces native MLX support for DeciLM models, including NVIDIA's
Nemotron series that use Neural Architecture Search (NAS) optimizations.

Key features:
- Support for dummy layers (no-op attention/FFN components)
- FFN fusion for improved efficiency
- Variable Grouped Query Attention (VGQA), allowing a different number of KV heads per layer
- Block configuration handling for NAS architectures
- Full conversion pipeline from HuggingFace to MLX format
- Comprehensive test suite
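To illustrate how a per-block NAS configuration can drive layer construction, here is a minimal Python sketch. The field names (`no_op_attention`, `no_op_ffn`, `n_kv_heads`) and the string placeholders are hypothetical, not the actual DeciLM config keys; the real implementation builds MLX modules rather than labels.

```python
# Hypothetical sketch: per-block configs select dummy (identity) sub-blocks
# and a per-layer KV head count. Names are assumptions for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class BlockConfig:
    no_op_attention: bool = False  # dummy layer: attention becomes identity
    no_op_ffn: bool = False        # dummy layer: FFN becomes identity
    n_kv_heads: int = 8            # VGQA: KV head count can vary per layer

def build_layers(configs: List[BlockConfig]) -> List[dict]:
    """Map each block config to the sub-blocks it would instantiate."""
    layers = []
    for cfg in configs:
        layers.append({
            # A no-op component simply passes the residual stream through.
            "attention": "identity" if cfg.no_op_attention
                         else f"gqa(kv_heads={cfg.n_kv_heads})",
            "ffn": "identity" if cfg.no_op_ffn else "swiglu",
        })
    return layers

configs = [
    BlockConfig(n_kv_heads=8),
    BlockConfig(no_op_attention=True),          # NAS pruned this attention
    BlockConfig(no_op_ffn=True, n_kv_heads=2),  # fewer KV heads here
]
print(build_layers(configs))
```

The key point is that the layer list is heterogeneous: a standard transformer loader assumes every block has the same shape, which is why these models previously failed to load.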

Tested with:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (Q5: 3.86 tokens/sec on M3 Ultra)
- Memory usage: ~175GB peak for 253B model
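A back-of-envelope check of the quoted figure (an approximation that ignores embedding precision, quantization metadata, and runtime buffers):

```python
# Rough sanity check on the ~175GB peak for the Q5 253B model.
params = 253e9        # 253B parameters
bits_per_weight = 5   # Q5 quantization
weight_gb = params * bits_per_weight / 8 / 1e9
print(round(weight_gb))  # weights alone: ~158 GB
```

The quantized weights account for roughly 158 GB; the KV cache, activations, and scale/zero-point metadata plausibly make up the remainder of the ~175 GB peak.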

This enables running massive NAS-optimized models on Apple Silicon that were
previously incompatible with MLX due to their unique architecture.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-02 05:59:09 +02:00