DeciLM / Nemotron-NAS Support for MLX

This module provides native MLX support for DeciLM architecture models, including NVIDIA's Nemotron series. DeciLM models are produced with Neural Architecture Search (NAS), which replaces the uniform layer structure of a standard Llama-style transformer with per-layer variations chosen for better inference throughput and memory efficiency.

Architecture Features

NAS optimization changes the architecture in three main ways (a configuration sketch follows the list):

  1. Dummy Layers: Layers where attention or FFN components are completely removed
  2. FFN Fusion: Multiple sequential FFN layers fused into wider parallel layers
  3. Variable Grouped Query Attention (VGQA): Different number of KV heads per layer (1-8)
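
The per-layer variation is driven by a block configuration list with one entry per transformer layer. Below is a minimal sketch of how such a configuration can be represented and queried; the field names (attention_no_op, ffn_no_op, n_heads_in_group, ffn_mult) are assumptions modelled on the HuggingFace DeciLM config format and may not match the exact names used in decilm.py.

from dataclasses import dataclass
from typing import Optional

@dataclass
class BlockConfig:
    attention_no_op: bool = False            # dummy attention: the layer has no attention block
    ffn_no_op: bool = False                  # dummy FFN: the layer has no MLP block
    n_heads_in_group: Optional[int] = None   # VGQA: query heads per KV head (None if attention is a no-op)
    ffn_mult: Optional[float] = None         # FFN width multiplier (larger for fused layers)

def kv_heads_for_layer(block: BlockConfig, n_query_heads: int) -> Optional[int]:
    # Per-layer KV head count; dummy-attention layers contribute no KV cache at all
    if block.attention_no_op:
        return None
    return n_query_heads // block.n_heads_in_group

# Illustrative three-layer configuration mixing all three NAS features
blocks = [
    BlockConfig(n_heads_in_group=8, ffn_mult=2.625),    # standard attention + FFN layer
    BlockConfig(attention_no_op=True, ffn_mult=5.25),   # dummy attention, wider (fused) FFN
    BlockConfig(n_heads_in_group=16, ffn_no_op=True),   # attention only, dummy FFN
]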

Supported Models

  • nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
  • nvidia/Llama-3_1-Nemotron-51B-Instruct
  • Other DeciLM-based models

Usage

Converting Models

python convert.py \
    --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
    --mlx-path ./nemotron-253b-mlx \
    --quantize --q-bits 5

Loading and Generation

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Model and DeciLMArgs live in this module's decilm.py; the import is only
# needed if you want to construct the model by hand rather than via load()
from decilm import Model, DeciLMArgs

# Load pre-converted model
model, tokenizer = load("./nemotron-253b-mlx")

# Generate text (recent mlx-lm releases pass sampling settings via a sampler)
sampler = make_sampler(temp=0.7)
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms",
    max_tokens=500,
    sampler=sampler,
    verbose=True,
)
print(response)
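
For long generations it is often more convenient to stream tokens as they are produced instead of waiting for the full response. A minimal sketch using mlx_lm's stream_generate is below; the exact fields on the yielded response objects can vary between mlx-lm releases.

from mlx_lm import load, stream_generate

model, tokenizer = load("./nemotron-253b-mlx")

# Print tokens as they are generated
for chunk in stream_generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms",
    max_tokens=500,
):
    print(chunk.text, end="", flush=True)
print()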

Command Line Usage

# Using mlx_lm CLI
mlx_lm.generate \
    --model ./nemotron-253b-mlx \
    --prompt "Your prompt here" \
    --max-tokens 1000 \
    --temp 0.8

# Start API server
mlx_lm.server \
    --model ./nemotron-253b-mlx \
    --host 0.0.0.0 \
    --port 8080
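
mlx_lm.server exposes an OpenAI-compatible HTTP API, so the server started above can be queried from any OpenAI-style client. A minimal standard-library sketch is shown below; the endpoint path and field names follow the OpenAI chat-completions format, and the host/port match the server command above.

import json
import urllib.request

payload = {
    "model": "nemotron-253b-mlx",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
    "max_tokens": 500,
    "temperature": 0.8,
}

# POST to the chat-completions endpoint of the local mlx_lm.server instance
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["choices"][0]["message"]["content"])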

Implementation Details

The implementation handles the following (a minimal dummy-layer sketch appears after the list):

  • Block configurations with variable architectures
  • Dummy layer passthrough (no computation)
  • FFN fusion for improved efficiency
  • Per-layer attention head configuration
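
The sketch below illustrates the dummy-layer handling: a decoder layer whose attention or FFN component has been removed simply skips that sub-block and lets the residual stream pass through unchanged. Class and attribute names are illustrative, not the actual names used in decilm.py.

import mlx.core as mx
import mlx.nn as nn

class DeciDecoderLayer(nn.Module):
    def __init__(self, attn, ffn, dims: int):
        super().__init__()
        # attn / ffn are None when the block config marks them as no-ops
        self.attn = attn
        self.ffn = ffn
        self.attn_norm = nn.RMSNorm(dims) if attn is not None else None
        self.ffn_norm = nn.RMSNorm(dims) if ffn is not None else None

    def __call__(self, x: mx.array, mask=None, cache=None) -> mx.array:
        # Dummy components are skipped entirely: no weights, no computation, no KV cache
        if self.attn is not None:
            x = x + self.attn(self.attn_norm(x), mask=mask, cache=cache)
        if self.ffn is not None:
            x = x + self.ffn(self.ffn_norm(x))
        return x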

Performance

Tested on Mac Studio M3 Ultra (512GB RAM):

  • Nemotron-253B Q5: ~3.86 tokens/sec generation
  • Memory usage: ~175GB peak
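
The peak figure is roughly what the quantized weights alone would suggest. A back-of-envelope check, assuming MLX's quantization layout adds about 0.5 bits per weight of per-group scale/bias overhead on top of the 5-bit values:

# Rough weight-memory estimate for a 253B-parameter model quantized to 5 bits
params = 253e9
bits_per_weight = 5 + 0.5              # ~0.5 bits/weight assumed for per-group scales/biases
weight_bytes = params * bits_per_weight / 8
print(f"{weight_bytes / 1e9:.0f} GB")  # ~174 GB, consistent with the observed ~175GB peak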

LM Studio Compatibility

⚠️ Note: DeciLM models are currently NOT compatible with LM Studio due to the NAS architecture with dummy layers. LM Studio expects standard transformer layers and encounters "NoneType object has no attribute 'shape'" errors with dummy components.

Use mlx_lm CLI tools instead:

# Generate text
uv run mlx_lm.generate \
  --model /path/to/nemotron-mlx \
  --prompt "Your prompt here" \
  --max-tokens 1000

# Start server
uv run mlx_lm.server \
  --model /path/to/nemotron-mlx \
  --host 0.0.0.0 \
  --port 8080

Tokenizer Issues

If you encounter tokenizer issues, check the USE-IF-MODEL-FAILED-TO-GENERATE subfolder in the model directory for patched tokenizer configs and chat templates.
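
One way to apply those patches, assuming the subfolder contains drop-in replacements for the corresponding files in the model directory (check the file names, and back up the originals, before overwriting anything):

import shutil
from pathlib import Path

model_dir = Path("/path/to/nemotron-mlx")
patched = model_dir / "USE-IF-MODEL-FAILED-TO-GENERATE"

# Copy the patched tokenizer config / chat template files over the originals
for f in patched.iterdir():
    if f.is_file():
        shutil.copy2(f, model_dir / f.name)
        print(f"applied {f.name}")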

Requirements

  • MLX: >= 0.26.1
  • Python: 3.11 - 3.12 (tested with CPython 3.12.11 via uv)
  • Memory: Sufficient unified memory for the quantized model (e.g., ~175GB peak for the Q5 Nemotron-253B)
  • mlx-lm: Latest version for model inference

Production Deployment

For production-grade API deployment, consider using lbrxServer:

  • Robust API endpoints for various LLM architectures
  • Native support for DeciLM/Nemotron models
  • Built-in load balancing and request queuing
  • Compatible with OpenAI API format

Model Availability

Pre-converted DeciLM models for MLX:

Testing

Run the test suite:

cd tests
python -m pytest test_decilm.py -v

For integration testing with a real model:

python test_generation.py --model-path /path/to/decilm-model

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This module follows the same license as mlx-examples. Model weights are subject to their original licenses.