Mirror of https://github.com/ml-explore/mlx-examples.git (synced 2025-08-29 13:01:53 +08:00)
Fixes #513

Implement the Direct Preference Optimization (DPO) method as a Reinforcement Learning from Human Feedback (RLHF) example.

* **Add DPO Functions**: Add `get_batched_logps` and `dpo_loss` functions to `llms/mlx_lm/utils.py` for the DPO implementation.
* **Update Training Logic**: Update `llms/mlx_lm/tuner/trainer.py` to include DPO-specific training logic, including a new `dpo_loss` function and a condition that checks for DPO loss in the training loop.
* **Add Configuration Options**: Add configuration options for DPO in `llms/mlx_lm/examples/lora_config.yaml`.
* **Update Documentation**: Update `llms/mlx_lm/README.md` to include instructions for using DPO.
* **Add Unit Tests**: Add `llms/tests/test_dpo.py` with unit tests for `get_batched_logps`, `dpo_loss`, and the DPO-specific training logic.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/ml-explore/mlx-examples/issues/513?shareId=XXXX-XXXX-XXXX-XXXX).

File | ||
---|---|---|
test_datsets.py | ||
test_dpo.py | ||
test_finetune.py | ||
test_generate.py | ||
test_gguf.py | ||
test_models.py | ||
test_prompt_cache.py | ||
test_sample_utils.py | ||
test_server.py | ||
test_tokenizers.py | ||
test_tuner_utils.py | ||
test_utils_load_model.py | ||
test_utils.py |
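
The commit message above names a `dpo_loss` function but this listing does not show its body. As a rough illustration of what such a loss typically computes, here is a minimal sketch of the standard DPO objective in plain Python. The function name `dpo_loss_sketch`, its argument names, and the default `beta` are assumptions made for this example and are not taken from `llms/mlx_lm/utils.py`; the actual code in the repository operates on MLX arrays and may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss_sketch(
    policy_chosen_logps,
    policy_rejected_logps,
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
):
    """Mean DPO loss over a batch of preference pairs.

    Hypothetical helper for illustration only. Each argument is a list of
    summed sequence log-probabilities, one entry per preference pair.
    `beta` scales how strongly the policy is pushed away from the reference.
    """
    losses = []
    for pc, pr, rc, rr in zip(
        policy_chosen_logps,
        policy_rejected_logps,
        ref_chosen_logps,
        ref_rejected_logps,
    ):
        # How much more the policy prefers "chosen" over "rejected"
        # than the frozen reference model does.
        margin = (pc - pr) - (rc - rr)
        losses.append(-log_sigmoid(beta * margin))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0, so the loss is log(2).
    print(dpo_loss_sketch([-1.0], [-2.0], [-1.0], [-2.0]))
```

When the policy assigns relatively more probability to the chosen response than the reference does, the margin becomes positive and the loss drops below log(2); a test such as the one described for `test_dpo.py` could check exactly that kind of property.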