mlx-examples

mirror of https://github.com/ml-explore/mlx-examples.git synced 2025-08-29 18:26:37 +08:00

Anupam Mediratta 607c300e18 Add Direct Preference Optimization (DPO) method	2025-02-12 15:21:21 +05:30

Fixes #513

Implement the Direct Preference Optimization (DPO) method as a Reinforcement Learning from Human Feedback (RLHF) example.

* Add DPO Functions: Add `get_batched_logps` and `dpo_loss` functions to `llms/mlx_lm/utils.py` for DPO implementation.
* Update Training Logic: Update `llms/mlx_lm/tuner/trainer.py` to include DPO-specific training logic, including a new `dpo_loss` function and a condition to check for DPO loss in the training loop.
* Add Configuration Options: Add configuration options for DPO in `llms/mlx_lm/examples/lora_config.yaml`.
* Update Documentation: Update `llms/mlx_lm/README.md` to include instructions for using DPO.
* Add Unit Tests: Add `llms/tests/test_dpo.py` with unit tests for `get_batched_logps`, `dpo_loss`, and DPO-specific training logic.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/ml-explore/mlx-examples/issues/513?shareId=XXXX-XXXX-XXXX-XXXX).
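The commit message names a `dpo_loss` function but does not show its body or signature. As a rough illustration of what a DPO objective computes, here is a minimal sketch in plain Python (standard library only, not MLX). The function name `dpo_loss_sketch`, its parameters, and the default `beta=0.1` are assumptions made for this example; they are not taken from the commit. The inputs are assumed to be per-example summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss_sketch(
    policy_chosen_logps,
    policy_rejected_logps,
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,  # assumed default; strength of the pull toward the reference
):
    """Mean DPO loss over a batch of preference pairs (illustrative only)."""
    pairs = list(
        zip(
            policy_chosen_logps,
            policy_rejected_logps,
            ref_chosen_logps,
            ref_rejected_logps,
        )
    )
    if not pairs:
        raise ValueError("at least one preference pair is required")

    losses = []
    for pc, pr, rc, rr in pairs:
        # How much more the policy prefers the chosen response than the
        # reference model does, measured in log-probability.
        logits = (pc - pr) - (rc - rr)
        losses.append(-log_sigmoid(beta * logits))
    return sum(losses) / len(losses)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy favors the chosen response more strongly than the reference does.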
chat.py	rm temp argument (#1267)	2025-02-09 11:39:11 -08:00
generate_response.py	fix encoding with special tokens + chat template (#1189)	2025-01-03 10:50:59 -08:00
lora_config.yaml	Add Direct Preference Optimization (DPO) method	2025-02-12 15:21:21 +05:30
merge_config.yaml	Support for slerp merging models (#455)	2024-02-19 20:37:15 -08:00
pipeline_generate.py	fix deepseek sharding (#1242)	2025-02-03 16:59:50 -08:00