mlx.optimizers.Adafactor
- class Adafactor(learning_rate: float | Callable[[array], array] | None = None, eps: Tuple[float, float] = (1e-30, 0.001), clip_threshold: float = 1.0, decay_rate: float = -0.8, beta_1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, warmup_init: bool = False)
 The Adafactor optimizer.
Our Adafactor implementation follows the original paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
- Parameters:
  - learning_rate (float or callable, optional) – The learning rate. Default: None.
  - eps (tuple(float, float), optional) – The first term \(\epsilon_1\) is added to the square of the gradients to improve numerical stability, and the second term \(\epsilon_2\) is used for parameter scaling if scale_parameter is set to True. Default: (1e-30, 1e-3).
  - clip_threshold (float, optional) – Clips the unscaled update at clip_threshold. Default: 1.0.
  - decay_rate (float, optional) – Coefficient for the running average of the squared gradient. Default: -0.8.
  - beta_1 (float, optional) – If set to a value greater than zero, the first moment is used. Default: None.
  - weight_decay (float, optional) – The weight decay \(\lambda\). Default: 0.0.
  - scale_parameter (bool, optional) – If set to True, the learning rate is scaled by \(\max(\epsilon_1, \text{RMS}(w_{t-1}))\). Default: True.
  - relative_step (bool, optional) – If set to True, the learning_rate is ignored and a relative step size is computed instead. Default: True.
  - warmup_init (bool, optional) – If set to True, the relative step size is computed from the current step. Default: False.
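A minimal construction sketch (not part of the reference itself; the values are illustrative): with the defaults, relative_step=True means the learning_rate argument is ignored and the step size is derived from the update count; to use a fixed learning rate, disable relative_step (and optionally scale_parameter).

```python
import mlx.optimizers as optim

# Default configuration: no learning rate is required because
# relative_step=True computes a relative step size from the step count.
opt = optim.Adafactor()

# Fixed learning rate: turn off the relative step size (and parameter
# scaling) so the supplied value of 1e-3 is used directly.
opt_fixed = optim.Adafactor(
    learning_rate=1e-3,
    relative_step=False,
    scale_parameter=False,
)
```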
Methods
__init__([learning_rate, eps, ...])
apply_single(gradient, parameter, state) – Performs the Adafactor parameter and state update.
init_single(parameter, state) – Initialize the optimizer state.
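For context, a sketch of how these methods are typically exercised (the toy model, data, and loss below are assumptions for illustration, not from this page): users normally call the optimizer's update() method, which initializes per-parameter state via init_single() on the first step and then calls apply_single() for each gradient, parameter, and state triple.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# Toy model, data, and loss, assumed purely for illustration.
model = nn.Linear(4, 2)
x = mx.random.normal((8, 4))
y = mx.random.normal((8, 2))

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

optimizer = optim.Adafactor()  # relative step size, no explicit learning rate

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
loss, grads = loss_and_grad_fn(model, x, y)

# update() traverses the parameter tree, creating state with init_single()
# on the first call and applying apply_single() to each parameter.
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```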