
When we lose a running pod (possibly loss of spot instance) or encounter some other infrastructure-related failure of this job, we need to retry it. This retries the job the maximum number of times in those cases.
When we lose a running pod (possibly loss of spot instance) or encounter some other infrastructure-related failure of this job, we need to retry it. This retries the job the maximum number of times in those cases.