* faster contiguous gather for indices in the first axis
* work per thread > 1
* Angelos' suggestion for scales / biases