* Split reduction files to reduce compile times
* Add small and medium axis size specializations for row reductions
* Add non-row-reduction options for small and med kernels
* added leaky_relu relu6 softplus elu celu logsigmoid
* minor fixes for docstring and benchmark imports
* fixed elu implementation and added tests
* added tests for optional param, changed leaky_relu param to fit pytorch documentation