I've nerdsniped myself into finding a faster approximation of cube root (needed for Lab #color space):
- polynomial approximations alone are not sufficient for x < 0.2. Not even big ones, blended, nor Padé.
- bit-twiddling tricks don't vectorize, and/or need exp2 that is expensive itself.

In the end I've found that a simple polynomial + 2× Halley's Rational Method is unbeatable. Precise. Autovectorizes nicely. 2.5× faster than std. 5× faster in batches of four.
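Here is a minimal sketch of the polynomial + 2× Halley idea in Rust. It assumes inputs in [0.008856, 1], the range where Lab actually takes a cube root. The quadratic coefficients are illustrative, hand-fitted placeholders rather than the tuned ones, and the names `cbrt_approx` and `cbrt_approx_x4` are made up for this example.

```rust
/// Approximate cube root for x in [0.008856, 1.0].
/// Accuracy outside that range is not checked in this sketch.
#[inline]
pub fn cbrt_approx(x: f32) -> f32 {
    // Rough quadratic initial guess in Horner form.
    // Illustrative coefficients (an assumption, not tuned): the guess is
    // off by up to about 22% at the low end of the range.
    let mut y = 0.2372 + x * (1.7313 - 0.9685 * x);

    // Two Halley iterations for f(y) = y^3 - x:
    //   y <- y * (y^3 + 2x) / (2y^3 + x)
    for _ in 0..2 {
        let y3 = y * y * y;
        y = y * (y3 + 2.0 * x) / (2.0 * y3 + x);
    }
    y
}

/// Batch of four. The body is branch-free, so the compiler has a chance
/// to vectorize it, though that is not guaranteed.
#[inline]
pub fn cbrt_approx_x4(xs: [f32; 4]) -> [f32; 4] {
    xs.map(cbrt_approx)
}
```

Each Halley step roughly cubes the relative error (times 2/3), so a 22% guess drops to about 0.5% after one step and to around 1e-7 after two. That is why a crude polynomial is good enough as a starting point even though it is useless on its own.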