I tried with vl = 32 and the m8 (LMUL = 8) intrinsics; this gives a slightly higher speedup.
You can’t use arbitrary values for vl - it’s capped at the max number of lanes available in hardware, and in these SoCs that’s 4 lanes for 32-bit elements at m1. In the earlier code you’ll see this call:
/* how many lanes to use this pass */
size_t vl = vsetvl_e32m1(len - i);
That takes the min() of its argument and the max HW lanes (4), so each pass uses as many lanes as possible.
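To make the min() behaviour concrete, here is a rough portable C model of it (plain C, no RISC-V intrinsics, so it runs anywhere). The function names model_vlmax, model_vsetvl and model_passes are made up for illustration, and VLEN = 128 bits is an assumption inferred from the "4 lanes of 32-bit" figure above. It also simplifies the real rule: actual hardware is allowed to return a smaller vl in some cases, so treat this as a sketch, not the spec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: max elements per pass = LMUL * VLEN / SEW.
   vlen_bits (VLEN) is an assumed hardware parameter, e.g. 128. */
static size_t model_vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits > 0 && vlen_bits % sew_bits == 0 && lmul > 0);
    return lmul * (vlen_bits / sew_bits);
}

/* Simplified vsetvl: grant min(requested, vlmax) elements. */
static size_t model_vsetvl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Count the passes a strip-mined loop needs over len elements. */
static size_t model_passes(size_t len, size_t vlmax)
{
    assert(vlmax > 0);
    size_t passes = 0;
    for (size_t i = 0; i < len; ) {
        size_t vl = model_vsetvl(len - i, vlmax); /* lanes granted this pass */
        i += vl;                                  /* advance by what we got */
        passes++;
    }
    return passes;
}
```

Under these assumptions, model_vlmax(128, 32, 1) is 4 and model_vlmax(128, 32, 8) is 32, so asking for 32 elements is granted in full only when the register group is m8. The loop always advances by the granted vl rather than the requested one, which is the point of calling vsetvl on every pass.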
Thanks for the advice, I’m new to this.
However, are you 100% sure about this? When I force vl to 32, it also works and I get the same results.
You can certainly ask for 32 lanes, but with e32m1 the hardware only has 4 if you’re doing single-precision floating point, so you won’t get more than that per pass. The purpose of the vsetvl_e32m1() function is to tell you what the hardware can do so you can index through your data vector effectively. If you don’t care then I guess you can YOLO it and hope for the best.