I’d like to use the RISC-V vector extension on my Duo 256M, but all I get is an ‘illegal instruction’ error. The ISA string reported by the processor confirms that the vector extension is present, but I can’t get the code to execute.
How can I add the vector libraries on my board?
Or is it more complicated, and do I need to build a new image that supports vector computing?
My application is neural network inference, but maybe there are other ways to accelerate it?
Hey - thanks for the pointer to the RISC-V vector stuff which I’d been searching for today. I built the test code above and tried it on my Duo 64 with an SDK V2 kernel and it works just fine.
I’m going to see if I can get SIMD stuff working with pffft to speed it up a bit.
Digging deeper into pffft, it looks like porting it to the RISC-V vector extension is not going to be feasible, due to the dynamic sizing of RVV types. More on that here:
The problem is that pffft relies on statically sized 4-element vectors. That’s fine on all the other vector architectures it supports, but it assumes that size and builds structs and unions around it. Because RVV type sizes are not known at compile time, they can’t be used as members of structs and unions, and there’s no way around this in pffft short of rewriting it from scratch.
Did you succeed in the execution of the vectorized code on the Duo?
I see you have a Duo 64, so it’s not the same chip as the Duo 256. Does the image on the Duo 64 support the vector extension?
Hmmm, I see here that the RISC-V CPU is the same on both Duos; only the Arm core is missing on the Duo 64. So did you execute the vector code on the Duo 64?
Good for you. Why can’t I manage to run it on the Duo 256 without an ‘illegal instruction’ error, then? Is there an image for the Duo 256 that allows for vector instructions?
Thanks for your help. I’m sorry, but I’m abroad for a week and far from the board. I think I’m using the official image for the Duo 256. I’ll check when I’m back.
I’ve been experimenting more with vectors using the RVV intrinsics - it’s basically working but I thought I’d drop some references here that might be helpful.
First, there are a lot of vector instructions - when you cover all the different data formats (various bitwidths, integers, fixed and floating point, etc) it adds up. The document that covers what all is available is here:
Note that document doesn’t tell you what the actual syntax for the intrinsic is - for that you need a different set of documents:
On top of that, even these docs don’t give you the exact syntax, but they’ll get you close - basically you remove the __riscv_ from the beginning of the names and you’ll be fine. But I’ve found a few cases where the number of parameters doesn’t match up, and then you need to look into the riscv_vector.h file for more clues.
But my original question remains: how do I execute the vectorized code on the Duo 256? My image doesn’t support the vector extension. Where can I find an image with vector support, or how can I build one myself?
Grab one of those for your board, flash it to a micro SD card and then try your vector code out on that.
I assume you’ve also got a recent install of ‘duo-examples’ that includes the host-tools for building user-space apps. That’s what I’m using for my tests.
I’ve been testing out some more complex vector stuff and built a “test harness” that lets me write scalar and vector functions to compare their math and timing. Interestingly, I’m not seeing huge gains. For example, here’s a pair of functions that do exactly the same thing - complex multiplication of interleaved Re,Im data - and I’m seeing only about a 25-30% speedup for the vector version:
/*
 * this is the original non-vector code
 */
void scalar_func1(float *pSrc, int len)
{
    for (int i = 0; i < len; i++)
    {
        float re = pSrc[2*i] * twiddles[2*i] - pSrc[2*i+1] * twiddles[2*i+1];
        float im = pSrc[2*i] * twiddles[2*i+1] + pSrc[2*i+1] * twiddles[2*i];
        pSrc[2*i] = re;
        pSrc[2*i+1] = im;
    }
}
/*
 * this is the vectorized version - does it match?
 */
void vector_func1(float *pSrc, int len)
{
    size_t i = 0;
    while (i < (size_t)len)
    {
        /* how many lanes to use this pass */
        size_t vl = vsetvl_e32m1(len - i);
        /* load real & imag w/ stride of 2 for interleaved */
        vfloat32m1_t va = vlse32_v_f32m1(pSrc + 2*i, 2*sizeof(float), vl);
        vfloat32m1_t vb = vlse32_v_f32m1(pSrc + 2*i + 1, 2*sizeof(float), vl);
        vfloat32m1_t vc = vlse32_v_f32m1(twiddles + 2*i, 2*sizeof(float), vl);
        vfloat32m1_t vd = vlse32_v_f32m1(twiddles + 2*i + 1, 2*sizeof(float), vl);
        /* multiplies */
        vfloat32m1_t vac = vfmul_vv_f32m1(va, vc, vl);
        vfloat32m1_t vbd = vfmul_vv_f32m1(vb, vd, vl);
        vfloat32m1_t vad = vfmul_vv_f32m1(va, vd, vl);
        vfloat32m1_t vbc = vfmul_vv_f32m1(vb, vc, vl);
        /* sums */
        vfloat32m1_t vre = vfsub_vv_f32m1(vac, vbd, vl);
        vfloat32m1_t vim = vfadd_vv_f32m1(vad, vbc, vl);
        /* store real & imag w/ stride of 2 for interleaved */
        vsse32_v_f32m1(pSrc + 2*i, 2*sizeof(float), vre, vl);
        vsse32_v_f32m1(pSrc + 2*i + 1, 2*sizeof(float), vim, vl);
        i += vl;
    }
}
I’ll be checking out some more extensive operations, but these kinds of gains are not as impressive as I’d hoped. Looking at the disassembled code, there’s quite a lot of non-vector work in the looping and addressing, and that overhead may be skewing the results on simple operations like this. It may also be that using C intrinsics leaves some performance on the table - I’ve seen suggestions from more experienced folks that assembly gives the best gains, so I may look into that.