I’d like to use the RISC-V vector extension on my Duo 256M, but all I get is an ‘illegal instruction’ error. The ISA string reported by the processor confirms that the vector extension is present, but I can’t get the code to execute.
How can I add the vector libraries on my board?
Or is it more complicated, and do I need to build a new image that supports vector computing?
My application is neural network inference, but maybe there are other ways to accelerate it?
Hey - thanks for the pointer to the RISC-V vector stuff which I’d been searching for today. I built the test code above and tried it on my Duo 64 with an SDK V2 kernel and it works just fine.
I’m going to see if I can get SIMD stuff working with pffft to speed it up a bit.
Digging deeper into pffft, it looks like porting it to the RISC-V vector extension is not going to be feasible, due to the dynamic sizing of RVV types. More on that here:
The problem is that pffft relies on statically sized 4-element vectors. That’s fine on all the other vector architectures it supports, but it assumes that size and builds structs and unions around it. Because RVV type sizes are not known at compile time, they can’t be used as members of structs and unions, and there’s no way around this in pffft short of rewriting it from scratch.
Did you succeed in the execution of the vectorized code on the Duo?
I see you have a Duo 64, so it’s not the same chip as the Duo 256. Does the image on the Duo 64 support the vector extension?
Hmmm, I see here that the RISC-V CPU is the same on both Duos; only the Arm core is missing on the Duo 64. So did you execute the vector code on the Duo 64?
Good for you. Why can’t I manage to run it on the Duo 256 without an ‘illegal instruction’ error, then? Is there an image for the Duo 256 that allows for vector instructions?
Thanks for your help. I’m sorry, but I’m abroad for a week and far from the board. I think I’m using the official image for the Duo 256. I’ll check when I’m back.
I’ve been experimenting more with vectors using the RVV intrinsics - it’s basically working but I thought I’d drop some references here that might be helpful.
First, there are a lot of vector instructions - when you cover all the different data formats (various bitwidths, integers, fixed and floating point, etc) it adds up. The document that covers what all is available is here:
Note that document doesn’t tell you what the actual syntax for the intrinsic is - for that you need a different set of documents:
On top of that, even these docs don’t give you the exact syntax, but they’ll get you close - basically you remove the __riscv_ from the beginning of the names and you’ll be fine. But I’ve found a few cases where the number of parameters doesn’t match up, and then you need to look into the riscv_vector.h file for more clues.
But my original question remains: how do I execute the vectorized code on the Duo 256? My image doesn’t support the vector extension. Where can I find an image with vector support, or how can I build one myself?
Grab one of those for your board, flash it to a micro SD card and then try your vector code out on that.
I assume you’ve also got a recent install of ‘duo-examples’ that includes the host-tools for building user-space apps. That’s what I’m using for my tests.
I’ve been testing out some more complex vector stuff and built a “test harness” that lets me write scalar and vector functions to compare their math and timing. Interestingly, I’m not seeing huge gains. For example, here’s a pair of functions that do exactly the same thing - complex multiplication of interleaved Re,Im data - and I’m seeing only about a 25-30% speedup for the vector version:
/*
 * this is the original non-vector code
 */
void scalar_func1(float *pSrc, int len)
{
    for (int i = 0; i < len; i++)
    {
        float re = pSrc[2*i] * twiddles[2*i] - pSrc[2*i+1] * twiddles[2*i+1];
        float im = pSrc[2*i] * twiddles[2*i+1] + pSrc[2*i+1] * twiddles[2*i];
        pSrc[2*i] = re;
        pSrc[2*i+1] = im;
    }
}
/*
 * this is the vectorized version - does it match?
 */
void vector_func1(float *pSrc, int len)
{
    size_t i = 0;
    while (i < (size_t)len)
    {
        /* how many lanes to use this pass */
        size_t vl = vsetvl_e32m1(len - i);
        /* load real & imag w/ stride of 2 for interleaved */
        vfloat32m1_t va = vlse32_v_f32m1(pSrc + 2*i, 2*sizeof(float), vl);
        vfloat32m1_t vb = vlse32_v_f32m1(pSrc + 2*i + 1, 2*sizeof(float), vl);
        vfloat32m1_t vc = vlse32_v_f32m1(twiddles + 2*i, 2*sizeof(float), vl);
        vfloat32m1_t vd = vlse32_v_f32m1(twiddles + 2*i + 1, 2*sizeof(float), vl);
        /* multiplies */
        vfloat32m1_t vac = vfmul_vv_f32m1(va, vc, vl);
        vfloat32m1_t vbd = vfmul_vv_f32m1(vb, vd, vl);
        vfloat32m1_t vad = vfmul_vv_f32m1(va, vd, vl);
        vfloat32m1_t vbc = vfmul_vv_f32m1(vb, vc, vl);
        /* sums */
        vfloat32m1_t vre = vfsub_vv_f32m1(vac, vbd, vl);
        vfloat32m1_t vim = vfadd_vv_f32m1(vad, vbc, vl);
        /* store real & imag w/ stride of 2 for interleaved */
        vsse32_v_f32m1(pSrc + 2*i, 2*sizeof(float), vre, vl);
        vsse32_v_f32m1(pSrc + 2*i + 1, 2*sizeof(float), vim, vl);
        i += vl;
    }
}
I’ll be checking out some more extensive operations, but these kinds of gains are not as impressive as I’d hoped. Looking at the disassembled code, there’s quite a lot of non-vector work in the looping and addressing, and that overhead may be skewing the results on simple operations like this. It may also be that using C intrinsics leaves some performance on the table - I’ve seen suggestions from more experienced folks that assembly gives the best gains, so I may look into that.