CV1800B, Baremetal

So yes, this is definitely a cache coherency problem. While I’m quite familiar with RISC-V, I’m much less familiar with its multi-core aspects. I do know that the RISC-V specs define very little in that area apart from basic principles, and that the rest is really up to CPU/SoC designers, so it differs from one SoC to the next.

There’s apparently no cache coherency feature on this SoC, so we seem to have to handle it all manually. But since a lot of documentation is missing, I may be overlooking something. I have also looked at the (open) C906 specs and will have another look.

What I did was really just to confirm the issue. Having to clean the whole cache (trigger write-back of all dirty entries) on the writing side and invalidate it on the reading side is like using a hammer to kill a fly. I hope there are less heavyweight ways of dealing with this on the CV1800B. I have tried using just fence instructions, but that wasn’t enough. I know there are cache instructions that can clean/invalidate specific cache lines rather than the whole cache, but since each line is 64 bytes, any reasonably large shared buffer would require many such instructions in a row. That doesn’t look very pretty.
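To put a number on "many such instructions", here is a minimal sketch that counts the cache lines a buffer touches, assuming the 64-byte line size mentioned above. The helper name `dcache_lines_spanned` is made up for illustration; it only does the arithmetic and issues no cache instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 64u /* line size quoted above; check it for your chip */

/* Number of cache lines touched by the byte range [addr, addr + size). */
static size_t dcache_lines_spanned(uintptr_t addr, size_t size)
{
	if (size == 0)
		return 0;

	/* Round the first and the last byte address down to their line start. */
	uintptr_t first = addr & ~(uintptr_t)(DCACHE_LINE_BYTES - 1);
	uintptr_t last = (addr + size - 1) & ~(uintptr_t)(DCACHE_LINE_BYTES - 1);

	return (size_t)((last - first) / DCACHE_LINE_BYTES) + 1;
}
```

A line-aligned 4 KiB buffer spans 64 lines, so 64 per-line operations per clean or invalidate. A 64-byte buffer that is not line-aligned already spans 2 lines.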

No, one can’t use the PMP for this. The PMP only deals with memory protection AFAIK. Marking areas as non-cacheable would be the MMU’s job and, as far as several of us can tell, the second core doesn’t have an MMU. So, no luck. Anyway, marking the shared area non-cacheable would make things pretty inefficient (directly accessing DDR RAM without caching doesn’t look too appealing).

As to setting the cache to write-through, it’s not possible. From the C906 specs, its D-Cache doesn’t support write-through, only write-back…

Even if it were possible, that still wouldn’t handle the reading side, which also needs its cache to be invalidated when data is modified. And if there isn’t any cache coherency mechanism on this SoC, we’re out of luck and will have to handle it all manually.

Of course, if anyone has done this and knows more about how to handle cache coherency properly on this SoC, please help! :slight_smile:

So as a follow-up, unless someone has more details about this SoC implementation, I’ll definitely conclude that while both cores can “freely” access all DDR RAM (minus whatever may be protected by the PMP), there is indeed absolutely no cache coherency between the two. So, as I said earlier, this has to be handled manually.

Another point I figured out is that each core has its own CLINT and PLIC registers (so they don’t share the PLIC as could be expected in a multi-core SoC), but these are mapped at the same addresses for both cores. Which is kind of confusing. But that’s what it is. Tried and tested. The PLIC for the second core handles fewer interrupts and is thus likely much smaller physically on the chip. Also, the second core’s PLIC uses different interrupt numbers from the first core’s. They’re not listed in the CV1800B datasheet, but they are in the SG200x datasheets, and they are the same for the common interrupt sources. So I recommend reading the SG200x datasheet(s) as well.

Back to cache coherency: as I mentioned earlier, there doesn’t seem to be any other way than manually cleaning/invalidating the cache lines that cover the shared memory area. There are instructions (Xthead) for that, either for the whole data cache or for specific lines. If the shared area is as large as, or larger than, the data cache, I recommend using the full-cache clean/invalidate instructions: that’s just one instruction and makes sense in that case. But if the shared area is smaller than the data cache, it’s probably more efficient to use the instructions that act on a specific cache line (which take a physical address). As a bonus, I’m copying the C functions that I wrote to cover both cases.

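#include <stddef.h>
#include <stdint.h>

#define CV1800B_DCACHE_LINE_BYTES				64

/* Clean (write back) every dirty line of the D-cache. */
static inline __attribute__((always_inline)) void CV1800B_DCache_CleanAll(void)
{
	/* The "memory" clobber keeps the compiler from moving loads/stores across the cache operation. */
	__asm__ volatile ("th.dcache.call; th.sync" : : : "memory");
}

/* Invalidate the whole D-cache. Dirty lines that were not cleaned first are discarded. */
static inline __attribute__((always_inline)) void CV1800B_DCache_InvalidateAll(void)
{
	__asm__ volatile ("th.dcache.iall; th.sync" : : : "memory");
}

/* Clean the cache lines covering [pMemory, pMemory + nSize). pMemory is a physical address. */
static inline __attribute__((always_inline)) void CV1800B_DCache_CleanPA(void *pMemory, size_t nSize)
{
	/* Round the start address down to the beginning of its cache line. */
	uintptr_t nLineAddress = ((uintptr_t) pMemory) & (uintptr_t)(~((uintptr_t)(CV1800B_DCACHE_LINE_BYTES - 1)));
	uintptr_t nAddressEnd = (uintptr_t)(((uintptr_t) pMemory) + nSize);
	
	for (; nLineAddress < nAddressEnd; nLineAddress += CV1800B_DCACHE_LINE_BYTES)
		__asm__ volatile ("th.dcache.cpa %0" :  : "r" (nLineAddress) : "memory");
	
	__asm__ volatile ("th.sync" : : : "memory");
}

/* Invalidate the cache lines covering [pMemory, pMemory + nSize). pMemory is a physical address.
   Partial lines at either end are invalidated whole, so keep shared buffers line-aligned. */
static inline __attribute__((always_inline)) void CV1800B_DCache_InvalidatePA(void *pMemory, size_t nSize)
{
	uintptr_t nLineAddress = ((uintptr_t) pMemory) & (uintptr_t)(~((uintptr_t)(CV1800B_DCACHE_LINE_BYTES - 1)));
	uintptr_t nAddressEnd = (uintptr_t)(((uintptr_t) pMemory) + nSize);
	
	for (; nLineAddress < nAddressEnd; nLineAddress += CV1800B_DCACHE_LINE_BYTES)
		__asm__ volatile ("th.dcache.ipa %0" :  : "r" (nLineAddress) : "memory");
	
	__asm__ volatile ("th.sync" : : : "memory");
}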

Tested and works. Hope it can be useful for those who are going baremetal on these chips. The above should work on CV1800B/SG2000/SG2002, and more generally on any C906 core, although you’ll have to check the D-Cache line size, which is defined as 64 (bytes) here. It may be different on another chip.
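As a rough usage sketch of the "clean on the writing side, invalidate on the reading side" scheme, here is a one-way mailbox between the two cores. Everything in it is hypothetical and not from the original code: the names `shared_mailbox_t`, `mailbox_publish`, `mailbox_poll`, and the `cache_clean`/`cache_invalidate` wrappers. The wrappers are written so the block also builds on a host PC, where they do nothing. It assumes a single writer core and a single reader core, 64-byte lines, and addresses that are physical (no MMU translation).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u /* D-cache line size quoted in the post */

/* Hypothetical wrappers: per-line T-Head ops on RISC-V, no-ops on a host build. */
static void cache_clean(const volatile void *p, size_t n)
{
#if defined(__riscv)
	uintptr_t a = (uintptr_t)p & ~(uintptr_t)(LINE_BYTES - 1);
	for (; a < (uintptr_t)p + n; a += LINE_BYTES)
		__asm__ volatile ("th.dcache.cpa %0" : : "r"(a) : "memory");
	__asm__ volatile ("th.sync" : : : "memory");
#else
	(void)p; (void)n;
#endif
}

static void cache_invalidate(const volatile void *p, size_t n)
{
#if defined(__riscv)
	uintptr_t a = (uintptr_t)p & ~(uintptr_t)(LINE_BYTES - 1);
	for (; a < (uintptr_t)p + n; a += LINE_BYTES)
		__asm__ volatile ("th.dcache.ipa %0" : : "r"(a) : "memory");
	__asm__ volatile ("th.sync" : : : "memory");
#else
	(void)p; (void)n;
#endif
}

/* Payload and sequence counter each start on their own cache line, so
   invalidating one never throws away unrelated data sharing the line. */
typedef struct {
	_Alignas(64) uint8_t payload[192];
	_Alignas(64) volatile uint32_t seq;
} shared_mailbox_t;

/* Writer core: push the payload to DRAM first, then bump and push seq. */
static int mailbox_publish(shared_mailbox_t *mb, const void *data, size_t n)
{
	if (n > sizeof mb->payload)
		return -1;
	memcpy(mb->payload, data, n);
	cache_clean(mb->payload, n);
	mb->seq = mb->seq + 1;
	cache_clean(&mb->seq, sizeof mb->seq);
	return 0;
}

/* Reader core: returns 1 and copies the payload if seq moved past last_seq. */
static int mailbox_poll(shared_mailbox_t *mb, uint32_t last_seq, void *out, size_t n)
{
	if (n > sizeof mb->payload)
		return -1;
	cache_invalidate(&mb->seq, sizeof mb->seq);
	if (mb->seq == last_seq)
		return 0;
	cache_invalidate(mb->payload, n);
	memcpy(out, mb->payload, n);
	return 1;
}
```

The ordering is the point of the sketch: the payload is cleaned before the counter, so a reader that sees a new `seq` in DRAM also finds the new payload there. The reader must never write into the mailbox, otherwise its own dirty lines could later overwrite the writer’s data.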

Implemented control of a number of peripherals with no major issues, even SDIO.

Trying to implement USB now. The documentation in the datasheets lacks a lot of information - actually, I’m not 100% sure why they bothered to add that part, as it’s so incomplete that, on its own, it’s mostly useless.

I figured out from the (very partial) list of registers and source code in the SDK that USB on these chips is a DWC2 core. DWC2 documentation itself is not public. You can get it from Synopsys, but you need a customer account apparently. So if you’re not a customer, no luck. But you’ll find more about it in the documentation of some other CPUs/MCUs that also use a DWC2 IP (like STM32 MCUs). Unfortunately, this IP can be configured in various ways and even that is not clearly documented for the CV1800B (and SG200x). There are 4 registers that can be read to get some key parameters though (GHWCFG1 to GHWCFG4), such as the number of endpoints (which appears to be 8 out of max. 16).
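For reference, a minimal sketch of pulling two of those parameters out of a GHWCFG2 value. The field positions are the ones used by the Linux dwc2 driver’s register definitions and are assumed, not confirmed, to apply to this chip; the helper names are made up. As far as I can tell, the Linux driver treats the endpoint field as not counting EP0.

```c
#include <stdint.h>

/* Field layout as in the Linux dwc2 driver (assumption for the CV1800B). */
#define GHWCFG2_NUM_DEV_EP_SHIFT	10
#define GHWCFG2_NUM_DEV_EP_MASK		(0xFu << GHWCFG2_NUM_DEV_EP_SHIFT)
#define GHWCFG2_HS_PHY_TYPE_SHIFT	6
#define GHWCFG2_HS_PHY_TYPE_MASK	(0x3u << GHWCFG2_HS_PHY_TYPE_SHIFT)

/* Number of device endpoints reported by the core configuration. */
static inline unsigned dwc2_num_dev_ep(uint32_t ghwcfg2)
{
	return (ghwcfg2 & GHWCFG2_NUM_DEV_EP_MASK) >> GHWCFG2_NUM_DEV_EP_SHIFT;
}

/* High-speed PHY interface: 0 = none, 1 = UTMI+, 2 = ULPI, 3 = UTMI+ and ULPI. */
static inline unsigned dwc2_hs_phy_type(uint32_t ghwcfg2)
{
	return (ghwcfg2 & GHWCFG2_HS_PHY_TYPE_MASK) >> GHWCFG2_HS_PHY_TYPE_SHIFT;
}
```

Pass in the value read from the GHWCFG2 register. The PHY type field can at least hint at how the undocumented internal PHY is attached to the core.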

What is completely undocumented is the internal USB PHY, and it’s causing me headaches at the moment - at least I suspect it’s the cause of the issue that I’ll describe below. Source code in the SDK should help (like the dwc2_udc_otg_phy.c source file), but it supports a bunch of architectures and it’s tough to really understand what is specific to the CV1800B.

So the issue I’m running into is that the “core reset” feature hangs. You’ll probably need to have worked with DWC2 to follow this part.

To initialize the USB core, at some point you need to soft reset it, setting bit 0 (CSRST, called CSftRst in the CV1800B datasheet) of the GRSTCTL register, and waiting in a loop until it becomes clear (0). Problem is, it never becomes 0. I’ve read of similar issues on other targets and the usual reason is that either the clocks feeding the USB core or PHY are off, or the PHY itself is disabled.

I’ve checked that the USB clocks are enabled (clk_en_1 register, clk_axi4_usb and clk_apb_usb), but with the PHY undocumented and reverse engineering being tough from the source code, I’m not sure about the PHY. I’m suspecting it’s disabled or its clock(s) are. I’m sure I’m pretty close to solving it, but I’m obviously missing some key information.

If anyone happens to have worked on a DWC2 driver, ideally on this specific chip, and has any clue, I’d be grateful. Not holding my breath too much though.

Some follow-up: figured it out!

Actually, the DWC2 core version on this chip is 4.20a, which I was not expecting. This version changes the way ‘soft core reset’ works: there is an additional flag bit in GRSTCTL (bit 29, “soft reset done”) that must be checked instead of waiting for bit 0 to clear. Since bit 29 is undocumented in the CV1800B datasheet and marked “reserved”, I had no reason to suspect version 4.20a. But reading the GSNPSID register gives: 0x4F54420A. The least significant 16 bits are the core version, one hex digit per character (0x420A for 4.20a), so ‘BCD’-like.
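A minimal sketch of turning a GSNPSID value into a readable release string, by printing the four low nibbles as hex digits. The function name is made up, and `snprintf` is assumed to be available, which may not be the case in every baremetal setup.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DWC2_SNPSID_OTG_ID	0x4F54u /* upper 16 bits of the value above: ASCII "OT" */

/* Formats the low 16 bits of GSNPSID, e.g. 0x420A becomes "4.20a". */
static int dwc2_format_release(uint32_t gsnpsid, char *buf, size_t len)
{
	return snprintf(buf, len, "%x.%x%x%x",
			(unsigned)((gsnpsid >> 12) & 0xF),
			(unsigned)((gsnpsid >> 8) & 0xF),
			(unsigned)((gsnpsid >> 4) & 0xF),
			(unsigned)(gsnpsid & 0xF));
}
```

Comparing `gsnpsid >> 16` against `DWC2_SNPSID_OTG_ID` first is a cheap sanity check that the register read actually reached the USB core.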

One can find this change in the following patch, for instance:

I’m suspecting there may have been silicon revisions of the CV1800B with earlier versions of the DWC2 core.

Anyway, the source code found in the SDK (dwc2_udc_otg.c) deals with it (it checks the GSNPSID register) but since it’s very close to other DWC2 driver implementations, I wasn’t even thinking about this core version thing.

In the end, I check the core version as well and use a different ‘core reset’ check depending on it. For v. 4.20a and above, you need to set bit 0 of GRSTCTL, then wait for bit 29 to become 1 (meaning: core reset done), and finally clear both bits and move on.
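The sequence above can be sketched as follows. This is not the author’s code: the function names, the bounded poll count, and passing the GRSTCTL pointer and GSNPSID value as parameters are all choices made for illustration. The bit positions (0 and 29) and the 4.20a threshold come from the post.

```c
#include <stdbool.h>
#include <stdint.h>

#define DWC2_GRSTCTL_CSFTRST		(1u << 0)  /* core soft reset request */
#define DWC2_GRSTCTL_CSFTRST_DONE	(1u << 29) /* reset done flag, cores >= 4.20a */
#define DWC2_CORE_REV_4_20A		0x420Au

/* True if this core version reports reset completion through bit 29. */
static inline bool dwc2_has_reset_done_bit(uint32_t gsnpsid)
{
	return (gsnpsid & 0xFFFFu) >= DWC2_CORE_REV_4_20A;
}

/* Decide from one GRSTCTL read whether the soft reset has finished. */
static inline bool dwc2_core_reset_complete(uint32_t gsnpsid, uint32_t grstctl)
{
	if (dwc2_has_reset_done_bit(gsnpsid))
		return (grstctl & DWC2_GRSTCTL_CSFTRST_DONE) != 0;
	return (grstctl & DWC2_GRSTCTL_CSFTRST) == 0;
}

/* Request a soft reset and poll at most max_polls times; false on timeout. */
static bool dwc2_core_soft_reset(volatile uint32_t *grstctl, uint32_t gsnpsid,
				 uint32_t max_polls)
{
	*grstctl |= DWC2_GRSTCTL_CSFTRST;

	while (max_polls--) {
		if (!dwc2_core_reset_complete(gsnpsid, *grstctl))
			continue;
		/* Newer cores: clear both bits, as described in the post. */
		if (dwc2_has_reset_done_bit(gsnpsid))
			*grstctl &= ~(DWC2_GRSTCTL_CSFTRST | DWC2_GRSTCTL_CSFTRST_DONE);
		return true;
	}
	return false;
}
```

The bounded loop avoids the endless hang described earlier when the wrong bit is being watched. One caveat on the final write: as far as I can tell, the Linux dwc2 driver clears bit 0 and writes bit 29 back as 1 at that step, so it’s worth checking which variant your core expects.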

It does appear to work. With my code, the device now gets enumerated at high speed by the host. The ‘setup’ phase then fails, but that’s because I haven’t fully implemented it yet. The sequence of interrupts I get from the USB core seems correct so far though. Looking good! :+1:

I’m amazed by your work, and the way you’re documenting your research is quite good, tbh.
I know it doesn’t add anything to the job you’re doing, yet I merely wanted to say it is cool to witness it.

Just a quick additional note. I mentioned looking at the SG2000/2002 datasheets for the “pinmux” feature, but it turns out that while it’s very similar to the CV1800B, the registers are not exactly at the same offsets, even for common pins.

But while it’s left undocumented in the CV1800B datasheet (which OTOH clearly states that it’s preliminary), it is in an Excel document available here, which is definitely helpful:

Impressive work! Glad to hear that you’re having so much success. I’m curious if you’re planning to make your code available somewhere (like github)?

I might. We’ll see. :slight_smile:

Very impressed by your endeavors. I am very tempted to try running the main core bare metal as well, but I don’t have that much experience with that.
Have you tried running from the SRAM directly to avoid DRAM latency and complications? Looking at the FSBL code I’ve seen multiple areas (TPL looks to be 256KB for the CV181x, the 180x also has VC), decent enough for running the core as a microcontroller.

I’m curious to try MicroPython on the second core, tbh.
