[Openpvrsgx-devgroup] [PATCH] Update pvrsgx 1.14.3759903 to latest kernel

Mon Nov 15 07:46:40 CET 2021

Hi Lucas,

> Am 14.11.2021 um 19:58 schrieb Lucas Fryzek <lucas.fryzek at hazeco.xyz>:
> 
> 
>>> FYI: I looked into
>>> git diff --no-index drivers/gpu/drm/pvrsgx/1.14.3699939/eurasia_km/services4/srvkm/env/linux/osfunc.c drivers/gpu/drm/pvrsgx/1.14.3759903/eurasia_km/services4/srvkm/env/linux/osfunc.c
>>> and there seems to be some different fixed for 1.14.3699939.
>>> Mainly in OSAcquirePhysPageAddr() which uses
>>> -    psInfo->iNumPagesMapped = get_user_pages_remote(current->mm, uStartAddr, psInfo->iNumPages, FOLL_WRITE, psInfo->ppsPages, NULL, NULL);
>>> since LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
>>> instead of
>>> +    psInfo->iNumPagesMapped = get_user_pages(
>>> +               uStartAddr, psInfo->iNumPages, 1, psInfo->ppsPages, NULL);
>>> But don't ask me what the difference is and if it is the root cause of my kernel panic...
>>> BR,
>>> Nikolaus
>> Thanks for the all the additional information! I was able to trigger the crash by compiling the driver as a module (i.e. `CONFIG_SGX_JZ4780=m`), it appears that the default letux_defconfig doesn't do this. I am now seeing the same behavior, and trying to debug it. I don't believe `OSAcquirePhysPageAddr` is causing the issue here, I tried taking the changes from `1.14.3699939` and that didn't help. I am going to figure out which value triggers the paging exception and work backwards from there to see whats going wrong.
>> Also I did have a case where it decided to boot up fine without an error, and I saw the same results with `pvrsrvctrl` reporting that it was happy with the KMOD and then triggering a kernel crash.
> 
> After more investigation it appears that there are two problems:
> 
> First problem is that when the driver fails to init on boot it appears that `struct device *dev` pointer coming from `PVRLDMGetDevice` is returning NULL.

Ok.

> The second problem is that `dma_sync_single_for_device` does appear to be happy with the address being given to it. I suspect this is a problem with how the driver is allocating memory. The `dma_sync_*` APIs are intended to be used with memory allocated from `dma_alloc_*`
> 
> In the code I can see this comment which originally from the unmodified kernel code that IMGTEC shipped
> 
> ```
> /*
> * dmac cache functions are supposed to be used for dma
> * memory which comes from dma-able memory. However examining
> * the implementation of dmac cache functions and experimenting,
> * can assert that dmac functions are safe to use for high-mem
> * memory as well for our OS{Clean/Flush/Invalidate}Cache functions
> *
> */
> ```
> 
> I suspect this comment is no longer true, and if we want to use the DMA cache ops we need to allocate memory using the dma allocate functions, instead of using `vmalloc` or a modified version of `vmalloc`. This seems to be trivial change though as its difficult to tell if `OSAllocPages_Impl` is just used for DMA memory, or just kernel memory as well.

If I remember correctly there have ben several improvements in how caches and dma interoperate - and there have been quite recent updates to the ingenic cache system.

> There is probably a viable strategy of using the mips platform specific cache operations in `include/asm/cache*` instead of the `dma` ones. I don't think this would be a problem since the code for section is already wrapped in a `#elif defined(__mips__)`. Although if the goal of this project is to get this kernel module upstreamed, I suspect the kernel maintainers would prefer if we didn't use platform specific `ifdefs` and instead used the common kernel infrastructure that is platform independent.

Well, at the moment we just need a working driver... It is very unclear on what upstreaming can eventually be based on. Maybe the ddk 1.17 version? Or a complete rewrite? As all these CamelCase functions and variables are understandable from the IMG perspective. They provided SGX support for multiple OS for example Windows or the iPhone 4 with their Darwin kernel. So IMG has a full package following their conventions and this is wrapped to the specific OS interfaces. But if we want to include everything, it is miles away from Linux conventions.

And I would as a first step try to modify the SoC dependent stuff to a generic driver. And also remove all compile time options and errata fixes handled by #ifdef and convert them into a big quirks table based on the .compatible string. Maybe even a mix between generic sgx530/40/44 compatible and SoC specific ones.

But that is (far) future. At the moment I think we all would be happy if we can load firmware on the jz4780 just to have it also working on a non-ARM platform for better comparisons. More RE will show what the best way for upstreaming looks like...

> 
> For fun I just removed the call to `dma_sync_single_for_device` to see what would happen and the error went away, allowing `pvrsrvctl` to get further along until it printed these errors
> 
> ```
> [   89.495046] PVR_K: (FAIL) SGXInit: Mismatch in client-side and KM driver build options.
> [   89.531664] PVR_K: Extra options present in client-side driver: (0xa100). Please check sgx_options.h
> [   89.569613] PVR_K:(Error): PVRSRVFinaliseSystem: Failed PVRSRVDevInitCompatCheck call (device index: 0)
> [   89.608249] PVR_K:(Error): BridgedDispatchKM: Initialisation failed. Driver unusable.
> pvrsrvctl: SrvInit failed (already initialized?) (err=PVRSRV_ERROR_BUILD_MISMATCH
> ```
> 
> So after figuring out a fix for these cache operations, looks like next steps is spoofing the driver build options to match what the userland expects :)

That is excellent news!

Yes, making kernel and userland build options match is not very difficult. But we have to take care that the changes are only for jz4780.

BTW:  I'll plan to update the letux kernel tree to 5.16-rc1 this week and also rebase and push all the pvrsgx branches to our git tree.
Then I can take a look at the build options to match user-space.

BR,
Nikolaus