[Letux-kernel] SMP issue between LX16 and LX20 found...

H. Nikolaus Schaller hns at goldelico.com
Sun Jun 15 12:49:05 CEST 2025



> On 15.06.2025 at 12:07, H. Nikolaus Schaller <hns at goldelico.com> wrote:
> 
> Hi,
> 
>> On 14.06.2025 at 23:08, Paul Boddie <paul at boddie.org.uk> wrote:
>> 
>> On Saturday, 14 June 2025 19:58:46 CEST H. Nikolaus Schaller wrote:
>>> 
>>> So we have the issue that on x1600, ingenic_ost_cevt_cb()
>>> breaks when doing ost->soc_info->version and the pointer
>>> ost->soc_info is "randomized". This means that ost = timer->ost
>>> memory area is overwritten by someone.
>>> 
>>> What is the easiest way to find out who overwrites this?
>> 
>> I think we need a good mental model of how this is supposed to work, which I 
>> don't think I have really focused on until now. It seems to me that both CPUs 
>> (processor cores) access the same physical memory range for the peripherals,
> 
> Yes, they do.
> 
>> so unless there is some kind of extra layer or duplicate sets of some 
>> peripherals, both CPUs are capable of accessing the same registers and 
>> operating on the same peripherals.
>> 
>> So, it seems desirable that only one CPU would set up any given peripheral.
> 
> I think that is what the SMP code in the kernel usually does.
> 
>> Structures allocated during this process would, of course, be accessible later 
>> by both CPUs, but they would only be allocated once by a particular CPU. Once 
>> initialised, a given CPU might need to access a peripheral, perhaps in 
>> response to an interrupt, and this would need to be done without disruption 
>> from the other CPU.
> 
> This is usually done via some dev_id or smp_processor_id(), as in __request_percpu_irq().
> 
>> Usually, people focus on things like locking primitives when discussing multi-
>> CPU support, which is obviously relevant, but what I am missing is the control 
>> over initialisation, whether it must be done using a single CPU, whether it 
>> could be distributed across both CPUs, and how this is controlled.
> 
> The strange thing is that it all works on the dual core CPUs like jz4780 or x2000
> but fails when there is only a single CPU, as in the x1600. This indicates a
> different issue (memory management? cache management? compiler bug?)
> than what you are thinking of.
> 
> BTW: my current x2000 boot often fails like:
> 
> [    0.164566] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> [    0.166649] io scheduler mq-deadline registered
> [    0.168442] io scheduler kyber registered
> [   21.170219] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   21.170872] rcu:     1-....: (1 ticks this GP) idle=073c/1/0x40000002 softirq=8/9 fqs=1050
> [   21.170801] rcu:     (detected by 0, t=2102 jiffies, g=-1171, q=37 ncpus=2)
> [   21.169409] Sending NMI from CPU 0 to CPUs 1:
> [   21.169420] NMI backtrace for cpu 1
> [   21.169427] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT 
> [   21.169437] Hardware name: Letux LX20v0.3
> [   21.169442] $ 0   : 00000000 fffffff0 8100f540 8100f540
> [   21.169457] $ 4   : 00407000 80c08540 04200042 00000001
> [   21.169468] $ 8   : 00000000 ffffffe7 500015a4 00002a44
> [   21.169480] $12   : ffffffff 0000b67e 00000000 00000000
> [   21.169490] $16   : 80b8a4a0 80c08540 80b7c040 0000000a
> [   21.169501] $20   : 80c052f0 ffff8afc 40224f64 d1e00046
> [   21.169512] $24   : 7460f159 8054eb10                  
> [   21.169523] $28   : 818a0000 81813f80 00000080 8004bda0
> [   21.169534] Hi    : 00002a44
> [   21.169537] Lo    : 500015a4
> [   21.169540] epc   : 8004b164 handle_softirqs+0xd4/0x2d4
> [   21.169554] ra    : 8004bda0 irq_exit+0xc8/0x168
> [   21.169561] Status: 14001f03 KERNEL EXL IE 
> [   21.169571] Cause : 00801800 (ExcCode 00)
> [   21.169575] PrId  : 00132000 (Ingenic XBurst II)
> [   21.169580] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT 
> [   21.169588] Hardware name: Letux LX20v0.3
> [   21.169592] Stack : 81813d2c 800a9388 01000000 db3faa92 80b90000 80bbe153 00000000 00000000
> [   21.169613]         00000000 00000000 00000000 00000000 00000000 00000001 81813ce0 81893180
> [   21.169630]         00000000 00000000 80abfbdc 00000000 00000001 81813b3c 00000000 44495020
> [   21.169647]         81813b8a 80c175f8 80c1763c 00000020 80b90000 00000000 00000000 80abfbdc
> [   21.169665]         00000020 80ab7c68 00000030 800263c8 00000000 00000000 00000004 80c10004
> [   21.169683]         ...
> [   21.169689] Call Trace:
> [   21.169691] [<80029cd0>] show_stack+0x38/0x118
> [   21.169707] [<800200d8>] dump_stack_lvl+0x74/0xb0
> [   21.169720] [<8094f7b0>] nmi_cpu_backtrace+0x13c/0x144
> [   21.169732] [<800263d8>] handle_backtrace+0x10/0x54
> [   21.169740] [<801019ec>] __flush_smp_call_function_queue+0x174/0x360
> [   21.169751] [<80021da8>] ingenic_xburst2_mbox_handler+0x94/0xc8
> [   21.169759] [<800b96ec>] handle_percpu_devid_irq+0xc0/0x194
> [   21.169770] [<800b2a90>] handle_irq_desc+0x78/0x90
> [   21.169777] [<80971cb0>] do_IRQ+0x18/0x24
> [   21.169787] [<800241bc>] handle_int+0x140/0x14c
> [   21.169794] [<8004bda0>] irq_exit+0xc8/0x168
> [   21.169802] 
> 
> A successful boot looks like:
> 
> [    0.178235] Key type asymmetric registered
> [    0.182320] Asymmetric key parser 'x509' registered
> [    0.187383] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> [    0.194844] io scheduler mq-deadline registered
> [    0.199411] io scheduler kyber registered
> [    0.205242] ledtrig-cpu: registered to indicate activity on CPUs
> [    0.212132] Serial: 8250/16550 driver, 9 ports, IRQ sharing disabled
> [    0.225323] printk: legacy console [ttyS2] disabled
> [    0.230880] 10032000.serial: ttyS2 at MMIO 0x10032000 (irq = 45, base_baud = 1500000) is a 16550A
> [    0.239960] printk: legacy console [ttyS2] enabled
> [    0.239960] printk: legacy console [ttyS2] enabled
> [    0.249610] printk: legacy bootconsole [x1000_uart0] disabled
> [    0.249610] printk: legacy bootconsole [x1000_uart0] disabled
> [    0.266105] 10033000.serial: ttyS0 at MMIO 0x10033000 (irq = 44, base_baud = 1500000) is a 16550A
> [    0.283180] brd: module loaded
> 
> It could be that one of the spinlocks blocks the second core while it waits for the other...
> Removing the printk from __request_percpu_irq() made it work again.
> 
> So it is more likely that we have a race condition...
> 
> Another hint is that only one of my three LX20 boards boots well enough.
> Fluctuation caused by tiny hardware differences (the speed at which some subsystem initializes)
> could also explain this.
> 
> We would need access to the original x2000 SMP authors but they are no longer active...

We may indeed have a cache or memory issue. Here is a boot log from the same µSD card, which once failed
and once booted as shown above, now in a different LX20 unit:


[    0.473419] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support!
[    0.737557] EXT4-fs (mmcblk0p2): recovery complete
[    0.743644] EXT4-fs (mmcblk0p2): mounted filesystem 029581e7-6fd0-452d-a36a-94440b89df1d r/w with journalled data mode. Quota mode: none.
[    0.753386] VFS: Mounted root (ext4 filesystem) on device 179:2.
[    0.783534] devtmpfs: mounted
[    0.783984] Freeing unused kernel image (initmem) memory: 320K
[    0.789642] This architecture does not have kernel memory protection.
[    0.796092] Run /sbin/init as init process
[    0.800129]   with arguments:
[    0.803086]     /sbin/init
[    0.805762]   with environment:
[    0.808882]     HOME=/
[    0.811220]     TERM=linux
Mount failed for selinuxfs on /sys/fs/selinux:  No such file or directory
INIT: version 2.88 booting
[info] Using makefile-style concurrent boot in runlevel S.
[....] Starting the hotplug events dispatcher: udevd[    2.451279] systemd-udevd[578]: starting version 215
^[[c.
[ ok ] Synthesizing the initial hotplug events...done.
^[[c[....] Waiting for /dev to be fully populated...[    3.975971] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:1220: group 37, block bitmap and bg descriptor inconsistent: 32256 vs 32255 free clusters
[    4.063505] Reserved instruction in kernel code[#1]:
[    4.063086] CPU: 0 UID: 0 PID: 578 Comm: udevd Not tainted 6.16.0-rc1-letux-mips+ #3267 PREEMPT 
[    4.063643] Hardware name: Letux LX20v0.3
[    4.064899] $ 0   : 00000000 00000001 00000000 00000001
[    4.064636] $ 4   : 00000000 6a29ecbe 00000000 00000000
[    4.064375] $ 8   : 670b3220 80bb0f90 00000000 6a6be704
[    4.064113] $12   : 7b6854d3 8ec1c2c8 81ff52e0 20a1c866
[    4.063852] $16   : 00000001 80b8a4a0 80b90000 81003c00
[    4.063591] $20   : 00000000 81ff5280 00000000 80b90000
[    4.063329] $24   : 00000000 81ff52e0                  
[    4.063067] $28   : 85366000 85367d18 80c30000 809699e4
[    4.062806] Hi    : 00000000
[    4.062938] Lo    : 00000000
[    4.063065] epc   : 809684ac __schedule+0x88/0x1590
[    4.062457] ra    : 809699e4 schedule+0x30/0xf8
[    4.064232] Status: 14001c03 KERNEL EXL IE 
[    4.062931] Cause : 08800028 (ExcCode 0a)
[    4.064187] PrId  : 00132000 (Ingenic XBurst II)
[    4.063320] Modules linked in:
[    4.063622] Process udevd (pid: 578, threadinfo=(ptrval), task=(ptrval), tls=77eef740)
[    4.063317] Stack : 20a1c866 35942c1a 7d865353 0fe680fe b3b84ed5 6a29ecbe 00000004 80bb0f3c
[    4.063444]         00000000 85367de0 00000040 80bb0f6c 80b90000 80b90000 80c30000 8048a740
[    4.063573]         00000001 00000004 00000000 00000040 80b90000 6a29ecbe 81ff5280 85367de0
[    4.063700]         80b90000 85367de4 80b90000 00000003 80b90000 80b90000 80c30000 809699e4
[    4.063828]         14001c03 00000001 85367de0 80b90000 00000001 8001b928 00000001 6a29ecbe
[    4.063956]         ...
[    4.063654] Call Trace:
[    4.063348] [<809684ac>] __schedule+0x88/0x1590
[    4.062397] [<809699e4>] schedule+0x30/0xf8
[    4.063824] [<8001b928>] try_to_generate_entropy+0x254/0x2a8
[    4.063996] [<805d9090>] urandom_read_iter+0x104/0x10c
[    4.063647] [<801cc448>] vfs_read+0x2a4/0x358
[    4.062520] [<801cd098>] ksys_read+0x90/0x128
[    4.064123] [<80034028>] syscall_common+0x34/0x58
[    4.063343] 
[    4.064814] Code: 30421800  1440024b  00000000 <41606000> 0c0336d3  02c02025  00002825  0c01df15  02602025 
[    4.063598] 
[    4.062396] ---[ end trace 0000000000000000 ]---
[    4.064239] Kernel panic - not syncing: Fatal exception
[    4.066685] Rebooting in 10 seconds..
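
As a reading aid for the oops above, here is a minimal sketch that decodes
the Cause register and the instruction word marked <...> in the Code: line.
It assumes the standard MIPS32 Release 2 field layouts (ExcCode in Cause
bits 6..2, DI/EI encoded as COP0/MFMC0). The function names are made up for
illustration and are not from the kernel. If the decoding is right, the
marked word is a plain "di", which would be an odd thing to raise a
reserved-instruction exception and might fit the cache or memory suspicion.
That is an inference, not something the log proves.

```python
# Values are copied from the logs in this mail; field layouts are assumed
# to follow the MIPS32 Release 2 encoding. Names are illustrative only.

EXC_NAMES = {0x00: "Int (interrupt)", 0x0A: "RI (reserved instruction)"}


def exc_code(cause: int) -> int:
    """Return ExcCode, bits 6..2 of the CP0 Cause register."""
    return (cause >> 2) & 0x1F


def decode_mfmc0(word: int) -> str:
    """Recognise the DI/EI encodings; anything else is 'unknown'."""
    opcode = (word >> 26) & 0x3F   # 0x10 = COP0
    rs = (word >> 21) & 0x1F       # 0x0B = MFMC0
    rt = (word >> 16) & 0x1F       # destination GPR for the old Status
    rd = (word >> 11) & 0x1F       # 12 = CP0 Status
    sc = (word >> 5) & 0x1         # 0 = di, 1 = ei
    rest = word & 0x7DF            # bits 10..0 except sc must be zero
    if opcode == 0x10 and rs == 0x0B and rd == 12 and rest == 0:
        return f"{'ei' if sc else 'di'} ${rt}"
    return "unknown"


if __name__ == "__main__":
    for cause in (0x00801800, 0x08800028):  # RCU stall trace, then the oops
        code = exc_code(cause)
        print(f"Cause {cause:08x}: ExcCode {code:02x} "
              f"{EXC_NAMES.get(code, '?')}")
    print("faulting word 41606000:", decode_mfmc0(0x41606000))
```

The ExcCode values it yields (00 and 0a) agree with what the kernel already
printed next to the Cause lines, so only the instruction decoding adds
anything new.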



