[Letux-kernel] SMP issue between LX16 and LX20 found...

H. Nikolaus Schaller hns at goldelico.com
Sun Jun 15 12:07:02 CEST 2025


Hi,

> On 14.06.2025 at 23:08, Paul Boddie <paul at boddie.org.uk> wrote:
> 
> On Saturday, 14 June 2025 19:58:46 CEST H. Nikolaus Schaller wrote:
>> 
>> So we have the issue that on x1600, ingenic_ost_cevt_cb()
>> breaks when doing ost->soc_info->version and the pointer
>> ost->soc_info is "randomized". This means that ost = timer->ost
>> memory area is overwritten by someone.
>> 
>> What is the easiest way to find out who overwrites this?
> 
> I think we need a good mental model of how this is supposed to work, which I 
> don't think I have really focused on until now. It seems to me that both CPUs 
> (processor cores) access the same physical memory range for the peripherals,

Yes, they do.

> so unless there is some kind of extra layer or duplicate sets of some 
> peripherals, both CPUs are capable of accessing the same registers and 
> operating on the same peripherals.
> 
> So, it seems desirable that only one CPU would set up any given peripheral.

I think that is what the SMP code in the kernel usually does.

> Structures allocated during this process would, of course, be accessible later 
> by both CPUs, but they would only be allocated once by a particular CPU. Once 
> initialised, a given CPU might need to access a peripheral, perhaps in 
> response to an interrupt, and this would need to be done without disruption 
> from the other CPU.

This is usually done through some dev_id or smp_processor_id(), as in __request_percpu_irq().
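
To make the dev_id idea concrete, here is a minimal userspace sketch in C. It is
only a model: all demo_* names and the NR_CPUS value are made up for illustration,
and nothing here is the real ingenic-ost or genirq code. It shows one CPU setting
up all per-CPU slots once, while each handler invocation touches only the slot
that belongs to the CPU it runs on.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2 /* assumption: two cores, as on jz4780 or x2000 */

/* Hypothetical per-CPU timer state, not the real driver structure. */
struct demo_timer {
    unsigned int cpu;     /* CPU that owns this slot */
    unsigned long events; /* interrupts handled on that CPU */
};

/* One slot per CPU, allocated once for the whole system. */
static struct demo_timer demo_timers[NR_CPUS];

/* Stand-in for a per-CPU dev_id lookup: return the slot of the given CPU. */
static struct demo_timer *demo_this_cpu_timer(unsigned int cpu)
{
    return cpu < NR_CPUS ? &demo_timers[cpu] : NULL;
}

/* Setup runs once, on a single CPU, and initialises every slot. */
static void demo_setup(void)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        demo_timers[cpu].cpu = cpu;
        demo_timers[cpu].events = 0;
    }
}

/* Modelled per-CPU interrupt handler: it only touches its own slot. */
static int demo_handle_irq(unsigned int cpu)
{
    struct demo_timer *t = demo_this_cpu_timer(cpu);

    if (!t)
        return -1; /* no slot for this CPU */
    t->events++;
    return 0;
}
```

In this model, calling demo_setup() once and then demo_handle_irq(0) and
demo_handle_irq(1) never lets one CPU modify the other CPU's slot.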

> Usually, people focus on things like locking primitives when discussing multi-
> CPU support, which is obviously relevant, but what I am missing is the control 
> over initialisation, whether it must be done using a single CPU, whether it 
> could be distributed across both CPUs, and how this is controlled.

The strange thing is that it all works on dual-core CPUs like the jz4780 or x2000
but fails when there is only a single CPU, as in the x1600. This points to a
different issue (memory management? cache management? compiler bug?)
than the one you are thinking of.
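
One low-tech way to narrow down who overwrites the ost memory area could be guard
words around the pointer. The following is a minimal userspace sketch in C, not
a patch: the demo_* structures, the guard fields and the canary value are all
hypothetical and only mimic the shape of ost->soc_info->version. If the guards
are intact but soc_info is garbage, the write hit that field specifically; if
the guards are damaged too, something overruns a larger region.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CANARY       0x5aa5c33cU /* arbitrary, hypothetical marker value */
#define DEMO_HEAD_DAMAGED 1
#define DEMO_TAIL_DAMAGED 2

/* Hypothetical stand-ins for the driver structures. */
struct demo_soc_info {
    int version;
};

struct demo_ost {
    uint32_t guard_head;                  /* canary before the pointer */
    const struct demo_soc_info *soc_info; /* the pointer seen "randomized" */
    uint32_t guard_tail;                  /* canary after the pointer */
};

/* Initialise the structure and arm both guard words. */
static void demo_ost_init(struct demo_ost *ost,
                          const struct demo_soc_info *info)
{
    ost->guard_head = DEMO_CANARY;
    ost->soc_info = info;
    ost->guard_tail = DEMO_CANARY;
}

/* Return 0 if both guards are intact, else a bitmask of damaged guards. */
static int demo_ost_check(const struct demo_ost *ost)
{
    int damaged = 0;

    if (ost->guard_head != DEMO_CANARY)
        damaged |= DEMO_HEAD_DAMAGED;
    if (ost->guard_tail != DEMO_CANARY)
        damaged |= DEMO_TAIL_DAMAGED;
    return damaged;
}
```

Calling such a check at several points between initialisation and the first
callback would bracket the moment at which the corruption happens.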

BTW: my current x2000 boot often fails like:

[    0.164566] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.166649] io scheduler mq-deadline registered
[    0.168442] io scheduler kyber registered
[   21.170219] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   21.170872] rcu:     1-....: (1 ticks this GP) idle=073c/1/0x40000002 softirq=8/9 fqs=1050
[   21.170801] rcu:     (detected by 0, t=2102 jiffies, g=-1171, q=37 ncpus=2)
[   21.169409] Sending NMI from CPU 0 to CPUs 1:
[   21.169420] NMI backtrace for cpu 1
[   21.169427] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT 
[   21.169437] Hardware name: Letux LX20v0.3
[   21.169442] $ 0   : 00000000 fffffff0 8100f540 8100f540
[   21.169457] $ 4   : 00407000 80c08540 04200042 00000001
[   21.169468] $ 8   : 00000000 ffffffe7 500015a4 00002a44
[   21.169480] $12   : ffffffff 0000b67e 00000000 00000000
[   21.169490] $16   : 80b8a4a0 80c08540 80b7c040 0000000a
[   21.169501] $20   : 80c052f0 ffff8afc 40224f64 d1e00046
[   21.169512] $24   : 7460f159 8054eb10                  
[   21.169523] $28   : 818a0000 81813f80 00000080 8004bda0
[   21.169534] Hi    : 00002a44
[   21.169537] Lo    : 500015a4
[   21.169540] epc   : 8004b164 handle_softirqs+0xd4/0x2d4
[   21.169554] ra    : 8004bda0 irq_exit+0xc8/0x168
[   21.169561] Status: 14001f03 KERNEL EXL IE 
[   21.169571] Cause : 00801800 (ExcCode 00)
[   21.169575] PrId  : 00132000 (Ingenic XBurst II)
[   21.169580] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT 
[   21.169588] Hardware name: Letux LX20v0.3
[   21.169592] Stack : 81813d2c 800a9388 01000000 db3faa92 80b90000 80bbe153 00000000 00000000
[   21.169613]         00000000 00000000 00000000 00000000 00000000 00000001 81813ce0 81893180
[   21.169630]         00000000 00000000 80abfbdc 00000000 00000001 81813b3c 00000000 44495020
[   21.169647]         81813b8a 80c175f8 80c1763c 00000020 80b90000 00000000 00000000 80abfbdc
[   21.169665]         00000020 80ab7c68 00000030 800263c8 00000000 00000000 00000004 80c10004
[   21.169683]         ...
[   21.169689] Call Trace:
[   21.169691] [<80029cd0>] show_stack+0x38/0x118
[   21.169707] [<800200d8>] dump_stack_lvl+0x74/0xb0
[   21.169720] [<8094f7b0>] nmi_cpu_backtrace+0x13c/0x144
[   21.169732] [<800263d8>] handle_backtrace+0x10/0x54
[   21.169740] [<801019ec>] __flush_smp_call_function_queue+0x174/0x360
[   21.169751] [<80021da8>] ingenic_xburst2_mbox_handler+0x94/0xc8
[   21.169759] [<800b96ec>] handle_percpu_devid_irq+0xc0/0x194
[   21.169770] [<800b2a90>] handle_irq_desc+0x78/0x90
[   21.169777] [<80971cb0>] do_IRQ+0x18/0x24
[   21.169787] [<800241bc>] handle_int+0x140/0x14c
[   21.169794] [<8004bda0>] irq_exit+0xc8/0x168
[   21.169802] 

A successful boot looks like:

[    0.178235] Key type asymmetric registered
[    0.182320] Asymmetric key parser 'x509' registered
[    0.187383] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.194844] io scheduler mq-deadline registered
[    0.199411] io scheduler kyber registered
[    0.205242] ledtrig-cpu: registered to indicate activity on CPUs
[    0.212132] Serial: 8250/16550 driver, 9 ports, IRQ sharing disabled
[    0.225323] printk: legacy console [ttyS2] disabled
[    0.230880] 10032000.serial: ttyS2 at MMIO 0x10032000 (irq = 45, base_baud = 1500000) is a 16550A
[    0.239960] printk: legacy console [ttyS2] enabled
[    0.239960] printk: legacy console [ttyS2] enabled
[    0.249610] printk: legacy bootconsole [x1000_uart0] disabled
[    0.249610] printk: legacy bootconsole [x1000_uart0] disabled
[    0.266105] 10033000.serial: ttyS0 at MMIO 0x10033000 (irq = 44, base_baud = 1500000) is a 16550A
[    0.283180] brd: module loaded

This could mean that one of the spinlocks blocks the second core while it waits for the other...
Removing the printk from __request_percpu_irq() made it work again.

So it seems more likely that we have a race condition...
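
To tell a core that is stuck on a lock apart from other kinds of stalls, a spin
loop with an upper bound can help, because it reports a timeout instead of
hanging silently. This is a minimal userspace sketch using C11 atomics; the
demo_* names and the spin limit are assumptions for illustration, and it is not
the kernel's spinlock implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock built on a C11 atomic_flag. */
struct demo_lock {
    atomic_flag held;
};

#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/*
 * Try to take the lock, giving up after max_spins attempts.
 * Returns true if the lock was acquired, false on timeout.
 */
static bool demo_lock_bounded(struct demo_lock *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        /* test_and_set returns the previous state: false means we got it */
        if (!atomic_flag_test_and_set_explicit(&l->held,
                                               memory_order_acquire))
            return true;
    }
    return false; /* still held by someone else: report instead of hanging */
}

/* Release the lock so that another caller can acquire it. */
static void demo_unlock(struct demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A caller that gets false back can log which lock it was waiting for, which is
exactly the information missing from a plain RCU stall report.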

An additional hint is that only one of my three lx20 boards boots well enough.
A fluctuation induced by tiny hardware differences (e.g. how fast some subsystem
initializes) could also explain this.

We would need access to the original x2000 SMP authors but they are no longer active...

BR,
Nikolaus

