[Letux-kernel] SMP issue between LX16 and LX20 found...
H. Nikolaus Schaller
hns at goldelico.com
Sun Jun 15 12:49:05 CEST 2025
> On 15.06.2025 at 12:07, H. Nikolaus Schaller <hns at goldelico.com> wrote:
>
> Hi,
>
>> On 14.06.2025 at 23:08, Paul Boddie <paul at boddie.org.uk> wrote:
>>
>> On Saturday, 14 June 2025 19:58:46 CEST H. Nikolaus Schaller wrote:
>>>
>>> So we have the issue that on x1600, ingenic_ost_cevt_cb()
>>> breaks when doing ost->soc_info->version and the pointer
>>> ost->soc_info is "randomized". This means that ost = timer->ost
>>> memory area is overwritten by someone.
>>>
>>> What is the easiest way to find out who overwrites this?
>>
>> I think we need a good mental model of how this is supposed to work, which I
>> don't think I have really focused on until now. It seems to me that both CPUs
>> (processor cores) access the same physical memory range for the peripherals,
>
> Yes, they do.
>
>> so unless there is some kind of extra layer or duplicate sets of some
>> peripherals, both CPUs are capable of accessing the same registers and
>> operating on the same peripherals.
>>
>> So, it seems desirable that only one CPU would set up any given peripheral.
>
> I think that is what the SMP code in the kernel usually does.
>
>> Structures allocated during this process would, of course, be accessible later
>> by both CPUs, but they would only be allocated once by a particular CPU. Once
>> initialised, a given CPU might need to access a peripheral, perhaps in
>> response to an interrupt, and this would need to be done without disruption
>> from the other CPU.
>
> This is usually done via some dev_id or smp_processor_id(), as in __request_percpu_irq().
>
>> Usually, people focus on things like locking primitives when discussing multi-
>> CPU support, which is obviously relevant, but what I am missing is the control
>> over initialisation, whether it must be done using a single CPU, whether it
>> could be distributed across both CPUs, and how this is controlled.
>
> The strange thing is that it all works on the dual-core CPUs like jz4780 or x2000
> but fails when there is only a single CPU, as in the x1600. This points to a
> different issue (memory management? cache management? compiler bug?)
> than the one you are thinking of.
>
> BTW: my current x2000 boot often fails like:
>
> [ 0.164566] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> [ 0.166649] io scheduler mq-deadline registered
> [ 0.168442] io scheduler kyber registered
> [ 21.170219] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 21.170872] rcu: 1-....: (1 ticks this GP) idle=073c/1/0x40000002 softirq=8/9 fqs=1050
> [ 21.170801] rcu: (detected by 0, t=2102 jiffies, g=-1171, q=37 ncpus=2)
> [ 21.169409] Sending NMI from CPU 0 to CPUs 1:
> [ 21.169420] NMI backtrace for cpu 1
> [ 21.169427] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT
> [ 21.169437] Hardware name: Letux LX20v0.3
> [ 21.169442] $ 0 : 00000000 fffffff0 8100f540 8100f540
> [ 21.169457] $ 4 : 00407000 80c08540 04200042 00000001
> [ 21.169468] $ 8 : 00000000 ffffffe7 500015a4 00002a44
> [ 21.169480] $12 : ffffffff 0000b67e 00000000 00000000
> [ 21.169490] $16 : 80b8a4a0 80c08540 80b7c040 0000000a
> [ 21.169501] $20 : 80c052f0 ffff8afc 40224f64 d1e00046
> [ 21.169512] $24 : 7460f159 8054eb10
> [ 21.169523] $28 : 818a0000 81813f80 00000080 8004bda0
> [ 21.169534] Hi : 00002a44
> [ 21.169537] Lo : 500015a4
> [ 21.169540] epc : 8004b164 handle_softirqs+0xd4/0x2d4
> [ 21.169554] ra : 8004bda0 irq_exit+0xc8/0x168
> [ 21.169561] Status: 14001f03 KERNEL EXL IE
> [ 21.169571] Cause : 00801800 (ExcCode 00)
> [ 21.169575] PrId : 00132000 (Ingenic XBurst II)
> [ 21.169580] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc1-letux-mips+ #3265 PREEMPT
> [ 21.169588] Hardware name: Letux LX20v0.3
> [ 21.169592] Stack : 81813d2c 800a9388 01000000 db3faa92 80b90000 80bbe153 00000000 00000000
> [ 21.169613] 00000000 00000000 00000000 00000000 00000000 00000001 81813ce0 81893180
> [ 21.169630] 00000000 00000000 80abfbdc 00000000 00000001 81813b3c 00000000 44495020
> [ 21.169647] 81813b8a 80c175f8 80c1763c 00000020 80b90000 00000000 00000000 80abfbdc
> [ 21.169665] 00000020 80ab7c68 00000030 800263c8 00000000 00000000 00000004 80c10004
> [ 21.169683] ...
> [ 21.169689] Call Trace:
> [ 21.169691] [<80029cd0>] show_stack+0x38/0x118
> [ 21.169707] [<800200d8>] dump_stack_lvl+0x74/0xb0
> [ 21.169720] [<8094f7b0>] nmi_cpu_backtrace+0x13c/0x144
> [ 21.169732] [<800263d8>] handle_backtrace+0x10/0x54
> [ 21.169740] [<801019ec>] __flush_smp_call_function_queue+0x174/0x360
> [ 21.169751] [<80021da8>] ingenic_xburst2_mbox_handler+0x94/0xc8
> [ 21.169759] [<800b96ec>] handle_percpu_devid_irq+0xc0/0x194
> [ 21.169770] [<800b2a90>] handle_irq_desc+0x78/0x90
> [ 21.169777] [<80971cb0>] do_IRQ+0x18/0x24
> [ 21.169787] [<800241bc>] handle_int+0x140/0x14c
> [ 21.169794] [<8004bda0>] irq_exit+0xc8/0x168
> [ 21.169802]
>
> A successful boot looks like:
>
> [ 0.178235] Key type asymmetric registered
> [ 0.182320] Asymmetric key parser 'x509' registered
> [ 0.187383] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> [ 0.194844] io scheduler mq-deadline registered
> [ 0.199411] io scheduler kyber registered
> [ 0.205242] ledtrig-cpu: registered to indicate activity on CPUs
> [ 0.212132] Serial: 8250/16550 driver, 9 ports, IRQ sharing disabled
> [ 0.225323] printk: legacy console [ttyS2] disabled
> [ 0.230880] 10032000.serial: ttyS2 at MMIO 0x10032000 (irq = 45, base_baud = 1500000) is a 16550A
> [ 0.239960] printk: legacy console [ttyS2] enabled
> [ 0.239960] printk: legacy console [ttyS2] enabled
> [ 0.249610] printk: legacy bootconsole [x1000_uart0] disabled
> [ 0.249610] printk: legacy bootconsole [x1000_uart0] disabled
> [ 0.266105] 10033000.serial: ttyS0 at MMIO 0x10033000 (irq = 44, base_baud = 1500000) is a 16550A
> [ 0.283180] brd: module loaded
>
> This could be one of the spinlocks blocking the second core while it waits for the other...
> Removing the printk from __request_percpu_irq() made it work again.
>
> So a race condition seems more likely...
>
> An additional hint is that only one of my three LX20 boards boots well enough.
> A fluctuation induced by tiny hardware differences (how fast some subsystem initializes)
> could also explain this.
>
> We would need access to the original x2000 SMP authors but they are no longer active...
We may indeed have a cache or memory issue. Here is a boot log from the same µSD card, which
failed once and booted once as above, now in a different LX20 unit:
[ 0.473419] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support!
[ 0.737557] EXT4-fs (mmcblk0p2): recovery complete
[ 0.743644] EXT4-fs (mmcblk0p2): mounted filesystem 029581e7-6fd0-452d-a36a-94440b89df1d r/w with journalled data mode. Quota mode: none.
[ 0.753386] VFS: Mounted root (ext4 filesystem) on device 179:2.
[ 0.783534] devtmpfs: mounted
[ 0.783984] Freeing unused kernel image (initmem) memory: 320K
[ 0.789642] This architecture does not have kernel memory protection.
[ 0.796092] Run /sbin/init as init process
[ 0.800129] with arguments:
[ 0.803086] /sbin/init
[ 0.805762] with environment:
[ 0.808882] HOME=/
[ 0.811220] TERM=linux
Mount failed for selinuxfs on /sys/fs/selinux: No such file or directory
INIT: version 2.88 booting
[info] Using makefile-style concurrent boot in runlevel S.
[....] Starting the hotplug events dispatcher: udevd[ 2.451279] systemd-udevd[578]: starting version 215
^[[c.
[ ok ] Synthesizing the initial hotplug events...done.
^[[c[....] Waiting for /dev to be fully populated...[ 3.975971] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:1220: group 37, block bitmap and bg descriptor inconsistent: 32256 vs 32255 free clusters
[ 4.063505] Reserved instruction in kernel code[#1]:
[ 4.063086] CPU: 0 UID: 0 PID: 578 Comm: udevd Not tainted 6.16.0-rc1-letux-mips+ #3267 PREEMPT
[ 4.063643] Hardware name: Letux LX20v0.3
[ 4.064899] $ 0 : 00000000 00000001 00000000 00000001
[ 4.064636] $ 4 : 00000000 6a29ecbe 00000000 00000000
[ 4.064375] $ 8 : 670b3220 80bb0f90 00000000 6a6be704
[ 4.064113] $12 : 7b6854d3 8ec1c2c8 81ff52e0 20a1c866
[ 4.063852] $16 : 00000001 80b8a4a0 80b90000 81003c00
[ 4.063591] $20 : 00000000 81ff5280 00000000 80b90000
[ 4.063329] $24 : 00000000 81ff52e0
[ 4.063067] $28 : 85366000 85367d18 80c30000 809699e4
[ 4.062806] Hi : 00000000
[ 4.062938] Lo : 00000000
[ 4.063065] epc : 809684ac __schedule+0x88/0x1590
[ 4.062457] ra : 809699e4 schedule+0x30/0xf8
[ 4.064232] Status: 14001c03 KERNEL EXL IE
[ 4.062931] Cause : 08800028 (ExcCode 0a)
[ 4.064187] PrId : 00132000 (Ingenic XBurst II)
[ 4.063320] Modules linked in:
[ 4.063622] Process udevd (pid: 578, threadinfo=(ptrval), task=(ptrval), tls=77eef740)
[ 4.063317] Stack : 20a1c866 35942c1a 7d865353 0fe680fe b3b84ed5 6a29ecbe 00000004 80bb0f3c
[ 4.063444] 00000000 85367de0 00000040 80bb0f6c 80b90000 80b90000 80c30000 8048a740
[ 4.063573] 00000001 00000004 00000000 00000040 80b90000 6a29ecbe 81ff5280 85367de0
[ 4.063700] 80b90000 85367de4 80b90000 00000003 80b90000 80b90000 80c30000 809699e4
[ 4.063828] 14001c03 00000001 85367de0 80b90000 00000001 8001b928 00000001 6a29ecbe
[ 4.063956] ...
[ 4.063654] Call Trace:
[ 4.063348] [<809684ac>] __schedule+0x88/0x1590
[ 4.062397] [<809699e4>] schedule+0x30/0xf8
[ 4.063824] [<8001b928>] try_to_generate_entropy+0x254/0x2a8
[ 4.063996] [<805d9090>] urandom_read_iter+0x104/0x10c
[ 4.063647] [<801cc448>] vfs_read+0x2a4/0x358
[ 4.062520] [<801cd098>] ksys_read+0x90/0x128
[ 4.064123] [<80034028>] syscall_common+0x34/0x58
[ 4.063343]
[ 4.064814] Code: 30421800 1440024b 00000000 <41606000> 0c0336d3 02c02025 00002825 0c01df15 02602025
[ 4.063598]
[ 4.062396] ---[ end trace 0000000000000000 ]---
[ 4.064239] Kernel panic - not syncing: Fatal exception
[ 4.066685] Rebooting in 10 seconds..
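
For reading such dumps, here is a minimal Python sketch that extracts the ExcCode field from the Cause value and the faulting instruction word (the one in angle brackets) from the "Code:" line. The helper names and the small ExcCode table are my own, for illustration only; the field layout (ExcCode in bits 6..2 of Cause) and the DI/EI encodings are taken from the MIPS32 Release 2 definition as I understand it, so please check them against the architecture manual before relying on them.

```python
import re

# Small subset of MIPS32 Cause.ExcCode values (illustrative, not complete).
EXC_CODES = {
    0x00: "Int (interrupt)",
    0x04: "AdEL (address error on load/fetch)",
    0x05: "AdES (address error on store)",
    0x0A: "RI (reserved instruction)",
}


def exc_code(cause: int) -> int:
    """Return the ExcCode field, assumed to be bits 6..2 of the Cause register."""
    return (cause >> 2) & 0x1F


def faulting_word(code_line: str) -> int:
    """Return the instruction word that the oops 'Code:' line marks with <...>."""
    match = re.search(r"<([0-9a-fA-F]{8})>", code_line)
    if match is None:
        raise ValueError("no <xxxxxxxx> marker found in Code: line")
    return int(match.group(1), 16)


def is_di_or_ei(word: int) -> bool:
    """Check for the MIPS32R2 DI/EI encodings (0x41606000 / 0x41606020).

    The rt field (bits 20..16) and the sc bit (bit 5) are masked out,
    since they differ between DI, EI and the chosen destination register.
    """
    return (word & 0xFFE0FFDF) == 0x41606000


if __name__ == "__main__":
    # Values copied from the two dumps in this mail.
    for cause in (0x00801800, 0x08800028):
        code = exc_code(cause)
        print(f"Cause {cause:08x}: ExcCode {code:02x} {EXC_CODES.get(code, '?')}")

    line = "Code: 30421800 1440024b 00000000 <41606000> 0c0336d3 02c02025"
    word = faulting_word(line)
    print(f"faulting word {word:08x}, DI/EI: {is_di_or_ei(word)}")
```

If that decoding is right, the word at epc is a plain "di", which the XBurst II should normally execute without trouble. A reserved instruction exception on it would fit the idea that the CPU fetched something other than what is in memory, but that is only a guess at this point.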