[Letux-kernel] Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM

Sat Aug 1 22:57:36 CEST 2020

A little more information, in case anyone has further ideas.

I can confirm that this happened once with the device idle and no
network connection.

Based on the information I have been able to extract, the call stack does
seem to involve omap4_enter_lowpower, but I can't be certain.

The main JTAG access I have is the ability to read out what seems to be
kernel virtual memory via the other core, which is not locked up but is
sitting in WFI. I tried to add some tracing by writing a value to a
global variable inside the problem function and then flushing the D$,
but the delay this adds (or the cache flush itself) seems to stop the
lockup from occurring most of the time. It did lock up once with this
added, but reading out that area of memory then failed, possibly because
the locked-up core was confusing the cache coherency magic inside the
cores.

Since that lock-up I added 20 NOPs after the cache flush, to try and make
sure the cache flush really does work, and with those added it does not
lock up at all.
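
For illustration, the kind of breadcrumb described above could look
roughly like the sketch below. It is a plain C userspace model, not the
actual kernel change: crumb_record, crumb_flush, crumb_valid, CRUMB_MAGIC
and the struct layout are all hypothetical names, and crumb_flush is only
a placeholder for whatever D-cache clean and barrier the kernel build
would really use. The part/inst/idx fields are an assumption about what
would be worth recording just before the register read.

```c
/* Userspace model of a JTAG-readable breadcrumb. All names are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Arbitrary marker so the record can be recognised in a raw memory dump. */
#define CRUMB_MAGIC 0xB4EADC40u

struct crumb {
    uint32_t magic;  /* CRUMB_MAGIC once the record is complete */
    uint32_t seq;    /* incremented on every write */
    uint32_t part;   /* assumed: partition of the register being read */
    uint32_t inst;   /* assumed: instance offset */
    uint32_t idx;    /* assumed: register offset */
    uint32_t check;  /* XOR of the fields above, catches a torn record */
};

static_assert(sizeof(struct crumb) == 6 * sizeof(uint32_t),
              "record is expected to have no padding");

/* volatile so the compiler cannot drop or merge the stores. */
volatile struct crumb last_read;

/* Placeholder only: a kernel version would clean the D-cache lines
 * holding the record and then issue a barrier. Here it is just a fence. */
void crumb_flush(const volatile void *addr, unsigned long len)
{
    (void)addr;
    (void)len;
    atomic_thread_fence(memory_order_seq_cst);
}

uint32_t crumb_checksum(const volatile struct crumb *c)
{
    return c->magic ^ c->seq ^ c->part ^ c->inst ^ c->idx;
}

/* Intended to be called immediately before the read being traced. */
void crumb_record(uint32_t part, uint32_t inst, uint32_t idx)
{
    last_read.magic = 0;            /* mark the record invalid while updating */
    last_read.seq++;
    last_read.part = part;
    last_read.inst = inst;
    last_read.idx = idx;
    last_read.magic = CRUMB_MAGIC;
    last_read.check = crumb_checksum(&last_read);
    crumb_flush(&last_read, sizeof(last_read));
}

/* The check a debugger-side script could apply to the bytes it reads back. */
int crumb_valid(const volatile struct crumb *c)
{
    return c->magic == CRUMB_MAGIC && c->check == crumb_checksum(c);
}
```

The magic and checksum are there so that a stale or half-written record
read back over JTAG can be told apart from a good one; the sequence
number shows whether the record changed between two read-outs. None of
this addresses the timing problem: the extra stores and the flush still
add delay before the read.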

Is there a better way to take advantage of this ability to read out
memory for debugging?

Best

David

On Sun, 2020-07-26 at 18:59 +0100, David Shah wrote:
> Hi all,
> 
> I am looking into random lockups - significantly rarer than once a day
> in typical usage, various patterns like lots of bursty network traffic
> increase frequency - that affect both the uEVM and the Pyra (also
> OMAP5432 based) on newer kernels (currently testing with 5.6 but I have
> seen lockups with 5.7 too).
> 
> Currently I'm working with the uEVM as it is a bit easier to connect
> the JTAG adapter. I managed to get a lockup with the JTAG attached, and
> unfortunately the processor is badly locked up enough (presumably a
> stuck memory bus?) that JTAG isn't able to get a register dump or
> stacktrace. But I do get the following error which at least gives a
> PC: 
> 
> CortexA15_0: Trouble Halting Target CPU: (Error -1323 @ 0xC0223E0C)
> Device failed to enter debug/halt mode because pipeline is stalled.
> Power-cycle the board. If error persists, confirm configuration and/or
> try more reliable JTAG settings (e.g. lower TCLK). (Emulation package
> 9.2.0.00002) 
> 
> The second core is just sitting at WFI, don't think there is anything
> suspicious about that.
> 
> Looking at the kernel disassembly this is the actual register read (ldr
> r0, [r1]) part of omap4_prminst_read_inst_reg.
> 
> My best guess is that it is trying to read from a register that doesn't
> exist or isn't responding due to the current power configuration, but I
> wonder if anyone has seen this before or has any more clues on how to
> debug this? It's a shame that I can't seem to see what r1 is or get a
> backtrace. It looks like it might be possible to set some kind of
> timeout on the interconnect, has anyone tried something like that to
> debug this kind of issue?
> 
> Best
> 
> David Shah
> 
>