[Letux-kernel] ti-soc-thermal 48002524.bandgap: eocz timed out waiting high

H. Nikolaus Schaller hns at goldelico.com
Mon Feb 21 15:10:18 CET 2022


Hi,
I had better start a new topic on this to report my findings.

>>>>> 
>>>>> I just started to cross-check letux-5.17-rc4. At the moment it only shows the
>>>>> 
>>>>> [  330.002105] ti-soc-thermal 48002524.bandgap: eocz timed out waiting high

if I run

	while true; do cat /sys/class/thermal/thermal_zone0/temp; done

>>>>> 
>>>>> This does not appear to depend on input_current_limit or anything else.
>>>>> But it also occurs in a high-load situation.
>>>>> 
>>>>> This looks like a "real bug", which hopefully can be bisected more
>>>>> reliably.
>>> 
>>> Yes, this was easily bisected and seems to be this issue:
>>> 
>>> # first bad commit: [514cbabb01422d501d533a6495b924e4c22d4822] thermal: ti-soc-thermal: Simplify polling with iopoll
>>> 
>>> What I suspect here is that the minimal waiting time mentioned in the commit
>>> message may not be enough for the omap3530-600MHz models, even though it is
>>> enough for all other OMAP variants.
>> That does sound likely.
> 
> I have looked into this and the result makes me smile.
> There was no error check in the old code.
> 
> So the timeout has likely been there for a long time but was not visible before.
> 
> Still, the question remains why it only occurs on the omap3530-600MHz. Potentially the
> waiting times are too short. I went through the omap3530 errata but could not find a
> related one.
> 
> But it is described in the TRM in section 7.4.6.2.1 Single Conversion Mode (CONTCONV = 0)
> and Fig. 7-15.
> 
> According to this, the SOC bit should be set to 1, and then we have to wait 11-14 cycles
> of the 32768Hz clock (ca. 335-430µs, i.e. below 500µs) until EOCZ goes high. After the
> conversion ends it goes low and the data is valid (was 4 cycles before but that is not relevant).
> 
> What comes to my mind is that this readout is not protected against concurrent access.
> AFAIK the bandgap ADC is used by multiple kernel-internal and external clients.
> So if one starts a conversion and a second one tries to start while the first is
> already running, these tests may interfere. This would happen more often if I
> read out through sysfs in a loop.
> 
> BTW: there is a spinlock for reading the result in ti_bandgap_read_temperature(),
> but it does not cover triggering a readout.
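
As a rough cross-check of the conversion timing quoted above, the 11-14 cycles
of the 32768Hz clock can be converted to microseconds. This is only a sketch of
the arithmetic; the helper name clk32k_cycles_to_us() is made up for this mail
and is not taken from the driver or the TRM.

```c
#include <assert.h>
#include <stdint.h>

#define CLK_32K_HZ 32768u

/*
 * Hypothetical helper (not part of ti-soc-thermal): duration of `cycles`
 * periods of the 32768 Hz clock in whole microseconds, rounded down.
 */
static uint32_t clk32k_cycles_to_us(uint32_t cycles)
{
	/* cycles * 1000000 must still fit into 32 bits */
	assert(cycles <= 4294u);

	return (cycles * 1000000u) / CLK_32K_HZ;
}
```

With this, clk32k_cycles_to_us(11) gives 335 and clk32k_cycles_to_us(14) gives
427, so the wait for EOCZ going high is expected to be well below half a
millisecond.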

After extending the spinlock to cover the SOC+EOCZ mechanism, the "eocz timed out"
messages are gone!

So my assumption that there is some lock missing seems to be true.

But generally it is not a good idea to have a spinlock around a waiting loop
of udelay, as it stops all other kernel activities (except some interrupts)...
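
To illustrate the idea of holding one lock across the whole trigger/wait/read
sequence, here is a userspace sketch. It is not the driver code: struct
fake_adc, its "busy" flag and all function names are hypothetical, the reading
is a made-up constant, and a pthread mutex merely stands in for whatever lock
the driver ends up using. It only shows that no second sequence can start
while one is in flight once the lock covers the trigger as well.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define READS_PER_THREAD 1000
#define MAX_THREADS 16

/* Hypothetical stand-in for the ADC: one conversion sequence at a time. */
struct fake_adc {
	pthread_mutex_t lock;	/* covers trigger, wait and read together */
	bool busy;		/* true while a sequence is in flight */
	int overlaps;		/* sequences started while another was running */
	int conversions;	/* completed sequences */
};

static void fake_adc_init(struct fake_adc *adc)
{
	pthread_mutex_init(&adc->lock, NULL);
	adc->busy = false;
	adc->overlaps = 0;
	adc->conversions = 0;
}

/* Trigger, wait and read as a single critical section. */
static int fake_adc_read_locked(struct fake_adc *adc)
{
	int temp;

	pthread_mutex_lock(&adc->lock);
	if (adc->busy)
		adc->overlaps++;	/* cannot happen while the lock is held */
	adc->busy = true;		/* stands for "set SOC" */
	temp = 42000;			/* made-up reading in millidegrees */
	adc->busy = false;		/* stands for "EOCZ low, data valid" */
	adc->conversions++;
	pthread_mutex_unlock(&adc->lock);

	return temp;
}

static void *reader_thread(void *arg)
{
	struct fake_adc *adc = arg;

	for (int i = 0; i < READS_PER_THREAD; i++)
		fake_adc_read_locked(adc);
	return NULL;
}

/* Run nthreads concurrent readers and return the number of overlaps seen. */
static int run_readers(struct fake_adc *adc, int nthreads)
{
	pthread_t tids[MAX_THREADS];

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tids[i], NULL, reader_thread, adc);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	return adc->overlaps;
}
```

Calling fake_adc_init() and then run_readers() with a few threads should report
zero overlaps, which corresponds to the loop over thermal_zone0/temp no longer
colliding with another client.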

I will experiment a little more with this, maybe to find out where the double
access attempt is coming from. After that, the best approach is probably to
discuss this finding on the linux-omap mailing list with OMAP specialists.

BR,
Nikolaus

