[Letux-kernel] ti-soc-thermal 48002524.bandgap: eocz timed out waiting high

H. Nikolaus Schaller hns at goldelico.com
Wed Feb 23 21:01:44 CET 2022



> Am 21.02.2022 um 15:10 schrieb H. Nikolaus Schaller <hns at goldelico.com>:
> 
> Hi,
> I better start a new topic on this to report findings.
> 
>>>>>> 
>>>>>> I just started to cross-check letux-5.17-rc4. At the moment it only shows
>>>>>> the
>>>>>> 
>>>>>> [  330.002105] ti-soc-thermal 48002524.bandgap: eocz timed out waiting high
> 
> if I run
> 
> 	while true; do cat /sys/class/thermal/thermal_zone0/temp; done
> 
>>>>>> 
>>>>>> This does not appear to depend on input_current_limit or anything else.
>>>>>> But it also occurs in a high-load situation.
>>>>>> 
>>>>>> This looks like a "real bug" - which hopefully can be bisected more
>>>>>> reliably.
>>>> 
>>>> Yes, this was easily bisected and seems to be this issue:
>>>> 
>>>> # first bad commit: [514cbabb01422d501d533a6495b924e4c22d4822] thermal: ti-soc-thermal: Simplify polling with iopoll
>>>> 
>>>> What I suspect here is that the minimal waiting time mentioned in the commit
>>>> message may not be enough for the omap3530-600MHz models but for all other
>>>> OMAP variants.
>>> That does sound likely.
>> 
>> I have looked into this and the result makes me smile.
>> There was no error check in the old code.
>> 
>> So likely the timeout was there for a long time but did not become visible before.
>> 
>> Still the question remains why it only occurs on the omap3530-600MHz. Potentially the
>> waiting times are too short. Having the omap3530 errata open, I could not find a
>> related one.
>> 
>> But it is described in the TRM in section 7.4.6.2.1 Single Conversion Mode (CONTCONV = 0)
>> and Fig. 7-15.
>> 
>> According to this the SOC bit should be set to 1 and then we have to wait 11-14 cycles
>> of the 32768Hz clock (which is ca. 340-430µs) and then EOCZ should go high. After conversion
>> ends it goes low and the data is valid (was 4 cycles before but that is not relevant).
>> 
>> What comes to my mind is that this readout is not locked for multiple threads.
>> AFAIK the bandgap ADC is used by multiple kernel-internal and external clients.
>> So if one starts a conversion and a second one tries to start while the first is
>> already running, these tests may interfere. This would happen more often if I
>> read out through /sysfs in a loop.
>> 
>> BTW: there is a spinlock for reading the result in ti_bandgap_read_temperature()
>> but not covering the trigger of a readout.
> 
> After moving the spinlock so that it covers the SOC+EOCZ mechanism, the "eocz timed out"
> messages are gone!
> 
> So my assumption that there is some lock missing seems to be true.
> 
> But generally it is not a good idea to have a spinlock around a waiting loop
> of udelay() as it stops all other kernel activities (except some interrupts)...
> 
> I will experiment a little more around this maybe to find out where the double
> access attempt is coming from and then probably the best is to discuss this
> finding on the linux-omap mailing list with OMAP specialists.

Despite several experiments and code reviews I have not found anything
yet.

Especially since all clients of the bandgap ADC (the thermal framework polling every
second for overtemperature, or access through sysfs via thermal_zone0/temp or the
hwmon API) go through thermal_zone_get_temp(), which has its own mutex!

https://elixir.bootlin.com/linux/latest/source/drivers/thermal/thermal_helpers.c#L78

So no additional locking should be needed, unless taking the spinlock helps only
because it stops interfering background activities.

Ah, I have one idea:

What happens if SOC is written to 1 and then some other process interferes while
we are waiting for EOCZ to go high?

If the interference lasts long enough, has EOCZ already been reset before
ti_bandgap_force_single_read() can find EOCZ being 1?

Fig. 7-15 does not describe this situation.

So the processor may miss EOCZ going high and low again, and then wait
until the timeout?
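
To make the suspected scenario concrete, here is a minimal userspace model in
plain C. It is not the driver code: struct eocz_model, eocz_is_high() and
poll_for_high() are hypothetical names, and the time at which EOCZ falls again
is an assumption, since I have no figure for the pulse width. It only shows
that a poller which starts late can never observe the high phase.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: EOCZ is high only inside [rise_us, fall_us),
 * measured from the moment SOC was written to 1. */
struct eocz_model {
	uint32_t rise_us;	/* e.g. 14 cycles of 32768 Hz, about 427 us */
	uint32_t fall_us;	/* assumed end of conversion */
};

static bool eocz_is_high(const struct eocz_model *m, uint32_t t_us)
{
	return t_us >= m->rise_us && t_us < m->fall_us;
}

/*
 * Sample EOCZ every step_us, beginning start_us after SOC was set
 * (start_us models the delay caused by interference), and give up
 * after timeout_us. Returns true if a high level was observed.
 */
static bool poll_for_high(const struct eocz_model *m, uint32_t start_us,
			  uint32_t step_us, uint32_t timeout_us)
{
	assert(step_us > 0);

	for (uint32_t t = start_us; t < start_us + timeout_us; t += step_us)
		if (eocz_is_high(m, t))
			return true;

	return false;	/* pulse was missed or never came */
}
```

With rise_us = 427 and an assumed fall_us = 2000, a poller starting at 0 with
10 us steps sees the high level, while one delayed by 2500 us runs into the
timeout although the conversion itself completed.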

In that case we cannot do much about it. The readout value is likely correct,
so we can simply ignore the timeout and print no warning. More critical is a
timeout while waiting for EOCZ to go low. That would indicate a real issue with
the hardware.
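
A rough sketch of that policy, again as a plain C model and not a patch for
ti_bandgap_force_single_read(): the register access is replaced by a scripted
fake, and all names (fake_eocz, fake_read, wait_for_level, single_conversion)
are made up for illustration. A missed high phase is only recorded, while a
timeout waiting for low is returned as an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for reading the EOCZ bit from the register. */
typedef bool (*eocz_read_fn)(void *ctx);

/* Scripted fake: replays samples[], then keeps returning the last one. */
struct fake_eocz {
	const bool *samples;
	unsigned int n;
	unsigned int pos;
};

static bool fake_read(void *ctx)
{
	struct fake_eocz *f = ctx;
	bool v;

	assert(f->n > 0);
	v = f->samples[f->pos];
	if (f->pos + 1 < f->n)
		f->pos++;
	return v;
}

/* Poll up to max_polls times until EOCZ equals level. */
static bool wait_for_level(eocz_read_fn read_eocz, void *ctx, bool level,
			   unsigned int max_polls)
{
	while (max_polls--)
		if (read_eocz(ctx) == level)
			return true;
	return false;
}

/*
 * Returns 0 if EOCZ ended up low (data assumed valid), -1 if it never
 * went low. A missed high phase is reported via *missed_high but is
 * not treated as an error and would not be logged.
 */
static int single_conversion(eocz_read_fn read_eocz, void *ctx,
			     unsigned int max_polls, bool *missed_high)
{
	*missed_high = !wait_for_level(read_eocz, ctx, true, max_polls);

	if (!wait_for_level(read_eocz, ctx, false, max_polls))
		return -1;	/* stuck high: likely a real hardware issue */

	return 0;
}
```

Whether the value read after a missed high phase is really valid is exactly
the open question, so this is only meant to show the proposed error handling.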

BR,
Nikolaus


