[Letux-kernel] thermal madness

H. Nikolaus Schaller hns at goldelico.com
Sat Sep 14 12:17:22 CEST 2019


> On 14.09.2019 at 10:22, H. Nikolaus Schaller <hns at goldelico.com> wrote:
> 
> Hi,
> 
>> On 13.09.2019 at 22:27, Andreas Kemnade <andreas at kemnade.info> wrote:
>> 
>> On Fri, 13 Sep 2019 21:51:40 +0200
>> "H. Nikolaus Schaller" <hns at goldelico.com> wrote:
>> 
>>>> On 13.09.2019 at 21:44, Andreas Kemnade <andreas at kemnade.info> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I was experimenting a bit with the thermal:
>>>> 
>>>> Fresh after rebooting, autoidling the UARTs and loading some modules, I
>>>> got the letux3704 device to consume 32 mA, so I expect the temperature to
>>>> be low.
>>> 
>>> Indeed. I usually have the GTA04 up and running with X11 etc. so it is
>>> significantly warmer.
>>> 
>>>> Reading the thermal gives this:
>>>> 
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 58500
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 47000
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 47000
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 47000
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 47000
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 47000
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 48500
>>>> root@(none):/# cat /sys/devices/virtual/thermal/thermal_zone0/temp 
>>>> 48500
>>>> 
>>>> That is just the opposite of what Nikolaus was getting. Here it jumps
>>>> down instead of up and then stays stable.
>>> 
>>> Oops!!!
>>> 
>>>> My conclusion: the measurements are buffered somewhere/somehow and we
>>>> are getting something old here.   
>>> 
>>> Or there is some other bug in the code...
>>> 
>>> I have tried to understand the code a little, but it just reads some
>>> registers and translates ADC values to Celsius.
>>> 
>>> And, there is some feature to handle temperature trends. This seems
>>> to read multiple registers.
>>> 
>>> And in some case it may not be possible to read a value and then
>>> it returns a previous one.
>>> 
>>> Hm. What if that situation is true for the first read? But the
>>> previous is random?
>>> 
>>> Another test: I also ran the first cat command in a loop:
>>> 
>>> for i in 1 2 3 4 5 6 7 8 9 10
>>> do
>>> 	cat /sys/devices/virtual/thermal/thermal_zone0/temp
>>> 	sleep 0.1
>>> done
>>> 
>> some more testing here:
>> root@(none):/# cpufreq-info 
>> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
>> Report errors and bugs to cpufreq at vger.kernel.org, please.
>> analyzing CPU 0:
>> driver: cpufreq-dt
>> CPUs which run at the same hardware frequency: 0
>> CPUs which need to have their frequency coordinated by software: 0
>> maximum transition latency: 300 us.
>> hardware limits: 300 MHz - 1000 MHz
>> available frequency steps: 300 MHz, 600 MHz, 800 MHz, 1000 MHz
>> available cpufreq governors: conservative, userspace, powersave, ondemand, performance
>> current policy: frequency should be within 300 MHz and 1000 MHz.
>>                 The governor "ondemand" may decide which speed to use
>>                 within this range.
>> current CPU frequency is 600 MHz (asserted by call to hardware).
>> cpufreq stats: 300 MHz:94.86%, 600 MHz:2.49%, 800 MHz:0.92%, 1000 MHz:1.73%  (51)
>> root@(none):/# cd /sys/bus/platform/drivers/omap_uart/
>> root@(none):/sys/bus/platform/drivers/omap_uart# for name in */power/autosuspend_delay_ms ; do echo 3000 >$name ; done
>> root@(none):/sys/bus/platform/drivers/omap_uart# 
>> root@(none):/sys/bus/platform/drivers/omap_uart# cd
>> root@(none):/# sleep 16 ; cat /sys/class/power_supply/bq27000-battery/current_now
>> 32844
>> root@(none):/# for [...] 8 9 ; do cat /sys/devices/virtual/thermal/thermal_zone0/temp ; sleep 0.1 ; done
>> 58500
>> 47000
>> 47000
>> 48500
>> 48500
>> 48500
>> 48500
>> 48500
>> 48500
>> 48500
>> root@(none):/# 
>> 
>> That is with
>> commit d71fc15bce98abf24226f451f192df07ab9d089b
>> We are seeing 1 GHz here on the letux3704 without any boost.
> 
> I have studied the TRM, and we can also read the bandgap sensor directly
> through devmem2. That indeed indicates some strange effect caused by the driver:
> 
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6fe7000.
> Value at address 0x48002524 (0xb6fe7524): 0x38
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6f09000.
> Value at address 0x48002524 (0xb6f09524): 0x38
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6fbd000.
> Value at address 0x48002524 (0xb6fbd524): 0x38
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 58500
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6fb8000.
> Value at address 0x48002524 (0xb6fb8524): 0x34
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6fc9000.
> Value at address 0x48002524 (0xb6fc9524): 0x34
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6f90000.
> Value at address 0x48002524 (0xb6f90524): 0x34
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6f1a000.
> Value at address 0x48002524 (0xb6f1a524): 0x34
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# 
> 
> This time I had the temperature also going down!?
> 
> Well, this was a boot where not all modules were loaded and
> there is no display (haven't checked why).
> 
> It looks as if the value only updates after reading the thermal_zone
> twice:
> 
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6f48000.
> Value at address 0x48002524 (0xb6f48524): 0x37
> root at letux:~# ./temperatures 
> Sat Sep 14 07:50:13 UTC 2019 57° 800MHz
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6f88000.
> Value at address 0x48002524 (0xb6f88524): 0x37
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 57000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524
> /dev/mem opened.
> Memory mapped at address 0xb6fe6000.
> Value at address 0x48002524 (0xb6fe6524): 0x34
> root at letux:~# 
> 
> And it looks as if my ./temperatures script creates processor load that
> raises the temperature, which then drops again quickly:
> 
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# ./temperatures 
> Sat Sep 14 07:52:16 UTC 2019 52° 800MHz
> root at letux:~# ./temperatures 
> Sat Sep 14 07:52:19 UTC 2019 57° 800MHz
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 57000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 53500
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 53500
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~# cat /sys/devices/virtual/thermal/thermal_zone0/temp
> 52000
> root at letux:~#
> 
> The temperature of the PoP surface, measured with a thermal camera, is approx. 30 °C.
> 
> This brings me to another idea: we could also read the value from
> 0x48002524 in U-Boot, maybe after manually triggering a conversion.
> 
> Well, of course all kernel tests should be done with thermal throttling
> turned off so that the governors do not read the bandgap sensor all
> the time.

After experimenting with the 90 °C thermal regulation I think the
bandgap sensor itself is quite good. It shows no visible influence
from the power supply and reports reasonable values.

But there is an observation that may explain part of the behaviour:

The sensor is very fast. It may be faster than running

	cat /sys/devices/virtual/thermal/thermal_zone0/temp

twice in succession.

And temperature changes on chip may also be very quick. Maybe
less than 0.1 seconds if we step from OPP50 to OPP1G.

So we must make sure that the heat generated inside the chip
is constant before we can say the sensor is reading wrong values.
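
To compare readings under a constant load, the zone can be sampled at a fixed interval. The following is only a minimal sketch: format_mdeg and sample_temp are made-up helper names, and it assumes the sysfs file holds non-negative millidegrees as in the transcripts above.

```shell
#!/bin/sh
# Sketch only: format_mdeg and sample_temp are hypothetical helper names.

# Convert non-negative millidegrees (e.g. 58500) to a string like "58.5".
format_mdeg() {
	echo "$(( $1 / 1000 )).$(( ($1 % 1000) / 100 ))"
}

# sample_temp FILE COUNT DELAY
# Read FILE COUNT times, DELAY seconds apart, printing degrees Celsius.
sample_temp() {
	file=$1 count=$2 delay=$3
	n=0
	while [ "$n" -lt "$count" ]
	do
		format_mdeg "$(cat "$file")"
		n=$((n + 1))
		sleep "$delay"
	done
}
```

It could be called as sample_temp /sys/devices/virtual/thermal/thermal_zone0/temp 10 0.1, ideally after fixing the cpufreq governor (for example to powersave) so that the OPP does not change between samples.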

Now, what about the initial value issue?

Let's assume a read always reports the value of the previous
conversion. Maybe because the driver uses the wrong polarity for
starting a conversion or for checking the end-of-conversion
flag/interrupt, or something similar.

This would mean that a first conversion is likely started during
probing. Its value sits around until we do the first

	cat /sys/devices/virtual/thermal/thermal_zone0/temp

Then we do not get the temperature after running for a while
but the one from boot time.
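
The hypothesis can be illustrated with a toy model. This is not the driver code, just a sketch in which every read returns the result of the previous conversion and then latches a new one. The function names and the "actual" temperatures passed in are invented for illustration.

```shell
#!/bin/sh
# Toy model of the "one conversion behind" hypothesis, not the driver code.

latched=0	# result left behind by the most recent conversion

# stale_read ACTUAL
# Print the previously latched value, then latch ACTUAL as the new result.
stale_read() {
	echo "$latched"
	latched=$1
}

# simulate BOOT_TEMP T1 T2 ...
# BOOT_TEMP is converted during probe; each T is the real temperature
# at the time of one later read.
simulate() {
	latched=$1
	shift
	for actual in "$@"
	do
		stale_read "$actual"
	done
}
```

With simulate 58500 47000 47000 48500 the model prints 58500, 47000, 47000: the first read shows the probe-time value and every later read lags one step behind, which has the same shape as the sequence Andreas reported.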

I cross-checked that idea a little and indeed the first value
after a "reboot" was higher than the first after a cold boot.
So something must have remembered that the OMAP was already
running before...

This could explain why I usually get lower values on the first
read, which then jump up, and why Andreas could get a value going down.

It would also explain why there is no difference between omap3430
(BeagleBoard) and dm3730 (GTA04), even with older kernels (if the bug
is older than 4.12, which was the earliest SD card I had found in the
BeagleBoard). We could try to do a cross-check with 3.12 or even
the 3.7 kernel (if user-space is still compatible enough to boot).

Finally, if we repeatedly read the sensor we won't notice a big
difference in seeing the current or the previous value because
initial temperature changes are quick but then the heat distributes
slowly over the chip and the plastics and PCB etc.

Maybe we can verify whether the code has such a bug, for example with
my U-Boot readout idea (which rules out any influence from drivers).
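
For comparing raw register reads with the sysfs value under Linux, as in the devmem2 transcripts quoted above, the raw value can be pulled out of the devmem2 output. A minimal sketch: raw_from_devmem2 is a made-up helper name, and the optional mask for isolating the temperature field is an assumption that has to be checked against the TRM for the SoC in question.

```shell
#!/bin/sh
# Sketch only: raw_from_devmem2 is a hypothetical helper name.

# raw_from_devmem2 [MASK]
# Read devmem2 output on stdin and print the register value in decimal.
# MASK (default: all 32 bits) can isolate a bit field; which bits hold the
# temperature is an assumption to verify in the TRM.
raw_from_devmem2() {
	mask=${1:-0xffffffff}
	hex=$(sed -n 's/^Value at address .*: \(0x[0-9A-Fa-f]*\).*$/\1/p')
	[ -n "$hex" ] || return 1
	echo $(( $hex & $mask ))
}
```

Usage would be something like /usr/bin/arm-linux-gnueabihf/devmem2 0x48002524 | raw_from_devmem2. For the values seen above, 0x38 prints 56, 0x37 prints 55 and 0x34 prints 52.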

BR,
Nikolaus


