[Letux-kernel] 1.5GHz problems

H. Nikolaus Schaller hns at goldelico.com
Fri Jul 29 16:00:24 CEST 2016

Hi all,
we know that the Pyra CPU boards (at least the 3 units we have running)
have problems when we use an OPP that allows 1.5 GHz. The kernel suddenly
hangs without obvious or repeatable error messages.

At 1.0 GHz (or at 1.5 GHz with the second core disabled) the OMAP5432 works.

To get more insight I have done some tests.

* Board M4+C19 w/o display
* Kernel: letux-4.7.0
* 500MHz + 750MHz OPP runs according to default DT
* 1.0GHz OPP in DT modified to check what happens
* temperature driven by /root/high-load (prints 3 temperature hwmon values every second)
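
The /root/high-load script itself is not shown in this mail. As a minimal sketch of its temperature logging part, assuming the sensors are exposed through the standard hwmon sysfs files (temp*_input, in millidegrees Celsius), something like the following could be used. The function names are hypothetical.

```python
import glob
import os
import time


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {sensor_path: degrees_celsius} for every temp*_input under base."""
    temps = {}
    pattern = os.path.join(base, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # hwmon reports temperatures in millidegrees Celsius
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # a sensor may be unreadable or return garbage; skip it
            continue
    return temps


def log_temps(interval=1.0, count=None):
    """Print all readings every `interval` seconds (forever if count is None)."""
    n = 0
    while count is None or n < count:
        values = read_hwmon_temps()
        print(" ".join(f"{t:.1f}" for t in values.values()))
        time.sleep(interval)
        n += 1
```

Calling log_temps() in a second shell while the load generator runs gives one line of readings per second, which also shows the last temperature seen before a hang.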

A) 1.0GHz at 1060000uV
kernel boot:	ok
high-load:	runs unlimited

This is what has been working since March 2016.

B) 1.1GHz at 1060000uV
kernel boot:	ok
high-load:	reaches 97°C after 25 min
cpufreq-info:	96%@1.1 GHz

I remember that the temperature was ~92°C at 1.0 GHz, so this drives
the temperature up by 5 K.
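
The "96%@1.1 GHz" figure is what cpufreq-info reports. As a sketch of how such a residency percentage can be computed directly, assuming the kernel has cpufreq statistics enabled and exposes the usual time_in_state file (one "frequency_in_kHz time" pair per line), the helper names below are hypothetical:

```python
def residency_percent(text):
    """Parse cpufreq 'time_in_state' content into {freq_khz: percent_of_time}."""
    times = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        times[int(parts[0])] = int(parts[1])
    total = sum(times.values())
    if total == 0:
        return {freq: 0.0 for freq in times}
    return {freq: 100.0 * ticks / total for freq, ticks in times.items()}


def read_residency(cpu=0):
    """Read and convert the statistics of one CPU (path is the usual sysfs location)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/stats/time_in_state"
    with open(path) as f:
        return residency_percent(f.read())
```

A residency clearly below 100% at the highest OPP under full load means something (governor or thermal throttling) is pulling the clock down part of the time.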

C1) 1.3GHz at 1060000uV
kernel boot:	hangs during initial boot

Note:	"hang" means the CPU no longer responds on the serial interface and the status LEDs stop blinking

Obviously the voltage is too low for 1.3 GHz.

C3) 1.0GHz at 1060000uV + 1.3GHz at 1150000uV
kernel boot:	ok
high-load:	hangs after 15 seconds, having reached 63°C

repeated boot attempts:
C3a) high-load:	hangs after some seconds at 64°C

C3b) high-load:	runs >15 min
				ramps up to 100-103°C, then suddenly jumps down to 82-95°C.
				after ca. 10 sec it reaches >100°C again.
				As if some overtemperature protection throttles the CPU clock
PCB temperature:	80°C
cpufreq-info:		just 73%@1.3 GHz

C3c) high-load:	hangs after 15 sec at 65°C

This means it does not run 100% reliably at this OPP, and the effect
seems to be temperature dependent. But when the OMAP does run, it
hits a temperature limit which triggers some overtemperature
protection built into the kernel.
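
To put the thermal observations into perspective, here is a rough first-order estimate of how much more dynamic power each tested OPP dissipates compared to test A. It assumes the common CMOS approximation that dynamic power scales with f * V^2 and ignores leakage (which grows with voltage and temperature), so it is a sketch and not a measurement. The function name is hypothetical.

```python
def relative_dynamic_power(freq_ghz, microvolts,
                           ref_freq_ghz=1.0, ref_microvolts=1060000):
    """Dynamic power relative to the reference OPP, assuming P ~ f * V^2."""
    return (freq_ghz / ref_freq_ghz) * (microvolts / ref_microvolts) ** 2


# OPPs taken from tests B, C3 and E1; the reference is test A (1.0GHz at 1060000uV)
for label, freq, uv in [("B", 1.1, 1060000),
                        ("C3", 1.3, 1150000),
                        ("E1", 1.5, 1250000)]:
    print(f"{label}: {relative_dynamic_power(freq, uv):.2f}x")
```

Under this assumption test C3 already dissipates roughly 1.5 times the dynamic power of test A, and test E1 about twice as much, which fits the picture of the 1.3 GHz OPP running into the thermal limit.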

D) 1.0GHz at 1060000uV + 1.5GHz at 1150000uV
kernel boot:	hangs after 4.3-4.4 sec (reproduced 3 times)

E1) 1.5GHz at 1250000uV:
kernel boot:	hangs after 6.3-6.5 sec (reproduced 3 times)

E2) 1.5GHz at 1300000uV (close to upper limit according to "Data Manual Operating Condition Addendum Version 0.6"):
kernel boot:	hangs at 6.6 sec

F) test E1 + OMAP5_ERRATA_801819 enabled
kernel boot:	hangs again at 6.4 sec (in "Synthesizing the initial hotplug events...")

It was not possible to boot in dual-core 1.5 GHz mode. What is very strange and unexpected
is that the kernel repeatedly hangs at 6.3-6.6 seconds, as if something in the code that
runs at that point increases the risk of a hang (deadlock).

So it is either a kernel software issue (something critical runs faster
at 1.5 GHz and results in a deadlock), or the voltage is still too low. But I did not
dare to increase it further since that may destroy the valuable CPU board...

G) [Kernel] omap5 mpu bridge dividers
Matthijs recently reported a potential issue here with the above subject line.

A simple test would be to boot at 1.5GHz and then run

	omapconf write 0x4A004320 0x06000001

But I can't even boot at 1.5GHz so I have no chance to test.

Summary / Discussion:
* it looks as if 1 GHz (or single core 1.5 GHz) works without problems
* for 1.3 GHz we have to increase CPU voltage or the kernel hangs
* at 1.5 GHz I wasn't able to boot even with increased CPU voltage

The data sheets hint at using AVS and ABB.

"AVS and ABB Requirements
Adaptive Voltage Scaling (AVS) and Adaptive Body Biasing (ABB) are required on most of the VDD_* domains as defined in Table 4-7"

Table 4-7 indirectly marks AVS and ABB as required for all operating points >1.0 GHz.

"	• The AVS Voltages are device-dependent, voltage domain-dependent, and OPP-dependent. They must be read from the CONTROL_STD_FUSE_OPP_VDD Registers in the Control Module Section of the TRM."

From this I read that every single OMAP chip is slightly different and that TI measures these differences during production.

Reading and applying these values should be done by the AVS drivers.

We did not have CONFIG_POWER_AVS_OMAP enabled, only CONFIG_POWER_AVS.

But although I changed that and did some additional tests, it made no difference.
And AVS seems to be incomplete and non-operational anyway:

[    4.977605] sr_init: No PMIC hook to init smartreflex
[    4.982922] driver_register 'smartreflex'
[    4.987747] sr_init: platform driver register failed for SR

... and the kernel hangs again at 6.43 sec. As if there is a watchdog timer in the OMAP that is only running in 1.5GHz mode...

So I think a hardware issue is quite unlikely, especially as the 1.5 GHz setup always hangs at
the same 6.3-6.6 seconds after Linux starts.

And before I spend more weeks chasing hard-to-pin-down hardware issues,
I would like to hear kernel specialists' opinions first.

BR and thanks,
