[Letux-kernel] New LetuxOS Kernels - strcmp(NULL)

H. Nikolaus Schaller hns at goldelico.com
Sun Jun 24 09:52:53 CEST 2018


Hi,

> On 24.06.2018 at 09:11, Andreas Kemnade <andreas at kemnade.info> wrote:
> 
> On Sat, 23 Jun 2018 12:13:11 +0200
> "H. Nikolaus Schaller" <hns at goldelico.com> wrote:
> 
> 
>> 
>> So the issue is that "backlight_pins_pinmux" are searched for a NULL record before they
>> are properly stored. Or someone punches a NULL into the radix_tree.
>> 
>> Hope this sheds some new light on the problem.
>> 
> hmm, the next question is whether the NULL is *always* there, so even
> in the successful boots. Is that still with mainline sources + minimal
> set of things?
> 
> Can we infer any bad order of module loading from that output?
> Probably the thing that inserts the NULL should be loaded last for
> successful boots or first for failed boots
> 
> Or should we remove stuff from dtb piece by piece to see if that helps?

The problem is that you can remove almost anything and "it helps". So
a setup that reproduces this bug is very fragile. If you change one
little piece, the problem disappears, but you don't know whether that
piece is really the cause or just a factor that makes the real
problem appear or disappear.

For example, you can blacklist some modules and it is gone. You
can blacklist others and it is also gone. You can add a printk
to some pinctrl core code and it is gone. How should this be interpreted?
As a driver problem or a pinmux core problem?

And it may be completely invisible in mainline because the trigger
is missing.

So it may well be that if you play with changing the DTB or the sequence
of module loads (which is also changed by deferred probing), you never
see the bug, and that it only shows up in the full Letux OS setup.

Another hint that the DTB has no significant influence is that I can
see the bug on GTA04 and on Pyra, with almost the same symptom
but quite different DTBs, and on OMAP3 vs. OMAP5.

At the moment my theory is:

* pinmux allocates memory for group->pins
* some other driver allocates memory
* pinmux does the devm_kzalloc for the struct group_desc and stores it in the radix tree,
  sets group->name and group->pins
* some other driver does an out-of-bounds write with NULL
* and exactly hits the group->name

So the repeatable randomness and the fragile dependency on modules,
drivers and probing order come from how the memory allocations
interleave and which memory addresses are assigned.

If this "some other driver" runs with slightly different timing,
the sequence would be:

* pinmux allocates memory for group->pins
* pinmux does the devm_kzalloc for the struct group_desc and stores it in the radix tree,
  sets group->name and group->pins
* some other driver allocates memory
* other driver does an out-of-bounds write with NULL
* does not hit the group->name (but maybe something else which isn't noticed that easily
  but leads to other spurious problems)
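
The two orderings above can be sketched in plain userspace C. This is only
a toy model of the theory, not kernel code: the "heap" is a hypothetical
bump allocator over a flat array of slots (so the overrun stays inside the
array and is well-defined), the slot counts are made up, and
arena_alloc(), other_driver_clear() and name_is_clobbered() are names
invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, all sizes are assumptions chosen for illustration. */
#define ARENA_SLOTS 16  /* the whole "heap", in pointer-sized slots */
#define PINS_SLOTS  2   /* allocation for group->pins */
#define GROUP_SLOTS 3   /* struct group_desc: name, pins, num_pins */
#define OTHER_SLOTS 2   /* buffer owned by "some other driver" */

struct arena {
    uintptr_t slot[ARENA_SLOTS];
    size_t next;        /* first free slot (bump allocator) */
};

/* Hands out n consecutive slots; returns the index of the first one. */
static size_t arena_alloc(struct arena *a, size_t n)
{
    size_t start = a->next;
    assert(start + n <= ARENA_SLOTS);
    a->next = start + n;
    return start;
}

/* The suspected bug: clears one slot more than the buffer owns. */
static void other_driver_clear(struct arena *a, size_t buf)
{
    assert(buf + OTHER_SLOTS < ARENA_SLOTS);    /* stay inside the model */
    for (size_t i = 0; i <= OTHER_SLOTS; i++)   /* off-by-one: should be < */
        a->slot[buf + i] = 0;
}

/* Returns 1 if the group's name slot ends up NULL, 0 otherwise. */
int name_is_clobbered(int other_allocates_before_group)
{
    static const char name[] = "backlight_pins_pinmux";
    struct arena a;
    size_t pins, group, other;

    for (size_t i = 0; i < ARENA_SLOTS; i++)
        a.slot[i] = UINTPTR_MAX;                /* mark everything non-NULL */
    a.next = 0;

    pins = arena_alloc(&a, PINS_SLOTS);
    if (other_allocates_before_group) {
        other = arena_alloc(&a, OTHER_SLOTS);   /* first sequence */
        group = arena_alloc(&a, GROUP_SLOTS);
    } else {
        group = arena_alloc(&a, GROUP_SLOTS);   /* second sequence */
        other = arena_alloc(&a, OTHER_SLOTS);
    }

    a.slot[group + 0] = (uintptr_t)name;        /* group->name */
    a.slot[group + 1] = (uintptr_t)pins;        /* group->pins */
    a.slot[group + 2] = PINS_SLOTS;             /* group->num_pins */

    other_driver_clear(&a, other);              /* the stray NULL write */

    return a.slot[group] == 0;
}
```

In this model name_is_clobbered(1) reports the NULL name and
name_is_clobbered(0) does not, although the buggy write is identical in
both cases. Only the allocation order decides what gets hit.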

I have started to interrogate some suspects for doing this out-of-bounds access:

[    6.648223] omap_hdq 480b2000.1w: OMAP HDQ Hardware Rev 0.5. Driver in Interrupt mode
[    6.728515] i2c 1-0030: Retrying from deferred list
[    6.738250] w1_master_driver w1_bus_master1: Attaching one wire slave 01.000000000000 crc 3d
[    6.747406] i2c 1-0030: Retrying from deferred list
[    6.798980] i2c 1-0030: Retrying from deferred list
[    6.812499] (NULL device *): hwmon: 'bq27000-battery' is not a valid name attribute, please fix
[    6.840545] bq27xxx_battery_settings: power_supply_get_battery_info failed ret=-1088380908
[    6.882781] pwm-backlight backlight: backlight supply power not found, using dummy regulator
[    6.966156] (NULL device *): hwmon: 'gta04-battery' is not a valid name attribute, please fix
[    6.994628] wwan_on_off_init: wwan_on_off_init
[    7.014556] pps_core: LinuxPPS API ver. 1 registered
[    7.037109] i2c 1-0030: Retrying from deferred list
[    7.042602] iio_charge:-747
[    7.048522] platform backlight: Retrying from deferred list
[    7.054901] pinctrl_generic_get_group_name: group>name is NULL

I.e. all drivers being probed around the same time.
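
For context on why a NULL group->name is fatal: it only hurts once a name
lookup passes it to strcmp(). A minimal userspace sketch of a guarded
lookup follows, assuming a simplified struct group_desc and a hypothetical
find_group() helper with made-up return codes (this is not the pinctrl
core's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct group_desc (assumption). */
struct group_desc {
    const char *name;
    const int *pins;
    int num_pins;
};

/*
 * Hypothetical helper: returns the index of the group called `wanted`,
 * -1 if no group matches, or -2 if an entry with a NULL name is met
 * before a match (the corrupted case).
 */
int find_group(const struct group_desc *groups, int count, const char *wanted)
{
    for (int i = 0; i < count; i++) {
        if (groups[i].name == NULL)
            return -2;  /* unguarded, this would be strcmp(NULL, wanted) */
        if (strcmp(groups[i].name, wanted) == 0)
            return i;
    }
    return -1;
}
```

Such a guard only hides the symptom, of course. Whatever wrote the NULL
is still there.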

Main suspect is the generic-adc-battery driver (although I remember having seen the
strcmp(NULL) once even when blacklisting it).

The reason why it is my main suspect is that the message iio_charge:-747
seems to almost always come before platform backlight: Retrying from deferred list
(AFAIR in failing and non-failing cases).

But they have not confessed yet :)

And not to forget: they may still be innocent (if my latest theory is wrong)...

So I'd suggest playing around with the generic-adc-battery module/driver/dtb.

BR,
Nikolaus


