[Letux-kernel] New LetuxOS Kernels - strcmp(NULL)

H. Nikolaus Schaller hns at goldelico.com
Sun Jun 24 12:54:38 CEST 2018


> On 24.06.2018 at 11:38, Andreas Kemnade <andreas at kemnade.info> wrote:
> 
> On Sun, 24 Jun 2018 09:52:53 +0200
> "H. Nikolaus Schaller" <hns at goldelico.com> wrote:
> 
>> Hi,
>> 
>>> On 24.06.2018 at 09:11, Andreas Kemnade <andreas at kemnade.info> wrote:
>>> 
>>> On Sat, 23 Jun 2018 12:13:11 +0200
>>> "H. Nikolaus Schaller" <hns at goldelico.com> wrote:
>>> 
>>> 
>>>> 
>>>> So the issue is that "backlight_pins_pinmux" are searched for a NULL record before they
>>>> are properly stored. Or someone punches a NULL into the radix_tree.
>>>> 
>>>> Hope this sheds some new light on the problem.
>>>> 
>>> hmm, the next question is whether the NULL is *always* there, so even
>>> in the successful boots. Is that still with mainline  sources + minimal
>>> set of things?
>>> 
>>> Can we infer any bad order of module loading from that output?
>>> Probably the thing that inserts the NULL should be loaded last for
>>> successful boots or first for failed boots
>>> 
>>> Or should we remove stuff from dtb piece by piece to see if that helps?  
>> 
>> The problem is that you can remove almost anything and "it helps". So
>> it is very fragile to have a system that runs into this bug. If
>> you change a little piece, the problem disappears but you don't know if
>> it is really the reason or just a factor that enables/disables the real
>> problem to appear/disappear.
> 
> I think we can do it the other way round: remove driver stuff whose
> removal does not make the problem disappear, and consider it not guilty.

Indeed, there are some.

I have identified these:
omap3isp
omapdss

> 
> 
>> 
>> For example you can blacklist some modules and it is gone. You
> 
> What do you mean: it is gone? How many times do you test to consider it
> gone? 

Well, I have a scenario where it shows up in 4 to 5 out of 5 tests.
And if I make a change and it shows up in 0 of 3, I consider it "gone".

This is not statistically valid, only a good guess.
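
To put a rough number on the 4-to-5-of-5 versus 0-of-3 comparison above, here is a minimal sketch. It assumes that every boot fails independently with the same probability, and it takes 0.8 as the baseline failure rate (an assumption read off "4 to 5 out of 5"; five boots pin that rate down only loosely). The function name is made up for illustration.

```python
from math import comb

def prob_at_most(k, n, p):
    """Probability of at most k failing boots in n boots, assuming each
    boot fails independently with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Assumed baseline: about 4 of 5 boots hit the strcmp(NULL).
# If a change had no effect at all, how likely is a clean 0-of-3 run?
for p in (0.8, 0.5):
    print(f"P(0 of 3 fail | p={p}) = {prob_at_most(0, 3, p):.3f}")
```

Under these assumptions a clean 0-of-3 run is unlikely to be pure chance if the failure rate really is around 0.8, but it becomes much less convincing if the true rate is lower than the five baseline boots suggest. It also says nothing about whether the change touched the cause or only shifted the timing.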

I have found another printk that makes it go away:

@@ -390,6 +390,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)

+       printk("%s: driver %s\n", __func__, drv->name);

With it I can see the probe order (some more drivers are probed without
otherwise showing up in the log), but the strcmp(NULL) no longer occurs.

> The next question is whether we need really that much concurrency here.
> Can we do:
> 
> create_random_list_of_modules
> log that list
> while(not_everything_loaded) {
>  insmod first_module_in_list.ko && remove_module_from_list
>  sleep ?
> }
> 
> will we get some pattern? Or maybe just get a full list of drivers
> which need not be loaded for the problem to occur.

Hm. The outcome is difficult to predict, especially since deferred
driver probing makes the probe order partly independent of the list order.
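
The loop proposed above could be sketched roughly as follows. This is only an illustration: it uses modprobe instead of insmod (an assumption, and modprobe pulls in dependencies by itself, which changes the experiment), the module names in the demo are just the ones mentioned in this thread and may not match the real .ko names, and the function names are made up. A module that fails to load is retried in a later round instead of blocking the list.

```python
import random
import subprocess
import time

def make_load_order(modules, seed):
    """Return a shuffled copy of modules; the same seed reproduces the order."""
    order = list(modules)
    random.Random(seed).shuffle(order)
    return order

def modprobe(name):
    """Real loader (needs root): True if modprobe reports success."""
    return subprocess.run(["modprobe", name]).returncode == 0

def load_in_order(order, run, delay_s=0.0, max_rounds=3):
    """Load modules one at a time via run(name); retry failures in later rounds.
    Returns (loaded, still_pending)."""
    pending, loaded = list(order), []
    for _ in range(max_rounds):
        still_pending = []
        for name in pending:
            (loaded if run(name) else still_pending).append(name)
            time.sleep(delay_s)  # the "sleep ?" from the proposal
        pending = still_pending
        if not pending:
            break
    return loaded, pending

if __name__ == "__main__":
    # Hypothetical list; log the seed so a failing order can be replayed.
    seed = 20180624
    order = make_load_order(["omap3isp", "omapdss", "generic-adc-battery"], seed)
    print("seed", seed, "order", order)
    # Dry run: print instead of loading. Pass modprobe as run to load for real.
    load_in_order(order, run=lambda name: print("modprobe", name) is None)
```

Logging the seed with each boot result would at least allow replaying an order that failed. It controls only the load order, though; with deferred probing the actual probe order can still differ from run to run.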

> 
>> 
>> Main suspect is the generic-adc-battery driver (although I remember having seen the
>> strcmp(NULL) once even when blacklisting it).
>> 
>> The reason why it is my main suspect is that the message iio_charge:-747
>> seems to almost always come before platform backlight: Retrying from deferred list
>> (AFAIR in failing and non-failing cases).
>> 
>> But they did not yet confess :)
>> 
>> And not to forget: they may still be innocent (if my latest theory is wrong)...
>> 
>> So I'd suggest to play around with the generic-adc-battery module/driver/dtb.
>> 
> ok, module blacklisted. Will boot that setup the whole afternoon.

Ok. Let's see, although the interpretation is difficult: blacklisting it
may simply change memory allocation and timing. It won't prove that the
bug is in the blacklisted driver :(

BR,
Nikolaus


