[Letux-kernel] Pandora: XUDF and other issues

H. Nikolaus Schaller hns at goldelico.com
Mon Feb 21 13:37:54 CET 2022


Hi,


> On 21.02.2022 at 02:42, Grond <grond66 at riseup.net> wrote:
> 
> On Sun, Feb 20, 2022 at 02:12:40PM +0100, H. Nikolaus Schaller wrote:
>> Hi,
>> 
>> 
>>> On 20.02.2022 at 08:21, Grond <grond66 at riseup.net> wrote:
>>> 
>>> On Sat, Feb 19, 2022 at 05:25:00PM +0100, H. Nikolaus Schaller wrote:
>>>> Hi Grond,
>>>> 
>>>> 
>>>> Please can you try to find out if it depends on USB charging, backlight
>>>> or similar effects on your unit?
>>> I will try. But there are a few caveats with this. We appear to be
>>> experiencing two different sets of symptoms. For you, the bus timeout/
>>> XUDF issue occurs when power consumption goes above some threshold. On
>>> my unit it appears to be caused by something totally jamming the SCL
>>> line (it is completely pulled down starting at boot time, and never goes
>>> high). It is constant (100% of bus transactions fail) and happens
>>> reliably across kernel versions. Right now, I have the battery pulled
>>> for a few days in the probably vain hope that this changes anything, but
>>> when I'm done with that experiment, I'll give this a try.
>> 
>> Well, what I could suspect is that the bq27xxx fuel gauge chip is running
>> outside of its specs in some cases. This may be more or less harmful on
>> some devices.
>> 
>> Looks like I have to think about and study schematics/data sheets...
>> 
>> It could also be that the bq27xxx on your unit is broken. Or wrongly
>> programmed. I remember there was a tool to write it - but that requires
>> i2c to work.
> Yeah, I'm reasonably certain that *something* on that bus is broken. The
> trick is just figuring out what without unsoldering each component one
> at a time...

Indeed...

What I did was run the Pandora (with the high-load script) from
battery until it turned off. It took a little more than 7 hours and turned off at
2.7V battery voltage (which is a little late).

There was NO XUDF message.

Then I plugged in the AC wall charger and started it up. Experimenting with
USB charging and AC charging showed some XUDF errors in both cases...

So it has something to do with an almost empty battery. At least on my
device.

> 
>> 
>> Ah, this brings me to another idea. Can you break in the u-boot console
>> and use the i2c commands to study if u-boot can communicate on
>> this i2c bus?
> I do have a serial console, so I will give this a try. I don't expect
> there to be any difference, but as you pointed out, it is always better
> to have data than expectations...

Yes, it would help to gain some insight into whether it is a kernel
or a hardware issue. I have had that question several times over the years.

> 
>> 
>>> 
>>> What happens on your unit when the battery is removed and it is powered
>>> via the barrel jack? If this is related to the battery being unable to
>>> provide enough power this should presumably keep the symptoms from
>>> manifesting by running the whole system in constant voltage mode...
>> 
>> Haven't tried yet but will do asap. Most likely it will behave as with
>> a 100% charged battery. Expectations are good, but experiments are better :)

Finally I booted with AC connected and then removed the battery.
This made the Pandora turn off. But reboot worked. Interestingly the XUDF
messages already came during boot, before login: and without /root/high-load.

Anyways I looked up what code emits the XUDF message:

https://elixir.bootlin.com/linux/latest/source/drivers/i2c/busses/i2c-omap.c#L956

This indicates a workaround for a silicon erratum (i462) of the omap3430...

It is not in the official omap3530 errata:
https://www.ti.com/lit/er/sprz278f/sprz278f.pdf?ts=1645361405177

But there seems to be a similar erratum: Advisory 3.1.1.155

> Details: The I2C is configured as master transmitter. After serving a XRDY/XDR interrupt (FIFO empty), from the data sent on OCP, one, two or several bytes sent from the memory to the I2C interface are lost. The bytes lost are always the first transmitted on the OCP, when serving the XRDY/XDR interrupts. The occurrence of the bug is related to the coincidence of the moment when data is sent on the OCP and the moment when the most significant bit of a byte is sent on the I2C, always when starting serving the XRDY/XDR interrupt.
> 
> Ideally, no data should be lost when transmitted from the OCP to the I2C. However, one, two or several bytes at the beginning of a transmission from the OCP to I2C are lost, if the moment when they are put on the OCP coincides with the transmission of the most significant bit of a byte on the I2C.
> 
> Workaround(s): A workaround exists for the interrupt mode of operation. Before serving the XRDY/XDR interrupts, until also XUDF status bit is set. This marks the clearance of the internal shift register and polls the Status Register – XUDF bit, from the local host, after receiving an XRDY or XDR Note. For the data transmission using DMA, there is no available software workaround.
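
To make the workaround a bit more concrete, here is a rough sketch of the polling idea in Python. It is not the driver code: `read_stat` is a hypothetical callable standing in for a status register read, the return strings are made up for illustration, and the bit positions are what I believe the Linux i2c-omap driver uses, so please treat them as assumptions and check them against the TRM.

```python
# Bit positions assumed from the Linux i2c-omap driver; verify against the TRM.
STAT_XUDF = 1 << 10  # transmit underflow: shift register has drained
STAT_NACK = 1 << 1   # no acknowledge from the slave
STAT_AL = 1 << 0     # arbitration lost


def wait_for_xudf(read_stat, max_polls=10000):
    """Poll the status word until XUDF is set.

    read_stat is a hypothetical callable returning the current status word.
    Returns "ok" once XUDF is seen, "abort" on NACK or arbitration loss,
    and "timeout" if XUDF never shows up within max_polls reads.
    """
    for _ in range(max_polls):
        stat = read_stat()
        if stat & STAT_XUDF:
            return "ok"
        if stat & (STAT_NACK | STAT_AL):
            return "abort"
    return "timeout"
```

As far as I can tell, the "timeout" branch corresponds to the situation in which the driver prints the XUDF message: the transmit FIFO should only be refilled after XUDF has been seen, and here it never shows up in time.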


It may be an effect influenced by power-supply noise. Maybe some flip-flop
whose setup and hold times are not fully met.

But this should not make your i2c completely stuck.

>> 
>>> 
>>>> 
>>>> So it seems to be nothing we can solve by bisecting for a kernel bug.
>>>> 
>>>> I just started to cross-check letux-5.17-rc4. At the moment it only shows
>>>> the
>>>> 
>>>> [  330.002105] ti-soc-thermal 48002524.bandgap: eocz timed out waiting high
>>>> 
>>>> This does not appear to depend on input_current_limit or anything else.
>>>> But it also occurs in a high-load situation.
>>>> 
>>>> This looks like a "real bug" - which hopefully can be bisected more
>>>> repeatably.
>> 
>> Yes, this was easily bisected and seems to be this issue:
>> 
>> # first bad commit: [514cbabb01422d501d533a6495b924e4c22d4822] thermal: ti-soc-thermal: Simplify polling with iopoll
>> 
>> What I suspect here is that the minimal waiting time mentioned in the commit
>> message may not be enough for the omap3530-600MHz models, while it is
>> sufficient for all other OMAP variants.
> That does sound likely.

I have looked into this and the result makes me smile.
There was no error test in the old code.

So likely the timeout was there for a long time but did not become visible before.
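
A small Python sketch of why the same timeout can stay invisible for years and then suddenly show up in the log. This is only a model of the idea, not the ti-soc-thermal code; `poll_until`, `convert_old_style`, `convert_checked` and `read_flag` are hypothetical names.

```python
import time


def poll_until(read_flag, want, timeout_s, sleep_s=0.0001):
    """Poll read_flag() until it equals want. True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_flag() == want:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(sleep_s)


def convert_old_style(read_flag, timeout_s):
    # The poll result is dropped, so a timeout goes unnoticed.
    poll_until(read_flag, 1, timeout_s)


def convert_checked(read_flag, timeout_s, log):
    # Same poll, but a failure is now reported.
    if not poll_until(read_flag, 1, timeout_s):
        log.append("eocz timed out waiting high")
```

With a flag that never goes high, both variants wait for the same time, but only the second one leaves a trace.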

Still the question remains why it only occurs on the omap3530-600MHz. Potentially the
waiting times are too small. With the omap3530 errata open, I could not find a
related one.

But it is described in the TRM in section 7.4.6.2.1 Single Conversion Mode (CONTCONV = 0)
and Fig. 7-15.

According to this the SOC bit should be set to 1 and then we have to wait 11-14 cycles
of the 32768Hz clock (roughly 340-430µs, so a bit under 500µs) and then EOCZ should go
high. After the conversion ends it goes low and the data is valid (it already was
4 cycles before, but that is not relevant).
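
As a sanity check of the timing, the 11-14 cycle window can be converted to microseconds. A trivial Python calculation (the function name is just for illustration):

```python
CLK_HZ = 32768  # 32 kHz clock named above


def cycles_to_us(cycles, clk_hz=CLK_HZ):
    """Convert a number of clock cycles to microseconds."""
    return cycles * 1_000_000 / clk_hz


low_us = cycles_to_us(11)   # about 336 µs
high_us = cycles_to_us(14)  # about 427 µs
```

So one cycle is about 30.5µs and the whole window stays somewhat below the round 500µs figure. Any polling timeout for EOCZ going high has to be comfortably longer than the upper end of that window.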

What comes to my mind is that this readout is not locked against concurrent
access from multiple threads. AFAIK the bandgap ADC is used by multiple
kernel-internal and external clients. So if one starts a conversion and a second
one tries to start while the first is already running, these tests may interfere.
This would happen more often if I read out through sysfs in a loop.
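
Such a readout loop could look roughly like this in Python. The default path is the generic Linux thermal sysfs node (value in millidegrees Celsius); that zone 0 is the bandgap sensor on the Pandora is an assumption, and `hammer` is just a made-up helper name.

```python
import time


def read_temp_millicelsius(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read one temperature value (millidegrees Celsius) from a sysfs node."""
    with open(path) as f:
        return int(f.read().strip())


def hammer(path, count, delay_s=0.0):
    """Read path count times; return (values, number_of_failed_reads)."""
    values, errors = [], 0
    for _ in range(count):
        try:
            values.append(read_temp_millicelsius(path))
        except (OSError, ValueError):
            errors += 1
        time.sleep(delay_s)
    return values, errors
```

Running something like `hammer("/sys/class/thermal/thermal_zone0/temp", 1000)` in two shells at the same time while watching dmesg should show whether concurrent readers make the timeout more frequent.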

BTW: there is a spinlock for reading the result in ti_bandgap_read_temperature()
but it does not cover triggering a readout.
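
To illustrate what I mean, here is a toy model in Python (threads instead of kernel clients, a sleep instead of the ADC conversion). Everything in it is hypothetical (`FakeBandgap`, `run_clients`, the 1ms conversion time); it only shows that a lock has to cover trigger, wait and readout as one unit to keep two clients from colliding.

```python
import threading
import time


class FakeBandgap:
    """Toy single-conversion ADC; not the real hardware or driver."""

    def __init__(self):
        self.lock = threading.Lock()
        self.busy = False
        self.collisions = 0

    def _convert(self):
        if self.busy:             # someone else is already converting
            self.collisions += 1
        self.busy = True          # "set SOC"
        time.sleep(0.001)         # "wait for EOCZ"
        self.busy = False
        return 42                 # "read the result"

    def read_unlocked(self):
        return self._convert()

    def read_locked(self):
        # Lock covers trigger, wait and readout together.
        with self.lock:
            return self._convert()


def run_clients(read, n_threads=4, n_reads=5):
    """Call read() n_reads times from each of n_threads threads."""
    def worker():
        for _ in range(n_reads):
            read()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With `read_locked` the collision counter stays at 0; with `read_unlocked` it will usually be above 0. In the kernel a spinlock must not be held across a sleeping wait, so covering the whole sequence would probably need a mutex or a busy-wait instead.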

BR,
Nikolaus


