[Letux-kernel] [PATCH 2/5] w1: omap-hdq: fix interrupt handling which did show spurious timeouts

Sat May 9 18:26:24 CEST 2020

Hi Andreas,

> Am 08.05.2020 um 22:53 schrieb Andreas Kemnade <andreas at kemnade.info>:
> 
> Hi,
> 
> On Fri,  8 May 2020 16:28:17 +0200
> "H. Nikolaus Schaller" <hns at goldelico.com> wrote:
> 
>> after
>> 
>> commit 27d13da8782a ("w1: omap-hdq: Simplify driver with PM runtime autosuspend")
>> 
>> was applied, we did see timeouts and wrong values when
>> reading a bq27000 connected to hdq of the omap3. This occurred
>> mainly after boot and sometimes settled down after several reads
>> indicating ignored interrupts.
>> 
>> root at letux:~# time cat /sys/class/power_supply/bq27000-battery/uevent
>> POWER_SUPPLY_NAME=bq27000-battery
>> POWER_SUPPLY_STATUS=Discharging
>> POWER_SUPPLY_PRESENT=1
>> POWER_SUPPLY_VOLTAGE_NOW=0
>> POWER_SUPPLY_CURRENT_NOW=0
>> POWER_SUPPLY_CAPACITY=0
>> POWER_SUPPLY_CAPACITY_LEVEL=Normal
>> POWER_SUPPLY_TEMP=-2731
>> POWER_SUPPLY_TIME_TO_EMPTY_NOW=0
>> POWER_SUPPLY_TIME_TO_EMPTY_AVG=0
>> POWER_SUPPLY_TIME_TO_FULL_NOW=0
>> POWER_SUPPLY_TECHNOLOGY=Li-ion
>> POWER_SUPPLY_CHARGE_FULL=0
>> POWER_SUPPLY_CHARGE_NOW=0
>> POWER_SUPPLY_CHARGE_FULL_DESIGN=0
>> POWER_SUPPLY_CYCLE_COUNT=0
>> POWER_SUPPLY_ENERGY_NOW=0
>> POWER_SUPPLY_POWER_AVG=0
>> POWER_SUPPLY_HEALTH=Good
>> POWER_SUPPLY_MANUFACTURER=Texas Instruments
>> 
>> real    0m15.761s
>> user    0m0.001s
>> sys     0m0.025s
>> root at letux:~#
>> 
>> Sometimes the effect did disappear after trying multiple
>> times and speed went up and results became correct.
>> 
>> Enabling debugging revealed that there were tx and
>> rx timeouts, i.e. the driver does not always respond
>> properly to interrupts.
>> 
>> This patch improves interrupt handling to avoid races
>> and loss of interrupt flags.
>> 
>> The ideas are:
>> * only the hdq_isr() sets bits in hdq_status
>> * and does wake_up()
>> * bits are only reset by the read/write/break functions
>>  if they were waited for
>> * rx/tx/timeout bits are completely decoupled from each
>>  other (and not reset all after waiting for one of them)
> 
> how should that get mixed up? Don't we have a
> break-write-read or break-write-write sequence everywhere which
> must be somehow atomic, like on the i2c bus a
> start-chipaddress-data-data...-stop must be atomic?
> Maybe we are fixing symptoms here? Is the real problem just a layer up?

The isr reads the register for all 3 bits and stores the
state in hdq_status. This then resets all 3 bits in the interrupt
status register of the hdq controller.

If there is a second interrupt (for a different bit) before all
bits in hdq_status have been processed, the original bit
is no longer found in hdq_status.

This is why I make the hdq_status set only.

And the hdq_read/write processing resets only those bits that
have been processed.

> 
>> * which bits to reset is specified by a new parameter
>>  to hdq_reset_irqstatus()
>> * hdq_reset_irqstatus() also returns the state before
>>  resetting so that we can encapsulate the spinlock
>> * this should now handle the case that the write and read
>>  are both already finished quickly before the hdq_write_byte()
>>  ends. Old code may have reset all status bits making
>>  the next hdq_read_byte() timeout
>> * the spinlock protects the reset of bits in function
>>  hdq_reset_irqstatus() which could be a read-write-modify
>>  problem if the interrupt handler tries to read-modify-write
>>  exactly at the same moment
>> * add mutex protection also for hdq_write_byte() just to
>>  be safe not to disturb a hdq_read_byte() triggered by
>>  some other thread/process.
>> 
> And again, this looks like some locking needs to be done one layer up?

There is locking. hqd_read/write/break are locked by a mutex.
And modifications of the hdq_status are now spin-locked.

Indeed some locking was missing but should now be complete.

> 
> In summary: I do not think this worsens anything, so at least it should
> be a starting point for discussion.
> 
> Regards,
> Andreas

BR and thanks,
Nikolaus