[Letux-kernel] [PATCH 2/5] w1: omap-hdq: fix interrupt handling which did show spurious timeouts

Fri May 8 22:53:15 CEST 2020

Hi,

On Fri,  8 May 2020 16:28:17 +0200
"H. Nikolaus Schaller" <hns at goldelico.com> wrote:

> after
> 
> commit 27d13da8782a ("w1: omap-hdq: Simplify driver with PM runtime autosuspend")
> 
> was applied, we did see timeouts and wrong values when
> reading a bq27000 connected to hdq of the omap3. This occurred
> mainly after boot and sometimes settled down after several reads
> indicating ignored interrupts.
> 
> root at letux:~# time cat /sys/class/power_supply/bq27000-battery/uevent
> POWER_SUPPLY_NAME=bq27000-battery
> POWER_SUPPLY_STATUS=Discharging
> POWER_SUPPLY_PRESENT=1
> POWER_SUPPLY_VOLTAGE_NOW=0
> POWER_SUPPLY_CURRENT_NOW=0
> POWER_SUPPLY_CAPACITY=0
> POWER_SUPPLY_CAPACITY_LEVEL=Normal
> POWER_SUPPLY_TEMP=-2731
> POWER_SUPPLY_TIME_TO_EMPTY_NOW=0
> POWER_SUPPLY_TIME_TO_EMPTY_AVG=0
> POWER_SUPPLY_TIME_TO_FULL_NOW=0
> POWER_SUPPLY_TECHNOLOGY=Li-ion
> POWER_SUPPLY_CHARGE_FULL=0
> POWER_SUPPLY_CHARGE_NOW=0
> POWER_SUPPLY_CHARGE_FULL_DESIGN=0
> POWER_SUPPLY_CYCLE_COUNT=0
> POWER_SUPPLY_ENERGY_NOW=0
> POWER_SUPPLY_POWER_AVG=0
> POWER_SUPPLY_HEALTH=Good
> POWER_SUPPLY_MANUFACTURER=Texas Instruments
> 
> real    0m15.761s
> user    0m0.001s
> sys     0m0.025s
> root at letux:~#
> 
> Sometimes the effect did disappear after trying multiple
> times and speed went up and results became correct.
> 
> Enabling debugging revealed that there were tx and
> rx timeouts, i.e. the driver does not always respond
> properly to interrupts.
> 
> This patch improves interrupt handling to avoid races
> and loss of interrupt flags.
> 
> The ideas are:
> * only the hdq_isr() sets bits in hdq_status
> * and does wake_up()
> * bits are only reset by the read/write/break functions
>   if they were waited for
> * rx/tx/timeout bits are completely decoupled from each
>   other (and not reset all after waiting for one of them)

how should that get mixed up? Don't we have a
break-write-read or break-write-write sequence everywhere which
must be somehow atomic, like on the i2c bus a
start-chipaddress-data-data...-stop must be atomic?
Maybe we are fixing symptoms here? Is the real problem just a layer up?

> * which bits to reset is specified by a new parameter
>   to hdq_reset_irqstatus()
> * hdq_reset_irqstatus() also returns the state before
>   resetting so that we can encapsulate the spinlock
> * this should now handle the case that the write and read
>   are both already finished quickly before the hdq_write_byte()
>   ends. Old code may have reset all status bits making
>   the next hdq_read_byte() timeout
> * the spinlock protects the reset of bits in function
>   hdq_reset_irqstatus() which could be a read-write-modify
>   problem if the interrupt handler tries to read-modify-write
>   exactly at the same moment
> * add mutex protection also for hdq_write_byte() just to
>   be safe not to disturb a hdq_read_byte() triggered by
>   some other thread/process.
> 
And again, this looks like some locking needs to be done one layer up?

In summary: I do not think this worsens anything, so at least it should
be a starting point for discussion.

Regards,
Andreas