[lava][lava-docker] Jobs deadlock

Description

Take this job:

https://lava.automotivelinux.org/scheduler/job/332

 

It gets stuck half-way and stays like this for >23h.

Isn't there a global job timeout ?
Why does the job lock-up. My hinch is a packet drop on the zmq pipe to the master that causes the connection to reset somehome. Then the master cannot re-aquire the correct job state / log (or vice versa, the slave possibly terminates correctly but it is not captured by the master).

Result is that a long queue builds-up in jenkins for test-jobs waiting for these to finish.

 

Environment

None

Attachments

2

Activity

Show:

Corentin Labbe 
October 23, 2018 at 11:52 AM

Fixed in next upstream version.

Corentin Labbe 
October 23, 2018 at 11:52 AM

Upstream think they have found the reason:

https://lists.lavasoftware.org/pipermail/lava-users/2018-October/001328.html

We will have the fix in the next release.

Ryan Day 
September 25, 2018 at 3:57 PM
(edited)

Jan-Simon Moeller 
September 25, 2018 at 3:56 PM

Lab:

Corentin Labbe 
September 25, 2018 at 9:52 AM

This is the upstream report.

https://lists.linaro.org/pipermail/lava-users/2018-September/001264.html

Could I post to this list your job link ?

Could you confirm the version by running  dpkg -l |grep lava inside docker ? (shoud be 2018.7-1+stretch)

Fixed

Details

Assignee

Reporter

Labels

Contract ID

Components

Priority

Created September 22, 2018 at 8:53 PM
Updated December 7, 2018 at 8:56 PM
Resolved October 23, 2018 at 11:52 AM