Message boards : Number crunching : ubuntu cuda100 not surviving restart of client

Profile JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,331,546,800
RAC: 0
Message 53165 - Posted: 27 Nov 2019 | 20:09:31 UTC
Last modified: 27 Nov 2019 | 20:09:54 UTC

Restarted the client and lost all 3 Linux cuda100 tasks. I did not realize this was a problem.

I probably should have suspended them all before restarting BOINC. This is unfortunate, as I don't always get GPUGrid Linux tasks, and I hate to lose the few I do get this way.


Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 53173 - Posted: 27 Nov 2019 | 22:50:09 UTC - in response to Message 53165.
Last modified: 27 Nov 2019 | 22:51:22 UTC

The reason for this error is in the stderr output of the task:
<core_client_version>7.16.1</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
09:41:49 (11866): wrapper (7.7.26016): starting
09:41:49 (11866): wrapper (7.7.26016): starting
09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)
13:57:59 (13231): wrapper (7.7.26016): starting
13:57:59 (13231): wrapper (7.7.26016): starting
13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)
ERROR: /home/user/conda/conda-bld/acemd3_1570536635323/work/src/mdsim/context.cpp line 322: Cannot use a restart file on a different device!
13:58:05 (13231): acemd3 exited; CPU time 5.243312
13:58:05 (13231): app exit status: 0x9e
13:58:05 (13231): called boinc_finish(195)
</stderr_txt>
]]>
This can happen only on hosts with multiple GPUs; it is a known bug of the ACEMD3 app. In the output above, the task first ran on device 1, was started on device 0 after the restart, and ACEMD3 refused to use its restart file on a different device.
To work around this, you should:
1. make a note of which task is running on which device
2. suspend all GPUGrid tasks (first the ones that are not running ["ready to start"])
3. restart your host
4. resume your GPUGrid tasks in the order of their device numbers (the task that was running on device 0 should be resumed first, and so on)
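
For anyone who wants to check their own failed tasks, here is a rough Python sketch of the two bookkeeping parts of the above: spotting a device switch in a task's stderr output, and sorting your task-device notes into resume order. It only reads text and sorts notes; it does not talk to BOINC. The function names and the task names in the example are made up for illustration, and the pattern assumes the wrapper lines look like the ones in the stderr output quoted above.

```python
import re

# Matches wrapper lines of the form seen in the stderr output, e.g.
#   13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)
DEVICE_RE = re.compile(r"wrapper: running acemd3 \(.*?--device (\d+)\)")


def device_history(stderr_text):
    """Return the device number of each acemd3 start, in log order."""
    return [int(n) for n in DEVICE_RE.findall(stderr_text)]


def switched_device(stderr_text):
    """True if the task was started on more than one device."""
    return len(set(device_history(stderr_text))) > 1


def resume_order(task_devices):
    """Sort task names by the device they were running on (step 4).

    task_devices is a dict of hand-made notes: {task name: device number}.
    """
    return sorted(task_devices, key=task_devices.get)


if __name__ == "__main__":
    log = (
        "09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)\n"
        "13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)\n"
    )
    print(device_history(log))   # device of each start, oldest first
    print(switched_device(log))  # did the device change between starts?

    # Hypothetical notes from step 1 (placeholder task names).
    notes = {"task_a": 1, "task_b": 0, "task_c": 2}
    print(resume_order(notes))   # resume in this order after the restart
```

If `switched_device` reports a change for a task that failed with exit code 195, it is most likely this same restart-file problem.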

Profile JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,331,546,800
RAC: 0
Message 53174 - Posted: 27 Nov 2019 | 23:29:14 UTC - in response to Message 53173.
Last modified: 27 Nov 2019 | 23:33:28 UTC


Thanks, I was not aware of that! It is going to be a real problem, as there is a Windows 10 "feature 1909" update pending. However, Ubuntu will be unaffected.

Not sure if you noticed, but my "El Cheapo" P102-100 mining card "D1" is far and away faster than the 1660Ti "D0" and especially the GTX-1070 "D2".


GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d1) 99.87 02:30:22 (02:30:10) 04:16:50 57.000 Running tb85-nvidia test449-TONI_GSNTEST3-6-100-RND1891_0 12/2/2019 9:53:34 AM JStateson
GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d0) 99.91 02:30:20 (02:30:12) 04:40:43 53.000 Running tb85-nvidia initial_1911-ELISA_GSN4V1-9-100-RND1684_0 12/2/2019 11:52:22 AM JStateson
GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d2) 99.89 02:30:19 (02:30:09) 05:28:30 45.000 Running tb85-nvidia initial_1243-ELISA_GSN4V1-1-100-RND2537_0 12/2/2019 1:44:26 PM JStateson


All 3 tasks above started within 3 seconds of each other (elapsed time about 2:30:19). My guess is that the mining card will finish an hour ahead of the 1660Ti and 2 hours ahead of the 1070.
