Message boards : Number crunching : App restarts after being suspended and restarted.
Author | Message |
---|---|
I suspended a Python Apps for GPU hosts 4.04 (cuda1131) task for a few minutes to allow some other tasks to finish, and the Time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days. | |
ID: 59666 | Rating: 0 | |
"I suspended a Python Apps for GPU hosts 4.04 (cuda1131) for a few minutes to allow some other tasks to finish, and the Time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days." The tasks do in fact checkpoint. After a restart it takes a few minutes, depending on the speed of the system, to replay computations back to the last checkpoint. The task will first display a low percentage and then jump forward to the percentage at the last checkpoint. You can check when the last checkpoint was written by viewing the task properties in the Manager sidebar. On Windows hosts I have heard that stopping a task midstream and restarting it can often hang the task. If you see this verbiage repeating over and over, once for every restart, in the stderr.txt file in the slot that the running task occupies: `Starting!!` `Define rollouts storage` `Define scheme` `Created CWorker with worker_index 0` `Created GWorker with worker_index 0` `Created UWorker with worker_index 0` `Created training scheme.` `Define learner` `Created Learner.` `Look for a progress_last_chk file - if exists, adjust target_env_steps` `Define train loop` `11:06:52 (6450): wrapper (7.7.26016): starting` `11:06:54 (6450): wrapper (7.7.26016): starting` `11:06:54 (6450): wrapper: running bin/python (run.py)` then the task is likely hung. You can either try restarting the host and BOINC to see if you can persuade it back into running, or abort it, get another task, and try not to interrupt it. | |
ID: 59667 | Rating: 0 | |
It also writes logs to wrapper_run.out | |
ID: 59669 | Rating: 0 | |
jm7 - It's not the checkpoints that are the problem. Your tasks are failing with multiple different errors. Your system needs a GPU driver update and a few tune-up adjustments. | |
ID: 59674 | Rating: 0 | |
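
The stderr.txt check described in the thread could be automated roughly as follows. This is only a sketch: the marker string is copied from the log excerpt quoted above, while the function names, the slot directory argument, and the threshold of 3 repeats are assumptions made for illustration and are not part of BOINC or the app. A high count only suggests a hang; it should be read together with the checkpoint time shown in the task properties.

```python
from pathlib import Path

# Marker copied from the log excerpt quoted in the thread.
RESTART_MARKER = "wrapper: running bin/python (run.py)"


def count_restarts(log_text: str, marker: str = RESTART_MARKER) -> int:
    """Count the log lines that contain the restart marker."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def looks_hung(log_text: str, threshold: int = 3) -> bool:
    """Heuristic: many repeated restarts suggest a hung task.

    The threshold is an arbitrary assumption, not a documented value.
    """
    return count_restarts(log_text) >= threshold


def check_slot(slot_dir: str, threshold: int = 3) -> bool:
    """Read stderr.txt from a slot directory (path supplied by the caller)."""
    log_path = Path(slot_dir) / "stderr.txt"
    return looks_hung(log_path.read_text(errors="replace"), threshold)
```

To use it, pass the path of the slot directory that the running task occupies to `check_slot`; where that directory lives depends on the BOINC installation.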