Message boards : Wish list : Revamping the failed task routine
Perhaps at some stage the present task allocation and reporting system could be developed to incorporate the ability to use partially completed (failed) tasks?
ID: 18549
Wouldn't that require a full list of all hardware components, drivers used, software installed & a log of every single process prior to the failure? Even after that, wouldn't the brand & revision, clock rates, voltages & temperatures, age & condition still be factors that also need to be known?
ID: 18551
It's just a series of calculations, so I don't think so; it does not matter which calculator you use, just that you start from the correct place. So it may be possible to pick it up from the last checkpoint.
ID: 18555
Oops, my bad! I totally misunderstood. I thought it was some kind of exaggerated log of everything that happened prior to the failure. But something curious I noticed once: I aborted a WU at 7% that my GPU couldn't complete in time, and the new WU that followed started at 7% and, of course, failed. What made me curious was that it took 30 minutes to fail, and those were two totally different WUs. If it had run all the way, and I only got an error message after my GPU ran junk all the way up to 100%, that would be an even bigger waste. Would there be any chance of that?
ID: 18557
As far as I know, a lot of randomization takes place in this kind of simulation, so if you restart a (failed) task with any overlap (even from the last checkpoint), the overlapping part wouldn't be the same as the original. Therefore this method cannot serve as a stability test between different PC+GPU+OS+driver systems. Maybe the same system wouldn't fail if the client restarted from the last checkpoint, or rebooted the OS and then restarted the task from the checkpoint. So there is no point in having any bigger overlap than the last checkpoint. But sending back the partial result (with a detailed error report), and receiving proportional credit for it (without the time bonus), would be nice and useful though (rosetta@home works this way). The real problem is the further processing of the partial result. Only the project developers know how difficult that would be to make work, or whether it's worth the effort at all.
ID: 18558
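The randomization point above hinges on whether the simulator's random-number-generator state is part of the checkpoint. A minimal Python sketch (a toy illustration, not GPUGrid's actual checkpoint format; all names here are made up) shows that if the RNG state is saved alongside the particle state, a restart from the checkpoint reproduces the original trajectory exactly; if it isn't, the overlapping segment diverges:

```python
import pickle
import random

def save_checkpoint(path, step, positions, rng):
    # Persist the step counter, the state vector, AND the RNG's
    # internal state, so a restart replays the same random draws.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "positions": positions,
                     "rng_state": rng.getstate()}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        data = pickle.load(f)
    rng = random.Random()
    rng.setstate(data["rng_state"])
    return data["step"], data["positions"], rng

def advance(positions, rng):
    # Toy "simulation" step: random perturbation of each coordinate.
    return [x + rng.uniform(-1.0, 1.0) for x in positions]

# Run 5 steps, checkpoint, then run 5 more.
rng = random.Random(42)
pos = [0.0, 0.0]
for _ in range(5):
    pos = advance(pos, rng)
save_checkpoint("ckpt.pkl", 5, pos, rng)
for _ in range(5):
    pos = advance(pos, rng)
original_tail = pos

# Restart from the checkpoint: because the RNG state was saved,
# the re-run segment matches the original run exactly.
step, pos2, rng2 = load_checkpoint("ckpt.pkl")
for _ in range(step, 10):
    pos2 = advance(pos2, rng2)
assert pos2 == original_tail
```

If `rng_state` were left out of the checkpoint and a fresh generator used on restart, the final assertion would fail, which is exactly why a restarted segment "wouldn't be the same as the original one" unless the client deliberately checkpoints the RNG state too.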
Hi all, I haven't seen a GPUGrid task fail yet, maybe because I updated to CUDA 3.1.
ID: 18559
An 8500GT is too slow; 16 shaders are not enough, especially for a Compute Capability 1.1 card.
ID: 18561
Thanks,
ID: 32986
Probably won't do that, because it mucks up the post-simulation data analysis if trajectory files are all uneven lengths. MJH
ID: 33010
Anyway, the new failure recovery should take care of most of the situations the initial suggestion targeted. The main difference is that the original host itself now retries from the last checkpoint, instead of another host as suggested.
ID: 33020
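The retry-on-the-same-host scheme described above can be sketched in a few lines of Python (a hypothetical illustration only; GPUGrid's client is not written this way, and `run_with_recovery`, `FlakyTask`, and the percentages are invented for the example). The key point is that on failure the client resumes from the last known-good checkpoint rather than from zero:

```python
def run_with_recovery(task, max_retries=3):
    """Run `task` to completion, retrying in place after failures.

    `task(start)` advances the work from progress `start` and either
    returns (new_progress, done) or raises RuntimeError on a crash.
    """
    checkpoint = 0          # last known-good progress (percent)
    attempts = 0
    while attempts <= max_retries:
        try:
            checkpoint, done = task(checkpoint)
            if done:
                return checkpoint
        except RuntimeError:
            attempts += 1   # resume from `checkpoint`, not from zero
    raise RuntimeError("task failed after %d retries" % max_retries)

class FlakyTask:
    """Toy workload: advances 25%% per call, crashing a few times first."""
    def __init__(self, fail_times):
        self.failures_left = fail_times
    def __call__(self, start):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("simulated GPU error")
        progress = min(start + 25, 100)
        return progress, progress >= 100

result = run_with_recovery(FlakyTask(fail_times=2))
assert result == 100
```

Under this scheme two early crashes cost only the re-run of the failed segment, which matches the behaviour described above: the same host picks the task up again from its last checkpoint instead of the work being reissued to a different machine.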