Message boards : Graphics cards (GPUs) : Working Unit Hanging...different than others reported?
Had to cancel a workunit that hung at about 85% complete (see here). Was curious whether this is a different error from the others reported, since 1) it is not one of the KASHIF_HIVPR ones--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe the driver is 178.24, so definitely not the 185.xx issues.
ID: 9926 | Rating: 0 | rate: / Reply Quote | |
> Had to cancel a workunit that hung at about 85% complete (see here). Was curious whether this is a different error from the others reported, since 1) it is not one of the KASHIF_HIVPR ones--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe the driver is 178.24, so definitely not the 185.xx issues.

OK, we KNOW that 6.6.20 hangs work units badly, but I have seen it with other versions too. The problem is that we do NOT know what is causing this, so there is no way to tell for sure which version the problem was introduced in... or, to put it another way, you could be seeing the earliest occurrence of this bug. Try a reboot; if it is the 6.6.20 problem, you will likely start to see an increase in speed. USUALLY the time to completion will start to drop several seconds per second if this WAS the long-run bug...
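If you want to check that without staring at the manager, here is a rough sketch that asks boinccmd for the remaining-time estimates twice and works out how fast they are falling. It assumes Python, that boinccmd is on your PATH, and that its output still labels the field "estimated CPU time remaining" -- check your version before trusting it:

```python
import subprocess
import time

POLL_SECONDS = 60  # gap between the two readings

def remaining_estimates():
    """Parse `boinccmd --get_tasks` into {task name: estimated seconds remaining}."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    estimates, current = {}, None
    for raw in out.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("estimated CPU time remaining:") and current:
            estimates[current] = float(line.split(":", 1)[1])
    return estimates

before = remaining_estimates()
time.sleep(POLL_SECONDS)
after = remaining_estimates()

for task, first in before.items():
    if task in after:
        rate = (first - after[task]) / POLL_SECONDS
        # ~1 second per second is normal progress; several seconds per second is
        # the "catch up" behaviour you tend to see after restarting a stuck task.
        print(f"{task}: estimate dropping at about {rate:.1f} s per wall-clock second")
```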
ID: 9929 | Rating: 0 | rate: / Reply Quote | |
I returned home 2 hours ago and found another so-called big unit stuck at 7.480% for a long time, at least the 2 hours I have been at home.
ID: 9931 | Rating: 0 | rate: / Reply Quote | |
Drat... I had already aborted that one before I thought about it being potentially different from the already known problem (or potentially informative as an earlier-version example). That machine has already completed another unit--a RAUL unit--in typical time, without a restart.
ID: 9934 | Rating: 0 | rate: / Reply Quote | |
Scott, that is what is making this bug so much fun. 6.6.20 was unique in that it affected nearly 50% of the tasks I ran on it. I think I have seen it on 6.6.23 or .25... not sure... but there are those odd long tasks, so sometimes it is hard to know for sure until they are done. Sadly, you cannot always tell by the names... or I can't remember the key... At any rate, it is still on my list of things to look for... I found one more pointer today, not that it will do much good...
ID: 9935 | Rating: 0 | rate: / Reply Quote | |
ID: 9965 | Rating: 0 | rate: / Reply Quote | |
Have you tried just stopping and restarting BOINC?
ID: 9968 | Rating: 0 | rate: / Reply Quote | |
You're right, Paul, BUT... this is at least the second report of repeatedly hanging tasks with different WUs on 6.4.7. It looks like *something* is up, but it does not seem to be "the 6.6.20 problem".
ID: 9973 | Rating: 0 | rate: / Reply Quote | |
> You're right, Paul, BUT... this is at least the second report of repeatedly hanging tasks with different WUs on 6.4.7. It looks like *something* is up, but it does not seem to be "the 6.6.20 problem".

I did not think that I said it was... I said we don't know enough... and that it is a possibility... In the meantime, try these things... :)

The more I dig into the Resource Scheduler and ponder the implications of the code buried therein, the less sanguine I get about how this system works. Richard Haselgrove has documented a problem on SaH where CUDA tasks are started before they are initialized... and the task of course immediately crashes... also not this problem, but it is a flaw in the way resources and tasks are managed. I am trying to gather data for another error I am calling "silent restart", which may or may not be related to long-running tasks.

The fundamental problem is that there is too much we do not know... and too many times this last month I have dug deep into a problem and found that the error is one that has been plaguing us for years. Meaning what? Meaning that superficial changes in some of the code may cause problems of long standing to slip in and out of view. The part of the code that I am worried about has not changed in a long time that I know of... which means that what was a disaster in 6.6.20 may only be a mild annoyance in other versions... but the bug is still there...

As proof of my case, the "no heartbeat" and "too many restarts" errors have been longstanding problems where people would lose tasks, and we had been pulling our hair out trying to figure out what was causing them... well, I now know of two different potential causes. Neither is related to the tasks that were being mangled. Or, to put it another way, we were looking in the wrong places...

{edit} As an example of how this can happen: take 6.4.7 (or actually any version of BOINC), a specific task, and a specific card... the task hits a particular point in its loops and takes slightly longer to get through the loop than expected. The science application does not send its heartbeat message in time, BOINC shoots the application and relaunches it at the prior checkpoint, which means you could see very little advancement of the task because it is being killed and restarted all the time.

One quick way to see if this is happening is to watch the PID of the processes running under BOINC; if the one for GPUGrid keeps increasing then... (you have to turn on the additional column in the View menu of Task Manager for Windows).

Again, we don't know why 6.6.20 caused many tasks to seemingly run forever, and though here we are concerned with the GPU, I also have experience with a system where it was happening to a CPU-class task. And I am pretty sure I saw it on a task run with 6.6.23...
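If you would rather not sit watching Task Manager, here is a rough sketch of the same PID check in Python. It assumes the psutil module is installed, and the "acemd" substring is only my guess at the GPUGrid app's executable name, so substitute whatever your task actually runs:

```python
import time
import psutil  # third-party module: pip install psutil

APP_SUBSTRING = "acemd"   # hypothetical fragment of the app's exe name; change to match yours
POLL_SECONDS = 30

def app_pids():
    """Return the set of PIDs whose process name contains APP_SUBSTRING."""
    pids = set()
    for proc in psutil.process_iter(["pid", "name"]):
        name = proc.info["name"] or ""
        if APP_SUBSTRING.lower() in name.lower():
            pids.add(proc.info["pid"])
    return pids

seen = app_pids()
restarts = 0
while True:
    time.sleep(POLL_SECONDS)
    current = app_pids()
    new_pids = current - seen
    if new_pids:
        restarts += len(new_pids)
        # A new PID for the same task means BOINC killed and relaunched it --
        # the kill/restart-at-checkpoint loop described above.
        print(f"new app PID(s) {sorted(new_pids)}; {restarts} restart(s) seen so far")
    seen |= current
```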
ID: 9978 | Rating: 0 | rate: / Reply Quote | |
OK, sorry, so I read too much into what you actually wrote. Apart from this... still agreed ;)
ID: 10017 | Rating: 0 | rate: / Reply Quote | |
> OK, sorry, so I read too much into what you actually wrote. Apart from this... still agreed ;)

No worries... I did not take offense or get bugged... :) Just trying to keep us all on the same page...

Though my input is discounted, debugging software is something I have been doing for some 34 years, all the way from assembly language programs to Ada. Even when I play computer games I rarely play to win; I usually play to kill time and to learn how the AI cheats...

What concerns me with this area of code is that I suspect the same fundamental flaw is presenting itself under different guises... so we see what we think are 3-4 problems and it is really one flaw. The problem is that our diagnostic tools are very limited and hard to use.

Anyway, onward...
ID: 10036 | Rating: 0 | rate: / Reply Quote | |
Hello,
ID: 10182 | Rating: 0 | rate: / Reply Quote | |
All tasks have a drop-dead time... the problem is that if the task or machine is hung, this may or may not be detected. Without knowing more it is hard to say what is going on. If the task is "running", it will eventually get to the drop-dead time and quit as running too long.
ID: 10191 | Rating: 0 | rate: / Reply Quote | |
OK.
ID: 10193 | Rating: 0 | rate: / Reply Quote | |
OK. The short answer is that I am not sure. It is not a set time per se; it is the "max CPU time exceeded" limit, which is a function of the speed of the system. I don't know what option the project selected here, because this is about the first time we have had this issue...

If you have BAM access you could detach and reattach through BAM, and that would clear the task... since this is a remote machine, that is the only thing I can think of so you can get back to being productive. I cannot remember if you can do a project reset through BAM or not... that is the other option to try, if it is available.
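For what it is worth, my recollection (treat it as an assumption, not gospel) is that the limit comes out of the workunit's rsc_fpops_bound divided by the host's benchmarked speed, roughly like this:

```python
# Back-of-the-envelope sketch of how I understand the limit is derived (from
# memory, so treat it as an assumption): the workunit's <rsc_fpops_bound> from
# client_state.xml divided by the host's benchmarked floating-point speed.
# Both numbers below are made up purely for illustration.
rsc_fpops_bound = 1.0e15   # hypothetical operation bound for the workunit
host_p_fpops = 2.5e9       # hypothetical benchmark result for one CPU core

max_cpu_seconds = rsc_fpops_bound / host_p_fpops
print(f"task would hit the limit after roughly {max_cpu_seconds / 3600:.0f} hours of CPU time")
```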
ID: 10197 | Rating: 0 | rate: / Reply Quote | |