Message boards : Graphics cards (GPUs) : hERG: information and issues
Author | Message |
---|---|
Dear crunchers, | |
ID: 13850 | Rating: 0 | rate: / Reply Quote | |
Crunching issues. | |
ID: 13851 | Rating: 0 | rate: / Reply Quote | |
ID: 13869 | Rating: 0 | rate: / Reply Quote | |
Crunching issues. The TONI_HERG run fine on GTX 260 and above. On my 4 G92 based cards they almost always fail, so I now abort them on those cards when they arrive. Other WUs are much much better, most types never fail on any of the cards. | |
ID: 13870 | Rating: 0 | rate: / Reply Quote | |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. | |
ID: 13875 | Rating: 0 | rate: / Reply Quote | |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. I would put it more strongly than that - they have a high probability of failing, even if some succeed. And by 'age' of the card, you mean the technology generation they incorporate. I have three 9800GT series cards, all purchased in January this year. The straight 9800GTs are not overclocked, the 9800GTX+ runs on factory overclock settings. I haven't noticed any significant difference in failure rate between the cards: so I don't think the problem is related to (moderate) overclocking. Also, I've been running the same drivers (190.38, 32-bit WinXP) since July: the increased error rate has become apparent much more recently than that - late October, IIRC. So I'm not inclined to blame it on drivers, either. No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468 | |
ID: 13876 | Rating: 0 | rate: / Reply Quote | |
Hello, | |
ID: 13881 | Rating: 0 | rate: / Reply Quote | |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. Yes, 3 tasks did complete on the GTS 250, but there were too many failures. The clock settings are in fact factory settings, and yes, they are higher than on other cards, but the card is fairly new and the core sits at 66 degrees (5 fans on case, + GPU, CPU and PSU fans) and UPS! The GTS 250 success rates are much higher for other tasks. On the other hand, my 8800GTS 512MB G92 could not complete any TONI_HERG tasks. As there were so many being sent, I was down to an almost zero return for that card on the project. That card was also not able to handle other recent tasks too well. I guess it is down to the G92 core's limitations. My GTS250 spec: Palit card. 65nm, G92 rev A2. Bios 62.92.7D.00.10 11.9562, CUDA 3 (better than 2.3)! GPU @745, Memory @1000MHz, Shaders @1848MHz 754M Transistors. GPUGrid temp=66 Degrees C For Ref. Einstein temp=48 Degrees C (but that barely uses the GPU)! System: Q9400CPU @3.46GHz crunching other Boinc tasks (24/7, no outages as on UPS) and Win7 Pro 64bit. 4GB RAM plenty HDD space. I will allow it to try another Herg task. Report back tomorrow, hopefully! The GTX260 is still working well for all tasks, but that uses a GT200 A2. | |
ID: 13906 | Rating: 0 | rate: / Reply Quote | |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. As Richard states, "high probability of failing" is a better description. They will occasionally complete but usually fail. On the GTX 260 and above they run fine. BTW, they often fail on the new GTS 240 and GT 240 cards too, even with their 1.2 compute capability: http://www.gpugrid.net/result.php?resultid=1592578 http://www.gpugrid.net/result.php?resultid=1590198 http://www.gpugrid.net/result.php?resultid=1610106 | |
ID: 13925 | Rating: 0 | rate: / Reply Quote | |
My GTS250 managed to complete one! http://www.gpugrid.net/result.php?resultid=1625604 | |
ID: 13927 | Rating: 0 | rate: / Reply Quote | |
We are keeping eyes on the failure rate wrt card types (in absence of overclock). As said, the matter is puzzling because there should be no major difference with other WU types. For now, I reduced the number of HERG WUs out, and possibly I'll reduce their length a bit in order to increase the chances of correct termination. | |
ID: 13947 | Rating: 0 | rate: / Reply Quote | |
Almost all of the failures seem to be related to the infamous CUDA FFT bug, on which we have little to no control (i.e., errors in "pme" or "fft" kernels). Could you give us a little bit more detail about this bug, as this is the first time I've heard about it? It may only be "infamous" in developer circles. I'm aware of an infamous bug in the BOINC CUDA application which NVidia developed for SETI@home, but that just causes certain tasks ('VLAR') to run extremely slowly, and inhibits screen re-drawing while they're running. Apart from that, SETI is an extremely heavy user of FFTs at a wide range of problem sizes, and benefits enormously from the additional capabilities of cufft v2.3: I've not come across a single SETI task which has failed because of a CUDA FFT bug. | |
ID: 13951 | Rating: 0 | rate: / Reply Quote | |
It's a long standing issue that hits older cards especially hard. Please see here or here. For what concerns FFT being ok with SETI, in fact there are many types of FFT, and it's not surprising that the bug only manifests for some of them. | |
ID: 13953 | Rating: 0 | rate: / Reply Quote | |
It's a long standing issue that hits older cards especially hard. Please see here or here. For what concerns FFT being ok with SETI, in fact there are many types of FFT, and it's not surprising that the bug only manifests for some of them. I had hoped that you would direct me to a relevant discussion here. The only thing of relevance in those threads seems to be message 12734: We have contacted AGAIN Nvidia yesterday. That was almost three months ago, and is the very last post in the thread. Did he ever get a reply? | |
ID: 13954 | Rating: 0 | rate: / Reply Quote | |
Perhaps the FFT bug is being compounded by a mixture of G92/65nm cores and old firmware? | |
ID: 13955 | Rating: 0 | rate: / Reply Quote | |
great to see this thread!! thanks a lot! | |
ID: 13957 | Rating: 0 | rate: / Reply Quote | |
I can just repeat what I have already said somewhere in the forum. | |
ID: 13982 | Rating: 0 | rate: / Reply Quote | |
Had two TONI_HERG's fail. They were run on a GTX295 (single PCB variety, so the newer model). | |
ID: 14077 | Rating: 0 | rate: / Reply Quote | |
I can just repeat what I have already said somewhere in the forum. I've downloaded the Nvidia SDKs for the older CUDA versions. Are you interested in sending me the source code for the current Windows application and letting me check if whatever method you use to compile it also works with the older SDKs? Or would you prefer to download those SDKs yourself? I'd expect either method to produce versions with better support for some of the older Nvidia boards, IF they don't need major source code modifications to work at all. I intended to start learning enough CUDA that I could start helping a few BOINC projects start a GPU version, but so far it looks like I won't be ready to actually start modifying the code very soon. Another idea: Ask the BOINC developers to add more code for reporting the GPU chip type, in order to get more information about which of the older Nvidia boards are still usable. | |
ID: 14159 | Rating: 0 | rate: / Reply Quote | |
First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page. Since that means your software is now ready to handle a tetramer, here's some information on a trimer you're likely to be interested in as well: A trimer of the gp120 protein that the HIV-1 virus uses to enter human cells. If your software can handle docking of assorted compounds to that trimer and choose those that dock to the trimer without too much being wasted also docking to the single units of the gp120 protein elsewhere on the virus coat, you're likely to get the groups interested in HIV/AIDS research very interested in using your software. At this moment, I'm having trouble getting the links from one of my other computers to this one, but will post several related links if they look useful for you. Are you interested in getting enough grants that you will have to hire yet another researcher or two to handle them all? | |
ID: 14160 | Rating: 0 | rate: / Reply Quote | |
Well, if the problem with the Toni work units can't be solved, a workaround had better be made. In other projects the participants can choose what work they want to do. Let people choose the Toni work units if they want to, for instance if they know that their hardware does not fail on them. It is reasonable to give slightly higher credit for problematic work. If you want to experiment with new work units, make it a voluntary choice. | |
ID: 14163 | Rating: 0 | rate: / Reply Quote | |
Jari Pyyluoma, | |
ID: 14189 | Rating: 0 | rate: / Reply Quote | |
Thanx, that sounds like great news. I just happened to get some nvidia cards from a friend. Yes, I also have the feeling that I should pass them on, and now you have proven it with numbers. | |
ID: 14198 | Rating: 0 | rate: / Reply Quote | |
Sony apparently stopped allowing the use of Linux on the PS3, required to run GPUGrid. Lots of G92 cards are struggling due to a number of things outside the control of the project, including the reliance on NVidia for code. If there is a bug in their code and the project team are not allowed to correct it, there is nothing they can do. If the project team was larger I am sure they would be better equipped to make more changes, perhaps even write tasks for the older cards, but as is things appear to be tight. | |
ID: 14201 | Rating: 0 | rate: / Reply Quote | |
I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently foresee a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code. | |
ID: 14235 | Rating: 0 | rate: / Reply Quote | |
I'm sending a few more workunits of the HERG type. In the meanwhile, if you want to see images of what you are crunching, have a look at the flickr page! | |
ID: 14924 | Rating: 0 | rate: / Reply Quote | |
I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently foresee a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code. World Community Grid allows participants to choose workunits by making different types of workunits different subprojects and allowing the participants to choose which subprojects to run. Any particular reason why you can't do the same, even if it requires providing the same application program under more than one name? | |
ID: 14928 | Rating: 0 | rate: / Reply Quote | |
Well, bugs may come and go with cards, drivers, and versions of the application. We try to make all WUs run equally well, rather than fork (and maintain) separate queues. | |
ID: 14930 | Rating: 0 | rate: / Reply Quote | |
I also have another compute error with this type of workunit. | |
ID: 14951 | Rating: 0 | rate: / Reply Quote | |
As soon as the new application is available, I'll migrate the workunits to it, fingers crossed that it improves the situation. | |
ID: 14968 | Rating: 0 | rate: / Reply Quote | |
Feel free to see the new molecular images on flickr... | |
ID: 14969 | Rating: 0 | rate: / Reply Quote | |
The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience... | |
ID: 14981 | Rating: 0 | rate: / Reply Quote | |
I have had two crashes and no success on a stable card. | |
ID: 15000 | Rating: 0 | rate: / Reply Quote | |
Which of the two hosts? | |
ID: 15009 | Rating: 0 | rate: / Reply Quote | |
Another compute error with a GTX295 on this computer. | |
ID: 15017 | Rating: 0 | rate: / Reply Quote | |
Does not seem to be HERG specific, you also had an error on task 1816245 of another batch. | |
ID: 15018 | Rating: 0 | rate: / Reply Quote | |
.. No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468 Pleased to report that one of my 9800GT cards has successfully completed a TONI_HERG qext. | |
ID: 15052 | Rating: 0 | rate: / Reply Quote | |
Good! | |
ID: 15055 | Rating: 0 | rate: / Reply Quote | |
D1s30c47-TONI_HERGqext-2-60-RND0387_0 | |
ID: 15057 | Rating: 0 | rate: / Reply Quote | |
Thanks, well spotted. I tried to replace a parameter on the fly. Please post if it happens again. | |
ID: 15058 | Rating: 0 | rate: / Reply Quote | |
The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience... I see computation times of 11 h to 14.5 h on heavily overclocked GTX295 (700MHz)/GTX265 (750MHz) cards for the HERGqext (time per step: 62.932 ms, Example). The TONI_HERGext ran for only ~6.5 h (time per step: 37.026 ms, Example). "slightly? :-) longer computation" | |
ID: 15060 | Rating: 0 | rate: / Reply Quote | |
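As a rough cross-check of the timings quoted above, the two per-step figures can be turned into a slowdown factor and a projected runtime. This is only a sketch: it assumes both work unit types run the same number of steps, which the thread does not state, and the function name is made up for illustration.

```python
# Per-step timings and runtime quoted in the post above.
HERGQEXT_MS_PER_STEP = 62.932   # new HERGqext work units
HERGEXT_MS_PER_STEP = 37.026    # earlier TONI_HERGext work units
HERGEXT_RUNTIME_H = 6.5         # reported runtime of the earlier type


def projected_runtime_hours(old_runtime_h, old_ms_per_step, new_ms_per_step):
    """Scale a known runtime by the ratio of per-step times.

    Assumption (not from the thread): both work unit types run the
    same number of steps, so runtime scales linearly with time per step.
    """
    return old_runtime_h * (new_ms_per_step / old_ms_per_step)


slowdown = HERGQEXT_MS_PER_STEP / HERGEXT_MS_PER_STEP
projected = projected_runtime_hours(
    HERGEXT_RUNTIME_H, HERGEXT_MS_PER_STEP, HERGQEXT_MS_PER_STEP
)

print(f"slowdown factor: {slowdown:.2f}")
print(f"projected HERGqext runtime: {projected:.1f} h")
```

Under that assumption the slowdown is about 1.7x, which puts the projected runtime near the lower end of the 11 h to 14.5 h range reported above.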
I also noticed the increase, and that was higher than expected. This is what I was trying to fix... | |
ID: 15061 | Rating: 0 | rate: / Reply Quote | |
I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32: | |
ID: 15735 | Rating: 0 | rate: / Reply Quote | |
Does anyone have any idea on these? | |
ID: 15812 | Rating: 0 | rate: / Reply Quote | |
I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task unless there are no other tasks. | |
ID: 15814 | Rating: 0 | rate: / Reply Quote | |
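The allocation rule suggested above could be sketched roughly as follows. This is not how the BOINC scheduler or the GPUGrid server works; the class, the function, the minimum sample count and the `OTHER_WU` type name are all hypothetical, and only the 25% threshold comes from the post.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 0.25  # the "say 25%" figure from the post
MIN_SAMPLES = 4           # hypothetical: no verdict until a few results exist


class FailureStats:
    """Tracks failures per (GPU model, work unit type) pair."""

    def __init__(self):
        # (gpu, wu_type) -> [failures, total results]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, gpu, wu_type, succeeded):
        entry = self._counts[(gpu, wu_type)]
        entry[1] += 1
        if not succeeded:
            entry[0] += 1

    def failure_rate(self, gpu, wu_type):
        """Return the observed failure rate, or None with too little data."""
        failures, total = self._counts.get((gpu, wu_type), (0, 0))
        if total < MIN_SAMPLES:
            return None
        return failures / total


def pick_wu_type(stats, gpu, available_types):
    """Prefer a type at or below the threshold; otherwise take the least risky."""
    if not available_types:
        return None

    def rate(wu_type):
        observed = stats.failure_rate(gpu, wu_type)
        return 0.0 if observed is None else observed  # unknown counts as safe

    safe = [t for t in available_types if rate(t) <= FAILURE_THRESHOLD]
    if safe:
        return safe[0]
    # "unless there are no other tasks": fall back to the least risky type.
    return min(available_types, key=rate)


stats = FailureStats()
for outcome in (False, False, False, True):
    stats.record("9800GT", "TONI_HERG", outcome)

print(pick_wu_type(stats, "9800GT", ["TONI_HERG", "OTHER_WU"]))
print(pick_wu_type(stats, "9800GT", ["TONI_HERG"]))
```

Treating unknown combinations as safe lets new cards and new work unit types gather statistics; a stricter server could do the opposite and hold back untested pairs.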
Another slipped in while I was asleep: | |
ID: 15823 | Rating: 0 | rate: / Reply Quote | |
here's a bad WU: | |
ID: 15966 | Rating: 0 | rate: / Reply Quote | |
This bad one seems to have been created by some file transfer error. It should fail immediately. | |
ID: 15969 | Rating: 0 | rate: / Reply Quote | |
I have now had 3 in a row fail. | |
ID: 16173 | Rating: 0 | rate: / Reply Quote | |
I have one similar error after crunching 13 hours. | |
ID: 16195 | Rating: 0 | rate: / Reply Quote | |
A GTS250 is very similar (almost identical) to a 9800 GTX+ | |
ID: 16197 | Rating: 0 | rate: / Reply Quote | |
After 14 hours 25 minutes crashed. GTS 250 - driver 197.13 - windows xp. | |
ID: 16220 | Rating: 0 | rate: / Reply Quote | |
During the past weeks I had some hERG-WUs on my four 9800GT (Vista64) that stopped | |
ID: 16313 | Rating: 0 | rate: / Reply Quote | |