hERG: information and issues

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 13850 - Posted: 9 Dec 2009 \| 12:01:16 UTC Last modified: 9 Dec 2009 \| 13:27:53 UTC
	Dear crunchers,
	I'm starting this topic to collect information and feedback on the HERG workunits in a single place. The idea (under test) is to provide a quick-to-find reference for those of you curious about the purpose of the WUs you are crunching, and a place to report issues. This post, and the one below, may be updated from time to time.
	Scientific rationale. First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page. This complex of four proteins (a tetramer) is found in many of the body's cells, most notably in heart tissue, where it plays a very important role: it conducts charged particles (potassium ions), which flow through it cyclically, ultimately governing the heartbeat. The molecule is of particular interest because interference with its function, e.g. through unintended side effects of drugs or congenital mutations, can cause potentially fatal alterations in cardiac rhythm, including long QT syndrome. The curious may find an image of the tetramer on our Flickr photostream.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 13851 - Posted: 9 Dec 2009 \| 12:02:40 UTC Last modified: 9 Dec 2009 \| 18:18:42 UTC
	Crunching issues. The TONI_HERG workunits use the same parameters as many others. As far as we know, they have the same failure rate as other workunits, but I am trying to get some sounder statistics. If you see more HERG failures, it could be because there are many of those WUs out right now. [This post is reserved for future updates]

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 13869 - Posted: 10 Dec 2009 \| 18:49:47 UTC - in response to Message 13851.
	http://www.gpugrid.net/forum_thread.php?id=1506

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 13870 - Posted: 10 Dec 2009 \| 20:11:52 UTC - in response to Message 13851.
	Toni wrote: Crunching issues. The TONI_HERG workunits use the same parameters as many others. As far as we know, they have the same failure rate as other workunits, but I am trying to get some sounder statistics. If you see more HERG failures, it could be because there are many of those WUs out right now.
	The TONI_HERG workunits run fine on GTX 260 and above. On my four G92-based cards they almost always fail, so I now abort them on those cards when they arrive. Other WUs are much, much better; most types never fail on any of the cards.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 13875 - Posted: 11 Dec 2009 \| 11:07:57 UTC - in response to Message 13870. Last modified: 11 Dec 2009 \| 11:16:34 UTC
	So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. From what I see in SKGiven's task list for host 51279, he has had at least three TONI_HERG tasks complete successfully as well: 1572466, 1606985 and 1558388. BTW, isn't the card overclocked at 1.85 GHz?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 13876 - Posted: 11 Dec 2009 \| 11:48:07 UTC - in response to Message 13875.
	Toni wrote: So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.
	I would put it more strongly than that - they have a high probability of failing, even if some succeed. And by 'age' of the card, you mean the technology generation they incorporate. I have three 9800GT series cards, all purchased in January this year. The straight 9800GTs are not overclocked; the 9800GTX+ runs on factory overclock settings. I haven't noticed any significant difference in failure rate between the cards, so I don't think the problem is related to (moderate) overclocking.
	Also, I've been running the same drivers (190.38, 32-bit WinXP) since July: the increased error rate has become apparent much more recently than that - late October, IIRC. So I'm not inclined to blame it on drivers, either.
	No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468.

canardo Send message Joined: 11 Feb 09 Posts: 4 Credit: 8,675,472 RAC: 0 Level Scientific publications	Message 13881 - Posted: 11 Dec 2009 \| 17:36:57 UTC - in response to Message 13875.
	Hello, just have a look here at comp ID 26091. It worked fine until I upgraded to BOINC 6.10.18, although that might just coincide with the HERG units coming in. SETI & Einstein have no problems, though. Ciao, Jaak

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 13906 - Posted: 13 Dec 2009 \| 12:41:58 UTC - in response to Message 13875. Last modified: 13 Dec 2009 \| 13:33:06 UTC
	Toni wrote: So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. From what I see in SKGiven's task list for host 51279, he has had at least three TONI_HERG tasks complete successfully as well: 1572466, 1606985 and 1558388. BTW, isn't the card overclocked at 1.85 GHz?
	Yes, 3 tasks did complete on the GTS 250, but there were too many failures. The clock settings are in fact factory settings. Yes, they are higher than on other cards, but the card is fairly new and the core sits at 66 degrees (5 fans on the case, plus GPU, CPU and PSU fans), and the system is on a UPS! The GTS 250 success rates are much higher for other tasks.
	On the other hand, my 8800GTS 512MB G92 could not complete any TONI_HERG tasks. As so many were being sent, I was down to an almost zero return for that card on the project. That card was also not able to handle other recent tasks too well. I guess it is down to the G92 core's limitations.
	My GTS250 spec: Palit card, 65nm, G92 rev A2, Bios 62.92.7D.00.10 11.9562, CUDA 3 (better than 2.3)! GPU @745, Memory @1000MHz, Shaders @1848MHz, 754M transistors. GPUGrid temp = 66 degrees C. For reference, Einstein temp = 48 degrees C (but that barely uses the GPU)!
	System: Q9400 CPU @3.46GHz crunching other BOINC tasks (24/7, no outages as it is on a UPS), Win7 Pro 64-bit, 4GB RAM, plenty of HDD space.
	I will allow it to try another HERG task and report back tomorrow, hopefully! The GTX260 is still working well for all tasks, but that uses a GT200 A2.

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 13925 - Posted: 14 Dec 2009 \| 18:53:27 UTC - in response to Message 13875. Last modified: 14 Dec 2009 \| 18:53:55 UTC
	Toni wrote: So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.
	As Richard states, "high probability of failing" is a better description. They will occasionally complete but usually fail. On the GTX 260 and above they run fine. BTW, they often fail on the new GTS 240 and GT 240 cards too, even with their 1.2 compute capability: http://www.gpugrid.net/result.php?resultid=1592578 http://www.gpugrid.net/result.php?resultid=1590198 http://www.gpugrid.net/result.php?resultid=1610106

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 13927 - Posted: 14 Dec 2009 \| 19:16:26 UTC - in response to Message 13925.
	My GTS250 managed to complete one! http://www.gpugrid.net/result.php?resultid=1625604 The success percentage of these HERG tasks for anything less than a GTX260 seems to be poor, with the older cards being less reliable. Just because an NVidia card is new does not mean there is any new technology inside!

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 13947 - Posted: 15 Dec 2009 \| 14:17:10 UTC - in response to Message 13927.
	We are keeping an eye on the failure rate with respect to card type (in the absence of overclocking). As said, the matter is puzzling because there should be no major difference from other WU types. For now, I have reduced the number of HERG WUs out, and I may reduce their length a bit in order to increase the chances of correct termination. Almost all of the failures (i.e., errors in the "pme" or "fft" kernels) seem to be related to the infamous CUDA FFT bug, over which we have little to no control. Thanks very much for bearing with us.
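The per-card-type failure statistics mentioned in message 13947 could be tallied along these lines. This is only a sketch: the record layout (card model, WU type, success flag) and the function name `failure_rates` are assumptions made for illustration, not the project's actual tooling.

```python
from collections import defaultdict


def failure_rates(results):
    """Return {(card, wu_type): fraction of tasks that failed}.

    `results` is an iterable of (card, wu_type, succeeded) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for card, wu_type, succeeded in results:
        key = (card, wu_type)
        totals[key] += 1
        if not succeeded:
            failures[key] += 1
    # Every key in `totals` has at least one task, so no division by zero.
    return {key: failures[key] / totals[key] for key in totals}
```

Feeding it one tuple per returned task, restricted to non-overclocked hosts, would give a failure fraction for each combination of card and WU type that can then be compared across card generations.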

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 13951 - Posted: 15 Dec 2009 \| 17:19:35 UTC - in response to Message 13947.
	Toni wrote: Almost all of the failures (i.e., errors in the "pme" or "fft" kernels) seem to be related to the infamous CUDA FFT bug, over which we have little to no control.
	Could you give us a little bit more detail about this bug, as this is the first time I've heard about it? It may only be "infamous" in developer circles.
	I'm aware of an infamous bug in the BOINC CUDA application which NVidia developed for SETI@home, but that just causes certain tasks ('VLAR') to run extremely slowly, and inhibits screen re-drawing while they're running. Apart from that, SETI is an extremely heavy user of FFTs at a wide range of problem sizes, and benefits enormously from the additional capabilities of cufft v2.3: I've not come across a single SETI task which has failed because of a CUDA FFT bug.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 13953 - Posted: 15 Dec 2009 \| 17:55:52 UTC - in response to Message 13951. Last modified: 15 Dec 2009 \| 18:02:19 UTC
	It's a long-standing issue that hits older cards especially hard. Please see here or here. As for FFT being OK with SETI: there are in fact many types of FFT, and it's not surprising that the bug only manifests for some of them.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 13954 - Posted: 15 Dec 2009 \| 18:49:36 UTC - in response to Message 13953.
	Toni wrote: It's a long-standing issue that hits older cards especially hard. Please see here or here. As for FFT being OK with SETI: there are in fact many types of FFT, and it's not surprising that the bug only manifests for some of them.
	I had hoped that you would direct me to a relevant discussion here. The only thing of relevance in those threads seems to be message 12734:
	"We have contacted AGAIN Nvidia yesterday. gdf"
	That was almost three months ago, and is the very last post in the thread. Did he ever get a reply?

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 13955 - Posted: 15 Dec 2009 \| 20:08:41 UTC - in response to Message 13954.
	Perhaps the FFT bug is being compounded by a mixture of G92/65nm cores and old firmware? Reducing the work length would help, as the tasks that failed on my systems seemed to do so randomly, in terms of time. If they fail after 10 seconds it's not really a problem that affects turnover, but failing after 6 hours is not good. Ultimately, if you could match cards to work units it would resolve this issue. It might even be better than card pairing, though both could be done. Sending no hERG tasks to G92 cards would soon sort a lot of problems out.

Tom Philippart Send message Joined: 12 Feb 09 Posts: 57 Credit: 23,376,686 RAC: 0 Level Scientific publications	Message 13957 - Posted: 15 Dec 2009 \| 20:50:07 UTC
	great to see this thread!! thanks a lot!

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0 Level Scientific publications	Message 13982 - Posted: 18 Dec 2009 \| 9:49:06 UTC - in response to Message 13957. Last modified: 18 Dec 2009 \| 9:49:35 UTC
	I can only repeat what I have already said elsewhere in the forum. We have furnished a reproducer of the bug to Nvidia, and we have followed up with them several times. They said that they were looking at it. Another time, they said that technical staff were trying to find the problem and that there were discussions on what to do. But then nothing. This is common with Nvidia: we have sent several bug reproducers, but only once did they fix another FFT bug that we had reported. In my experience, they use bug reports to fix bugs on new chips, not older ones. That also makes some sense given the rate at which new GPUs are produced. So we have stopped reporting bugs for older cards. GDF

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 14077 - Posted: 29 Dec 2009 \| 23:57:17 UTC
	Had two TONI_HERGs fail. They were run on a GTX295 (single-PCB variety, so the newer model). WU 1 WU 2 Both say "Cuda error: Kernel [pme_fill_charges_overflow] failed in file 'fillcharges.cu' in line 97 : unknown error". I know there isn't much you can do if Nvidia doesn't want to fix their software.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,426,089 RAC: 210,585 Level Scientific publications	Message 14159 - Posted: 8 Jan 2010 \| 15:59:04 UTC - in response to Message 13982.
	GDF wrote: I can only repeat what I have already said elsewhere in the forum. We have furnished a reproducer of the bug to Nvidia, and we have followed up with them several times. They said that they were looking at it. Another time, they said that technical staff were trying to find the problem and that there were discussions on what to do. But then nothing. This is common with Nvidia: we have sent several bug reproducers, but only once did they fix another FFT bug that we had reported. In my experience, they use bug reports to fix bugs on new chips, not older ones. That also makes some sense given the rate at which new GPUs are produced. So we have stopped reporting bugs for older cards. GDF
	I've downloaded the Nvidia SDKs for the older CUDA versions. Are you interested in sending me the source code for the current Windows application and letting me check whether the method you use to compile it also works with the older SDKs? Or would you prefer to download those SDKs yourself? I'd expect either method to produce versions with better support for some of the older Nvidia boards, IF they don't need major source code modifications to work at all.
	I intended to learn enough CUDA to help a few BOINC projects start a GPU version, but so far it looks like I won't be ready to actually start modifying the code very soon.
	Another idea: ask the BOINC developers to add more code for reporting the GPU chip type, in order to get more information about which of the older Nvidia boards are still usable.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,426,089 RAC: 210,585 Level Scientific publications	Message 14160 - Posted: 8 Jan 2010 \| 17:46:38 UTC - in response to Message 13850.
	Toni wrote: First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page. This complex of four proteins (a tetramer) is found in many of the body's cells, most notably in heart tissue, where it plays a very important role: it conducts charged particles (potassium ions), which flow through it cyclically, ultimately governing the heartbeat.
	Since that means your software is now ready to handle a tetramer, here's some information on a trimer you're likely to be interested in as well: a trimer of the gp120 protein that the HIV-1 virus uses to enter human cells. If your software can handle docking of assorted compounds to that trimer, and choose those that dock to the trimer without too much being wasted by also docking to the single units of the gp120 protein elsewhere on the virus coat, you're likely to get the groups working on HIV/AIDS research very interested in using your software.
	At this moment, I'm having trouble getting the links from one of my other computers to this one, but I will post several related links if they look useful for you. Are you interested in getting enough grants that you will have to hire yet another researcher or two to handle them all?

Jari Pyyluoma Send message Joined: 2 Aug 08 Posts: 12 Credit: 1,165,835,704 RAC: 0 Level Scientific publications	Message 14163 - Posted: 8 Jan 2010 \| 20:23:49 UTC - in response to Message 14159.
	Well, if the problem with the Toni work units can't be solved, a workaround had better be made. In other projects the participants can choose what work they want to do. Let people choose the Toni units if they want to, for instance if they know that their hardware will not fail on them. It is reasonable to give slightly higher credit for problematic work. If you want to experiment with new work units, make it a voluntary choice.
	Right now I do not trust this project, so I supervise the downloads to remove any Tonis that might show up. This also means that my GPUs mostly run another project, something that I am unhappy with. Do not repeat the mistakes of other projects that have lost most of their donors. Do not put out work units that are of no use. Do not take chances with our time and our money. I hope you make smart decisions so I can trust you again. Because I like this project.

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 14189 - Posted: 13 Jan 2010 \| 0:19:44 UTC - in response to Message 14163.
	Jari Pyyluoma, I broadly agree with your concerns; I too am annoyed by wasting some of my efforts. Although my GTS250 has recently fared a bit better (RAC 8255), largely due to the changes made by the techs, my 8800GTS 512MB is all but lost: RAC is about 900! I know it is old kit, but there are lots of people who have old kit.
	However, on a positive note, my recent reviewing has shown that some of the new cards, although not magnificent, can still contribute substantially. The GT 240 in particular is a worthy card. Very reliable. So I would suggest to anyone who wants to continue participating: sell your old cards and buy a new one. A GT 240 can be purchased for between £60 and £80. The running costs are about one third of those of top-end compute capability 1.1 cards, so over 6 months of crunching you will save:
	Run a 9800 GTX for 6 months: 180W * £1.20 per watt-year * ½ year = £108
	Sell your card for £25 minimum, buy a new GT 240 for £65, and you spend a total of £40
	Run a GT 240 for 6 months: 60W * £1.20 per watt-year / 2 = £36
	Total cost of buying a new card and running it = £36 + £40 = £76
	So, over 6 months of crunching you would save £108 - £76 = £32
	Over a year of crunching that works out at a saving of £216 - £36 - £36 - £40 = £104
	Oh, plus you get a better card! From a network manager's point of view, the fact that you would break even within 4 months is very attractive!
	Under full load a GT 240 will use about 50 or 60 watts and give you about 6900 points per day: about 125 points per day per watt.
	Under full load a GTS 250 will use about 184 watts and give you about 8250 points per day (perhaps partially due to failures): about 44 points per day per watt!
	Even my GTX 260 sp216 (55nm version) only gets 14000 points per day, and eats up about the same watts as a GTS 250: about 76 points per day per watt.
	Given that three GT 240 cards would use less electricity than one GTS 250 and do more than twice the work, these cards are very efficient! In terms of points per watt, the GT 240 IS BY FAR the most efficient card available to GPUGrid supporters! It will also do TEN times the work of an overclocked i7-920 running at 3.8GHz and using over 300W. It is a NO BRAINER!
	Ref: http://benchmarkreviews.com/index.php?option=com_content&task=view&id=423&Itemid=72&limit=1&limitstart=11 http://www.guru3d.com/article/msi-geforce-gt-240-review-test/5
	My Stats!
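The cost and efficiency arithmetic in message 14189 can be checked with a few lines of Python. This is a minimal sketch: the electricity price, wattages, card prices and points-per-day figures are the ones quoted in that post, while the function names are hypothetical and chosen only for this example.

```python
# Electricity price quoted in the post: GBP per watt for a year of 24/7 running.
PRICE_PER_WATT_YEAR = 1.20


def running_cost(watts, years):
    """Electricity cost in GBP of running a card 24/7 for `years`."""
    return watts * PRICE_PER_WATT_YEAR * years


def upgrade_saving(old_watts, new_watts, new_price, resale, years):
    """GBP saved by selling the old card and running a new one instead."""
    keep_old = running_cost(old_watts, years)
    switch = (new_price - resale) + running_cost(new_watts, years)
    return keep_old - switch


def break_even_months(old_watts, new_watts, new_price, resale):
    """Months until the lower power bill has paid for the net purchase."""
    monthly_saving = running_cost(old_watts - new_watts, 1 / 12)
    return (new_price - resale) / monthly_saving


def points_per_day_per_watt(points_per_day, watts):
    """Crunching efficiency: daily points divided by power draw."""
    return points_per_day / watts
```

With the post's figures, `upgrade_saving(180, 60, 65, 25, 0.5)` comes to about £32 and `upgrade_saving(180, 60, 65, 25, 1.0)` to about £104, while `break_even_months(180, 60, 65, 25)` is a little over three months, in line with "within 4 months". The quoted 125 points per day per watt for the GT 240 matches `points_per_day_per_watt(6900, 55)`, i.e. it assumes the midpoint of the stated 50 to 60 W range.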

Jari Pyyluoma Send message Joined: 2 Aug 08 Posts: 12 Credit: 1,165,835,704 RAC: 0 Level Scientific publications	Message 14198 - Posted: 15 Jan 2010 \| 5:30:11 UTC - in response to Message 14189.
	Thanks, that sounds like great news. I just happened to get some Nvidia cards from a friend. Yes, I also had the feeling that I should pass them on, and now you have proven it with numbers.
	Well, I had a bunch of PS3s. I never quite understood the reasoning behind stopping that part of the project. It seems that the people running the project can change their minds from one day to the next. So buying a card just for this project is out of the question. ATI cards are better for the other projects.
	The project has been pushing those flaky Toni work units on me, and I have been aborting them. It seems that it is always someone with a 295 who finishes them. I wish this project had the backbone that folding@home has: they test their work before putting it into production, they react to feedback, and most of all, they still support the PS3. I guess the people running this project feel let down by their university and can't find it in themselves to create great work. I have a very hard time understanding what the problem with funding is for a top-notch project like this. Maybe the university is too small and insignificant to be able to make its name known, compared with a university like Stanford.

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 14201 - Posted: 15 Jan 2010 \| 14:17:45 UTC - in response to Message 14198. Last modified: 15 Jan 2010 \| 14:34:13 UTC
	Sony apparently stopped allowing the use of Linux on the PS3, which is required to run GPUGrid.
	Lots of G92 cards are struggling due to a number of things outside the control of the project, including the reliance on NVidia for code. If there is a bug in their code and the project team is not allowed to correct it, there is nothing they can do. If the project team were larger I am sure they would be better equipped to make more changes, perhaps even write tasks for the older cards, but as it is, things appear to be tight. This is a good project and needs support. To me it makes sense to sell on old parts and replace them with new parts that are better and actually work well.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14235 - Posted: 18 Jan 2010 \| 13:31:20 UTC - in response to Message 14201. Last modified: 18 Jan 2010 \| 13:32:42 UTC
	I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently provide a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code. We are working on the problem, of course; in the meantime, I've suspended the generation of the last HERG workunits (most of them were stopped before Christmas), even though this is not really a "solution": any bugs they trigger will not go away.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14924 - Posted: 2 Feb 2010 \| 15:01:16 UTC - in response to Message 14235.
	I'm sending a few more workunits of the HERG type. In the meantime, if you want to see images of what you are crunching, have a look at the Flickr page!

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,426,089 RAC: 210,585 Level Scientific publications	Message 14928 - Posted: 2 Feb 2010 \| 16:43:07 UTC - in response to Message 14235.
	Toni wrote: I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently provide a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code.
	World Community Grid allows participants to choose workunits by making different types of workunits different subprojects and allowing the participants to choose which subprojects to run. Is there any particular reason why you can't do the same, even if it requires providing the same application program under more than one name?

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14930 - Posted: 2 Feb 2010 \| 16:57:21 UTC - in response to Message 14928.
	Well, bugs may come and go with cards, drivers, and versions of the application. We try to make all WUs run equally well, rather than fork (and maintain) separate queues.

[AF>Libristes>Jip] Elgran... Send message Joined: 16 Jul 08 Posts: 45 Credit: 78,618,001 RAC: 0 Level Scientific publications	Message 14951 - Posted: 3 Feb 2010 \| 12:06:46 UTC Last modified: 3 Feb 2010 \| 12:07:19 UTC
	I also have another compute error with this type of workunit. My workunit My computer

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14968 - Posted: 3 Feb 2010 \| 18:50:48 UTC - in response to Message 14951.
	As soon as the new application is available, I'll migrate the workunits to it, fingers crossed that it will improve the situation.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14969 - Posted: 3 Feb 2010 \| 18:53:14 UTC - in response to Message 14968. Last modified: 5 Feb 2010 \| 12:03:22 UTC
	Feel free to see the new molecular images on flickr...

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 14981 - Posted: 4 Feb 2010 \| 11:11:31 UTC - in response to Message 14969. Last modified: 4 Feb 2010 \| 11:26:36 UTC
	The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience...

Snow Crash Send message Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0 Level Scientific publications	Message 15000 - Posted: 5 Feb 2010 \| 1:26:43 UTC - in response to Message 14981.
	I have had two crash, with no success, on a stable card. Thanks - Steve

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15009 - Posted: 5 Feb 2010 \| 10:22:00 UTC - in response to Message 15000. Last modified: 5 Feb 2010 \| 10:29:01 UTC
	Which of the two hosts? BTW, for those of you crunching beta, the L*_TONI_TEST WUs are the same as the HERG and HERGext ones.

[AF>Libristes>Jip] Elgran... Send message Joined: 16 Jul 08 Posts: 45 Credit: 78,618,001 RAC: 0 Level Scientific publications	Message 15017 - Posted: 5 Feb 2010 \| 11:32:09 UTC
	Another compute error with a GTX295 on this computer.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15018 - Posted: 5 Feb 2010 \| 12:00:40 UTC - in response to Message 15017. Last modified: 5 Feb 2010 \| 12:02:01 UTC
	This does not seem to be HERG-specific: you also had an error on task 1816245 of another batch.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 15052 - Posted: 7 Feb 2010 \| 9:49:01 UTC - in response to Message 13876.
	Richard Haselgrove wrote: ... No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468.
	Pleased to report that one of my 9800GT cards has successfully completed a TONI_HERGqext.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15055 - Posted: 7 Feb 2010 \| 11:28:11 UTC - in response to Message 15052. Last modified: 7 Feb 2010 \| 11:28:59 UTC
	Good!

Siegfried Niklas Send message Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0 Level Scientific publications	Message 15057 - Posted: 7 Feb 2010 \| 13:14:22 UTC
	D1s30c47-TONI_HERGqext-2-60-RND0387_0 http://www.gpugrid.net/workunit.php?wuid=1148761 <core_client_version>6.10.17</core_client_version> <![CDATA[ <message> WU download error: couldn't get input files: <file_xfer_error> <file_name>D1s30c47-TONI_HERGqext-2-conf_file_enc</file_name> <error_code>-119</error_code> <error_message>MD5 check failed</error_message> </file_xfer_error> </message> ]]>
	ID: 15057 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15058 - Posted: 7 Feb 2010 \| 15:42:50 UTC - in response to Message 15057. Last modified: 7 Feb 2010 \| 17:09:24 UTC
	Thanks, well spotted. I tried to replace a parameter on the fly. Please post if it happens again.
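	The "MD5 check failed" error above is consistent with an input file changing after its checksum was recorded. A minimal sketch of that kind of check, assuming the checksum was registered before the file was edited; the file contents and function names here are hypothetical and are not taken from BOINC or the project:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify_download(data: bytes, expected_md5: str) -> bool:
    """Accept a downloaded file only if its digest matches the recorded one."""
    return md5_hex(data) == expected_md5


# Hypothetical input file contents, for illustration only.
original = b"fft-parameter = 1\n"
registered = md5_hex(original)      # checksum recorded when the WU was created

edited = b"fft-parameter = 2\n"     # file replaced "on the fly" afterwards

print(verify_download(original, registered))  # the untouched file passes
print(verify_download(edited, registered))    # the edited file is rejected
```

	Any change to the file, however small, gives a different digest, so the download is rejected before the task can start.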
	ID: 15058 \| Rating: 0 \| rate: / Reply Quote

Siegfried Niklas Send message Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0 Level Scientific publications	Message 15060 - Posted: 7 Feb 2010 \| 17:43:34 UTC - in response to Message 14981.
	The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience... I notice a computation time of 11 h to 14.5 h on heavily overclocked GTX295 (700MHz)/GTX265 (750MHz) cards for the HERGqext. Time per step: 62.932 ms (Example). The TONI_HERGext runs take only ~6.5 h. Time per step: 37.026 ms (Example). "slightly? :-) longer computation"
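	For scale, the figures quoted in this post can be compared directly. A small sketch using only the numbers reported above; the variable names are illustrative:

```python
# Time per step reported in this post, in milliseconds.
hergqext_ms = 62.932   # TONI_HERGqext
hergext_ms = 37.026    # TONI_HERGext

# Ratio of per-step cost between the two batches.
step_ratio = hergqext_ms / hergext_ms
print(f"per-step slowdown: {step_ratio:.2f}x")

# Reported wall-clock times, in hours.
qext_hours = (11.0, 14.5)
ext_hours = 6.5
runtime_ratios = [h / ext_hours for h in qext_hours]
print("runtime ratio range:", [f"{r:.2f}x" for r in runtime_ratios])
```

	The per-step figures alone put the qext units at roughly 1.7 times the cost of the ext ones.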
	ID: 15060 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15061 - Posted: 7 Feb 2010 \| 19:11:32 UTC - in response to Message 15060. Last modified: 7 Feb 2010 \| 19:26:25 UTC
	I also noticed the increase, and that was higher than expected. This is what I was trying to fix... The new ones should be back to the norm.
	ID: 15061 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 15735 - Posted: 13 Mar 2010 \| 18:10:42 UTC
	I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32: a43-TONI_HERG77a-1-100-RND4354_0 a317-TONI_HERG79a-0-100-RND8649_1 a268-TONI_HERG79a-1-100-RND6278_1 Three different machines, three CUDA cards - two 9800GT at stock, one 9800GTX+ factory overclocked.
	ID: 15735 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 15812 - Posted: 18 Mar 2010 \| 12:43:10 UTC
	Does anyone have any idea on these? Since reporting these errors, all three cards have worked full time on GPUGrid (another refugee from SETI!), around 30 tasks completed, and with 100% success rate - including a couple of the long-running TONI_GA. But I've continued to abort TONI_HERG on sight (apologies once again to the researchers on that project) until the situation is clearer.
	ID: 15812 \| Rating: 0 \| rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 15814 - Posted: 18 Mar 2010 \| 13:13:11 UTC - in response to Message 15812.
	I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task, unless there are no other tasks.
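	A minimal sketch of how such a rule could look, assuming the server keeps per-card, per-WU-type completion counts. The `History` record, the `pick_task` function and the sample counts are hypothetical and do not describe the actual GPUGRID or BOINC scheduler:

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 0.25  # the "say, 25%" figure from the post


@dataclass
class History:
    """Hypothetical per-(card model, WU type) record kept by the server."""
    completed: int = 0
    failed: int = 0

    def failure_rate(self) -> float:
        total = self.completed + self.failed
        # With no history yet, assume the card can handle the WU type.
        return self.failed / total if total else 0.0


def pick_task(card_model, available_types, stats):
    """Return a WU type to send to this card, or None if nothing is queued."""
    if not available_types:
        return None

    def rate(wu_type):
        return stats.get((card_model, wu_type), History()).failure_rate()

    safe = [t for t in available_types if rate(t) <= FAILURE_THRESHOLD]
    if safe:
        return safe[0]
    # "Unless there are no other tasks": fall back to the least risky type.
    return min(available_types, key=rate)


# Made-up counts, for illustration only.
stats = {
    ("9800GT", "TONI_HERG"): History(completed=1, failed=3),
    ("9800GT", "TONI_GA"): History(completed=10, failed=0),
}
print(pick_task("9800GT", ["TONI_HERG", "TONI_GA"], stats))
print(pick_task("9800GT", ["TONI_HERG"], stats))
```

	Choosing the least risky type in the fallback case is one possible reading of "unless there are no other tasks"; a real scheduler would also need enough results per card model before the failure rate means much.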
	ID: 15814 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,466,430 RAC: 19,993,329 Level Scientific publications	Message 15823 - Posted: 19 Mar 2010 \| 8:23:45 UTC
	Another slipped in while I was asleep: a8-TONI_HERG77a-9-100-RND1351_1
	ID: 15823 \| Rating: 0 \| rate: / Reply Quote

X-Files 27 Send message Joined: 11 Oct 08 Posts: 95 Credit: 68,023,693 RAC: 0 Level Scientific publications	Message 15966 - Posted: 24 Mar 2010 \| 21:37:34 UTC
	Here's a bad WU: http://www.gpugrid.net/workunit.php?wuid=1282907
	ID: 15966 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 15969 - Posted: 25 Mar 2010 \| 11:29:44 UTC - in response to Message 15966.
	This bad one seems to have been created by some file transfer error. It should fail immediately.
	ID: 15969 \| Rating: 0 \| rate: / Reply Quote

mwgiii Send message Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0 Level Scientific publications	Message 16173 - Posted: 5 Apr 2010 \| 14:36:26 UTC Last modified: 5 Apr 2010 \| 14:38:51 UTC
	I have now had 3 in a row fail. 1st: http://www.gpugrid.net/result.php?resultid=2093496 2nd: http://www.gpugrid.net/result.php?resultid=2096024 3rd: http://www.gpugrid.net/result.php?resultid=2103499 I rebooted after the 1st fail. The 2nd failed after 523 seconds and the 3rd after 9.1 seconds. The failures are also putting random sparkles on my screen. Looking back at my history, I also had one fail on April 1st: http://www.gpugrid.net/result.php?resultid=2082136 All 4 have the same error message: MDIO ERROR: cannot open file "restart.coor" SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999] Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203 System: Intel Q9450 quad with Windows Vista Premium x64, Nvidia 9800 GTX+ with driver 197.13, BOINC 6.10.18.
	ID: 16173 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,383,411,148 RAC: 1,334,241 Level Scientific publications	Message 16195 - Posted: 7 Apr 2010 \| 15:55:24 UTC
	I had one similar error after crunching for 13 hours. MDIO ERROR: cannot open file "restart.coor" SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999] Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203 I do not know which GPU it failed on: either the GTS250 with 1 GB of memory or the 9800 GTX+ with 0.5 GB of memory.
	ID: 16195 \| Rating: 0 \| rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 16197 - Posted: 7 Apr 2010 \| 17:04:58 UTC - in response to Message 16195.
	A GTS250 is very similar (almost identical) to a 9800 GTX+, so it is probably not that important which card failed, unless you are getting lots of failures. The 1GB vs 500MB memory difference does not matter here.
	ID: 16197 \| Rating: 0 \| rate: / Reply Quote

ftpd Send message Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level Scientific publications	Message 16220 - Posted: 9 Apr 2010 \| 6:57:56 UTC
	Crashed after 14 hours 25 minutes. GTS 250 - driver 197.13 - Windows XP. Task 2120270. ____________ Ton (ftpd) Netherlands
	ID: 16220 \| Rating: 0 \| rate: / Reply Quote

Siegfried Niklas Send message Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0 Level Scientific publications	Message 16313 - Posted: 15 Apr 2010 \| 18:12:32 UTC Last modified: 15 Apr 2010 \| 18:33:47 UTC
	Over the past weeks I had some hERG WUs on my four 9800GT (Vista64) that stopped with an "acemd... error bubble". About 4 weeks ago I tried not clicking "OK" and instead restarting the PC (with the "error bubble" still open). After the restart, the WU resumed from the checkpoint and finished valid! I verified this behavior with 5 further WUs. Every (valid) result shows a similar "stderr out":
	.....................................................................
	# There is 1 device supporting CUDA
	# Device 0: "GeForce 9800 GT"
	# Clock rate: 1.52 GHz
	# Total amount of global memory: 519634944 bytes
	# Number of multiprocessors: 14
	# Number of cores: 112
	MDIO ERROR: cannot open file "restart.coor"
	SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
	Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
	This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
	# There is 1 device supporting CUDA
	# Device 0: "GeForce 9800 GT"
	# Clock rate: 1.52 GHz
	# Total amount of global memory: 519634944 bytes
	# Number of multiprocessors: 14
	# Number of cores: 112
	# Time per step: 69.189 ms
	# Approximate elapsed time for entire WU: 43242.851 s
	called boinc_finish
	Validate state Valid
	..........................................................................
	Last example: http://www.gpugrid.net/result.php?resultid=2158139
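	The two timing lines at the end of that output can be cross-checked against each other. A rough sketch, assuming the elapsed-time line is simply the time per step multiplied by the number of steps (an assumption; the post does not say how ACEMD derives it). The regular expressions and the `implied_steps` helper are illustrative only:

```python
import re

# The two timing lines quoted in the post above.
stderr_tail = """\
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
"""


def implied_steps(text: str) -> float:
    """Estimate the step count as elapsed time divided by time per step."""
    step_ms = float(re.search(r"# Time per step:\s*([\d.]+)\s*ms", text).group(1))
    total_s = float(
        re.search(r"# Approximate elapsed time for entire WU:\s*([\d.]+)\s*s", text).group(1)
    )
    return total_s / (step_ms / 1000.0)


steps = implied_steps(stderr_tail)
print(f"implied number of steps: {steps:,.0f}")
```

	Under that assumption the output implies roughly 625,000 steps for the whole WU, which corresponds to about 12 hours of computing at 69.189 ms per step.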
	ID: 16313 \| Rating: 0 \| rate: / Reply Quote
