Don't understand why it is failing

Message boards : Graphics cards (GPUs) : Don't understand why it is failing

Author	Message
~Stack~ Send message Joined: 10 Dec 09 Posts: 7 Credit: 77,610,772 RAC: 0 Level Scientific publications	Message 44203 - Posted: 16 Aug 2016 \| 23:28:08 UTC
	Greetings,
	I recently acquired another Nvidia GPU after a long absence during which I only crunched CPU projects. All of the other projects are doing great with the GPU. GPUGRID, however, is not. The jobs are failing with this:
	<core_client_version>7.6.31</core_client_version>
	<![CDATA[
	<message>
	process exited with code 197 (0xc5, -59)
	</message>
	<stderr_txt>
	# SWAN Device 0 :
	# Name : GeForce GTX 470
	# ECC : Disabled
	# Global mem : 1279MB
	# Capability : 2.0
	# PCI ID : 0000:02:00.0
	# Device clock : 1250MHz
	# Memory clock : 1701MHz
	# Memory width : 320bit
	#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 200
	</stderr_txt>
	]]>
	I have read and searched online but have not found anything relevant to my case. Can someone point out where things might be going wrong, please? Here is the host: https://www.gpugrid.net/show_host_detail.php?hostid=362128
	Also, another bit of relevant data if you are digging through the host logs: I picked up two NVidia cards, a 470 and a 460. After a few weeks of crunching, the 460 crapped out hard yesterday and errored out all of the GPU work the computer had queued up. It would work for an hour or two after a reboot, then die again. It has since been removed. It was only $30, so what can I expect? sigh :-) Anyway, the point is to ignore the 460 workloads; they are not relevant. The 470, however, is doing great on other projects.
	Thanks!
	ID: 44203 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 44215 - Posted: 17 Aug 2016 \| 14:22:56 UTC - in response to Message 44203.
	"#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 200"
	This error message says that the application does not include the code needed for compute capability 2.0 GPUs (reported as "version 200" above). As the GPUGrid app works with CC 3.0 to CC 5.2 cards, your card is too old for this project. I don't recommend crunching on these very old cards, as their energy efficiency is terrible compared to recent cards.
	ID: 44215 \| Rating: 0 \| rate: / Reply Quote
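To illustrate the diagnosis above, here is a minimal Python sketch that pulls the compute capability and the failing device version out of a task's stderr text and compares the capability against the CC 3.0 to CC 5.2 range quoted in this thread. The function names and the sample text are hypothetical, and the supported range is only what was stated for the app in August 2016; it is not taken from the application itself.

```python
import re

# Supported range as stated in this thread (August 2016).
# Assumption: other app versions may support a different range.
SUPPORTED_MIN = (3, 0)
SUPPORTED_MAX = (5, 2)


def parse_capability(stderr_text):
    """Return (major, minor) from a '# Capability : X.Y' line, or None."""
    match = re.search(r"#\s*Capability\s*:\s*(\d+)\.(\d+)", stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_fatal_device_version(stderr_text):
    """Return (module, device_version) from the SWAN fatal line, or None."""
    match = re.search(
        r"cannot find image for module \[(.+?)\] for device version (\d+)",
        stderr_text,
    )
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_supported(capability, low=SUPPORTED_MIN, high=SUPPORTED_MAX):
    """True if the (major, minor) capability lies inside the given range."""
    return low <= capability <= high


# Shortened copy of the stderr output posted at the top of this thread.
SAMPLE = """# Name : GeForce GTX 470
# Capability : 2.0
#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 200
"""

capability = parse_capability(SAMPLE)
fatal = parse_fatal_device_version(SAMPLE)
print(capability, fatal, is_supported(capability))
```

For the GTX 470 log, the parsed capability is 2.0, which falls below the quoted range, matching the explanation that the app ships no image for that device version.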

~Stack~ Send message Joined: 10 Dec 09 Posts: 7 Credit: 77,610,772 RAC: 0 Level Scientific publications	Message 44219 - Posted: 17 Aug 2016 \| 21:44:12 UTC - in response to Message 44215.
	Thanks for that info! I tried searching for that message but never found a good explanation of what it was trying to tell me. Out of curiosity, BOINC tells GPUGRID what card I have. Why doesn't GPUGRID throw an error /before/ it sends the work? I feel kinda bad that I took up work that just errored out like that.
	ID: 44219 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 44221 - Posted: 17 Aug 2016 \| 23:50:14 UTC - in response to Message 44219.
	"Thanks for that info! I tried searching for that message but never found a good explanation of what it was trying to tell me."
	There are a couple of useful threads for novices in the FAQ, for example: FAQ - Recommended GPUs for GPUGrid crunching. However, this error message is not listed there. Try the advanced search, and extend its time limit to get more results.
	"Out of curiosity, BOINC tells GPUGRID what card I have. Why doesn't GPUGRID throw an error /before/ it sends the work? I feel kinda bad that I took up work that just errored out like that."
	That's just sloppiness on GPUGrid's part; you should not feel bad. The same thing happens with the brand new GTX 10X0 cards: they are CC 6.1 cards and fail every workunit with the same error, yet the GPUGrid scheduler still sends them work (until their daily quota reaches 0).
	ID: 44221 \| Rating: 0 \| rate: / Reply Quote
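The pre-dispatch check asked about above could look roughly like the following Python sketch. This is not the BOINC or GPUGrid scheduler code: the function names, the supported range (taken from the CC 3.0 to CC 5.2 figure earlier in this thread), and the rule that each failed task lowers the daily quota by one are all assumptions made for illustration.

```python
def should_send_work(capability, supported_min=(3, 0), supported_max=(5, 2)):
    """Hypothetical scheduler-side check on the host's (major, minor) capability."""
    return supported_min <= capability <= supported_max


def tasks_wasted_without_check(capability, daily_quota):
    """Count tasks that fail when work is sent blindly until the quota is 0.

    Assumption: every task on an unsupported card fails, and each failure
    reduces the remaining daily quota by exactly one.
    """
    wasted = 0
    quota = daily_quota
    while quota > 0 and not should_send_work(capability):
        wasted += 1
        quota -= 1
    return wasted


# A CC 6.1 card (GTX 10X0) with a hypothetical daily quota of 5 tasks.
print(should_send_work((6, 1)))               # check rejects the card up front
print(tasks_wasted_without_check((6, 1), 5))  # tasks lost when no check is made
```

Under these assumptions, a single comparison before dispatch would avoid every failed task that is otherwise sent before the quota runs out.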

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 44228 - Posted: 18 Aug 2016 \| 22:35:33 UTC - in response to Message 44221.
	Thanks, I've updated the FAQ.
	ID: 44228 \| Rating: 0 \| rate: / Reply Quote

vseven Send message Joined: 2 Apr 18 Posts: 2 Credit: 1,132,200 RAC: 0 Level Scientific publications	Message 49296 - Posted: 17 Apr 2018 \| 12:22:02 UTC Last modified: 17 Apr 2018 \| 12:22:54 UTC
	Just an FYI: I get the same error on a shiny new Tesla V100 using Ubuntu 16.04 and NVidia driver 390.30, which is fairly new:
	<core_client_version>7.6.31</core_client_version>
	<![CDATA[
	<message>
	process exited with code 197 (0xc5, -59)
	</message>
	<stderr_txt>
	# CUDA Synchronisation mode: BLOCKING
	# SWAN Device 0 :
	# Name : Tesla V100-PCIE-16GB
	# ECC : Enabled
	# Global mem : 16160MB
	# Capability : 7.0
	# PCI ID : 94A8:00:00.0
	# Device clock : 1380MHz
	# Memory clock : 877MHz
	# Memory width : 4096bit
	# GPU [Tesla V100-PCIE-16GB] Platform [Linux] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : Tesla V100-PCIE-16GB
	# ECC : Enabled
	# Global mem : 16160MB
	# Capability : 7.0
	# PCI ID : 94A8:00:00.0
	# Device clock : 1380MHz
	# Memory clock : 877MHz
	# Memory width : 4096bit
	#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 700
	</stderr_txt>
	]]>
	Which is very unfortunate, because I could probably chew through these WUs in no time. No issues on the other projects I'm crunching (Seti, Amicable Numbers, Milkyway).
	ID: 49296 \| Rating: 0 \| rate: / Reply Quote

PappaLitto Send message Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 0 Level Scientific publications	Message 49297 - Posted: 17 Apr 2018 \| 12:37:40 UTC
	I could be wrong, but I don't think CUDA 9.0 is supported yet in this version of ACEMD, which is the application for this project.
	ID: 49297 \| Rating: 0 \| rate: / Reply Quote

vseven Send message Joined: 2 Apr 18 Posts: 2 Credit: 1,132,200 RAC: 0 Level Scientific publications	Message 49343 - Posted: 20 Apr 2018 \| 15:32:01 UTC - in response to Message 49297.
	I believe you are correct. I spun up an Ubuntu machine with a P100 and CUDA 8, and the tasks are now working.
	ID: 49343 \| Rating: 0 \| rate: / Reply Quote

Chilean Send message Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0 Level Scientific publications	Message 50597 - Posted: 25 Sep 2018 \| 12:12:42 UTC - in response to Message 49296.
	Same here. V100s aren't supported.
	ID: 50597 \| Rating: 0 \| rate: / Reply Quote

Steffen Send message Joined: 2 Mar 19 Posts: 2 Credit: 48,438,972 RAC: 0 Level Scientific publications	Message 51778 - Posted: 9 May 2019 \| 17:27:34 UTC
	The same error now shows up on a GTX 1660 Ti as well. Einstein@Home accepts the card, so that is my answer for now.
	ID: 51778 \| Rating: 0 \| rate: / Reply Quote
