Message boards : Graphics cards (GPUs) : 2x GPU - One GPU Errors Frequently

Redirect Left
Joined: 8 Dec 12
Posts: 23
Credit: 181,940,893
RAC: 10
Message 50395 - Posted: 5 Sep 2018 | 0:59:46 UTC
Last modified: 5 Sep 2018 | 1:18:18 UTC

Hi there.

I am currently running two GPUs, a GTX 760 and a GTX 670. Neither is amazing for number crunching, but they're decent for games and my PC is on 24/7, so spare time is donated to various BOINC projects and it doesn't matter if tasks take a while.

Anywho, tasks on the GTX 670 have a habit of erroring out for an unknown reason. An example of one of these tasks is here: http://www.gpugrid.net/result.php?resultid=18676941 - the error is related to the simulation becoming unstable.

It isn't a PSU / power issue: the draw on the PSU is less than 65% of its output on the rails related to graphics. I've tried re-installing the drivers, which didn't fix anything. Both GPUs are 2GB GDDR5 VRAM editions.

Is it possible the GTX 670 isn't properly supported by GPUGrid anymore? If this is the case, is it possible to exclude the 670 from GPUGrid, but continue using the 760?

- Cheers!

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50396 - Posted: 5 Sep 2018 | 8:02:27 UTC - in response to Message 50395.

I have a GTX 750 Ti on a Linux box and a GTX 1050 Ti on a Windows 10 PC, neither overclocked. On the Linux box the GPU temperature reaches at most 63 C; on the Windows PC it reaches 80 C and then it crashes with an error message similar to yours.
Tullio

Keith Myers
Joined: 13 Dec 17
Posts: 1340
Credit: 7,652,966,070
RAC: 13,485,092
Message 50410 - Posted: 5 Sep 2018 | 20:34:32 UTC

First thing I would do is to increase the fan speed on the 670 to 100% and see if the errors reduce.

The other option would be to run another BOINC client and exclude the 670 GPU in the cc_config.xml file.

<ignore_nvidia_dev>N</ignore_nvidia_dev>
Ignore (don't use) a specific NVIDIA GPU. You can ignore more than one. Replaces <ignore_cuda_dev/>. Requires a client restart.
Example: <ignore_nvidia_dev>0</ignore_nvidia_dev> will ignore the first NVIDIA GPU in the system.
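
For reference, here is a minimal sketch of where that option sits in a complete cc_config.xml. The device number 1 is only an assumption for illustration; check the GPU numbering the BOINC client reports in its event log at startup to see which number the GTX 670 actually has. The commented-out part shows `<exclude_gpu>`, a separate cc_config.xml option that was not mentioned above: it keeps a GPU away from one project only, which may be closer to what was asked. The `<url>` there is assumed from the links in this thread and has to match the project URL exactly as the client lists it.

```xml
<cc_config>
  <options>
    <!-- Option A: ignore this GPU for ALL projects (the option quoted above).
         Device number 1 is an assumption; use the number your client reports. -->
    <ignore_nvidia_dev>1</ignore_nvidia_dev>

    <!-- Option B: exclude this GPU for ONE project only. Use instead of Option A.
         URL and device number are assumptions; adjust both to your setup.
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
    </exclude_gpu>
    -->
  </options>
</cc_config>
```

With Option A the 670 is hidden from every project on that client. With Option B the 760 keeps crunching GPUGrid while the 670 stays available to other projects. Restart the client after editing the file, as noted above for `<ignore_nvidia_dev>`.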

3de64piB5uZAS6SUNt1GFDU9d...
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50411 - Posted: 5 Sep 2018 | 21:02:19 UTC

In addition to the suggestions already posted: what if you let MSI Afterburner limit the GPU temperature to 70°C max, provided that the temperature of this card shows up in there? If not, reduce both the GPU and memory clocks manually by maybe 50 MHz and see where the temperature ends up.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 50515 - Posted: 14 Sep 2018 | 21:24:37 UTC - in response to Message 50395.

... tasks on the GTX 670 have a habit of erroring, with an unknown reason. An example of one of these tasks is here; http://www.gpugrid.net/result.php?resultid=18676941 - the error is related to the simulation becoming unstable.
This is the typical error message when the GPU clocks are too high for the given GPU temperature.
You should reduce the GPU clock speed of your GTX 670 (or its power target in MSI Afterburner).
Judging by the stderr.txt of your tasks, your other GPU goes up to 93°C, which is dangerously high. This surely reduces the lifetime of your card.
You should increase the airflow to that card. If the two cards are next to each other, I strongly recommend physically removing the older card.
