GPUGrid problems, nothing has changed

Message boards : Number crunching : GPUGrid problems, nothing has changed

Author	Message
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51606 - Posted: 7 Mar 2019 | 17:10:39 UTC
	This is the third time I've gone in heavily on GPUGrid over the last 10-11 years. Twice I've gotten frustrated with the problems and cut way back. I was hoping that some of the issues would have been fixed by now. Stalling uploads (not to mention downloads) have been an ongoing problem for many years, and it's still not fixed. In addition, WUs that get interrupted often fail, even with write caching disabled on the drives. Case in point: last night we had a 3-hour power outage. When I brought the machines back up, 18 out of 25 GPUGrid WUs had failed. There was not a single failure for any WU from any other project. These failures also cause another problem: since 18 new WUs start at the same time, they finish at about the same time, and so many huge GPUGrid WUs uploading at once saturate my bandwidth for many hours. Yes, I live in the US, so my DSL connection is not fast even though it was upgraded a few months ago (only 1 provider here, how do you spell monopoly). Unbridled capitalism is a bad idea for 99.9% of the people. Anyway, the combination of poor broadband infrastructure and these long-standing GPUGrid problems sadly pushes me to cut back on this otherwise fine project. It seems to me that some of this should not be that difficult to fix, but apparently the necessary skills aren't present. BTW, these "upload storms" have been happening regularly. For someone with a faster connection and/or fewer GPUs it may not seem like a problem, but it's a problem here and I know of no way to solve it on my end. Thanks for listening to my frustration.

PappaLitto Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 2,851,549	Message 51607 - Posted: 7 Mar 2019 | 18:50:11 UTC - in response to Message 51606.
	Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51608 - Posted: 7 Mar 2019 | 19:58:46 UTC - in response to Message 51607.
	> Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?
	That would be appreciated. Thanks. Another problem with the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.

AuxRx Joined: 3 Jul 18 Posts: 22 Credit: 2,758,801 RAC: 0	Message 51609 - Posted: 7 Mar 2019 | 20:49:22 UTC - in response to Message 51606.
	I know the frustration, but ironically GPUGRID is the better project to me by a small margin. I wouldn't even know how the team at GPUGRID could fix the issues you're describing. Aren't those BOINC-related issues? I know I have seen similar issues discussed in other projects. The solution there was to run a startup script that booted the machines or restarted the clients with some delay. Alternatively, you could consider limiting the number of connections for BOINC, which would make transfers slower (unnecessarily slow at times) but more evenly distributed.

Keith Myers Joined: 13 Dec 17 Posts: 1289 Credit: 5,216,806,959 RAC: 10,896,343	Message 51610 - Posted: 7 Mar 2019 | 21:03:14 UTC - in response to Message 51608.
	> > Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?
	> That would be appreciated. Thanks. Another problem with the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.
	If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml:
	<max_file_xfers_per_project>1</max_file_xfers_per_project>
	That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51611 - Posted: 8 Mar 2019 | 1:45:15 UTC - in response to Message 51610.
	> If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml:
	> <max_file_xfers_per_project>1</max_file_xfers_per_project>
	> That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.
	Thanks, I've been meaning to try this. The problem then becomes that the CPU WUs create a huge backlog waiting while the huge GPUGrid upload stumbles along. The Ryzen 7 machines do a lot of CPU work pretty quickly. No wait, that's an option I didn't know about (per project). I will definitely try it. Thanks again!

mmonnin Joined: 2 Jul 16 Posts: 332 Credit: 4,193,721,065 RAC: 16,668,933	Message 51613 - Posted: 8 Mar 2019 | 2:00:51 UTC Last modified: 8 Mar 2019 | 2:01:08 UTC
	There is an option for the entire client and one per project. https://boinc.berkeley.edu/wiki/Client_configuration
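For reference, here is a minimal sketch of what a complete cc_config.xml with both limits might look like. The two option names are the ones documented on the client configuration page linked above; the values are illustrative assumptions, not recommendations from anyone in this thread. As far as I know the documented defaults are 8 (client-wide) and 2 (per project), but check the wiki for your client version.

```xml
<!-- cc_config.xml: goes in the BOINC data directory -->
<cc_config>
  <options>
    <!-- Client-wide cap on simultaneous file transfers, all projects combined.
         Illustrative value, chosen so other projects can still transfer. -->
    <max_file_xfers>4</max_file_xfers>
    <!-- Per-project cap: at most one transfer at a time for any single project,
         so one GPUGrid upload gets the uplink instead of a dozen competing. -->
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```

The client should pick the file up after a restart, after "Read config files" in the BOINC Manager (the menu location varies by version), or after running `boinccmd --read_cc_config`. Because the second limit applies per project, small uploads from CPU projects should not have to queue behind a large GPUGrid upload, as long as the client-wide cap leaves room for them.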

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51614 - Posted: 8 Mar 2019 | 3:56:05 UTC - in response to Message 51610.
	> <max_file_xfers_per_project>1</max_file_xfers_per_project>
	> That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.
	Seems to be helping; there's not as much stalling. Will continue to monitor.
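A rough back-of-the-envelope model of why limiting concurrent uploads can help, offered as a sketch rather than a measurement. It assumes N uploads of equal size S that share an uplink of capacity B equally, and it ignores protocol overhead, stalls, and any server-side limits. The example numbers (N = 12, 10 hours) are taken from the "dozen or more" and "upwards of 10 hours" figures earlier in the thread.

```latex
\begin{align*}
\text{all at once:} \quad & T_{\text{each}} = \frac{S}{B/N} = \frac{N S}{B} \\
\text{one at a time:} \quad & T_k = \frac{k S}{B}, \quad k = 1, \dots, N \\
\text{mean completion:} \quad & \bar{T} = \frac{1}{N} \sum_{k=1}^{N} \frac{k S}{B} = \frac{(N+1)\, S}{2B}
\end{align*}
% Example: N = 12 and N S / B = 10 h, so S / B = 50 min.
% One at a time: the first upload is done after 50 min,
% the mean completion time is 6.5 * 50 min = 325 min (about 5.4 h),
% and the last upload is done after 10 h, the same as before.
```

Under these assumptions the last upload finishes no sooner, but most results are reported hours earlier, which is what matters for the 24-hour cutoff mentioned above.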

PappaLitto Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 2,851,549	Message 51634 - Posted: 15 Mar 2019 | 17:14:59 UTC - in response to Message 51607. Last modified: 15 Mar 2019 | 17:15:56 UTC
	> Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?
	I recall what Zoltan once told me. Go into Device Manager / Disk drives / the drive BOINC is on / Policies, uncheck "enable write caching on this device", reboot, and you should be all set.

Erich56 Joined: 1 Jan 15 Posts: 1091 Credit: 6,837,907,676 RAC: 16,423,776	Message 51635 - Posted: 15 Mar 2019 | 17:30:56 UTC - in response to Message 51634.
	> I recall what Zoltan once told me. Go into Device Manager / Disk drives / the drive BOINC is on / Policies, uncheck "enable write caching on this device", reboot, and you should be all set.
	Yes, this was/is exactly it.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51640 - Posted: 18 Mar 2019 | 15:35:24 UTC - in response to Message 51634.
	> I recall what Zoltan once told me. Go into Device Manager / Disk drives / the drive BOINC is on / Policies, uncheck "enable write caching on this device", reboot, and you should be all set.
	I've been unchecking that for years. Yes, it helps, but it didn't help with the power outage and the 18 failed WUs that I described in the OP. All the drives on all my BOINC machines had write caching disabled.

PappaLitto Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 2,851,549	Message 51642 - Posted: 18 Mar 2019 | 21:22:07 UTC
	Interesting, it seemed to eliminate the problem for me when I enabled it

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51644 - Posted: 19 Mar 2019 | 15:28:20 UTC - in response to Message 51642.
	> Interesting, it seemed to eliminate the problem for me when I enabled it
	I also believed that before March 7th. Then I was educated x 18. However, it does help when write caching is disabled. One related thing I've found is that when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs, as they seem to survive that situation. Knock on wood... ;-)

Erich56 Joined: 1 Jan 15 Posts: 1091 Credit: 6,837,907,676 RAC: 16,423,776	Message 51645 - Posted: 19 Mar 2019 | 17:47:01 UTC - in response to Message 51644.
	> ... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...
	How do you educate Win10 to wait long enough until the GPUGRID task stops? Even if a GPUGRID task is manually stopped in the BOINC Manager, it takes up to a minute until it actually stops.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51646 - Posted: 20 Mar 2019 | 4:07:18 UTC - in response to Message 51645. Last modified: 20 Mar 2019 | 4:08:40 UTC
	> > ... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...
	> How do you educate Win10 to wait long enough until the GPUGRID task stops? Even if a GPUGRID task is manually stopped in the BOINC Manager, it takes up to a minute until it actually stops.
	I have no idea. My observation is that SO FAR, with 5 Win10 machines running 3 GPUGrid WUs each, I haven't had any WUs fail when Win10 decides to reboot to process updates. This has happened quite a few times. Maybe I've just been lucky, maybe not.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 51696 - Posted: 12 Apr 2019 | 19:39:08 UTC - in response to Message 51610.
	> > > Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?
	> > That would be appreciated. Thanks. Another problem with the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.
	> If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml:
	> <max_file_xfers_per_project>1</max_file_xfers_per_project>
	> That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.
	Thanks again for this. It allowed me to keep more GPUs on the project, though I never could get them all shoehorned into my paltry UL bandwidth. Now, with the rise of mostly KIX WUs and nearly double the UL size, I have the problem again. Maybe someday my area will have better connectivity. For now I've had to transfer many of my GPUs to projects with lower UL requirements. I very much like GPUGrid but have to lighten up on it for now. Keep up the great work! I'll keep running what I'm able to here.

Helix Von Smelix Joined: 13 Aug 08 Posts: 7 Credit: 553,436,863 RAC: 0	Message 51701 - Posted: 16 Apr 2019 | 19:03:18 UTC - in response to Message 51606.
	UPS.

flashawk Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0	Message 51703 - Posted: 17 Apr 2019 | 16:07:57 UTC
	Extremely slow uploads here (Menlo Park, CA) at 9:00 AM Pacific time. I have 100 Mbps down and 40 Mbps up, and my connection is working perfectly according to a speed test I just did. I've noticed this happens only about 25% of the time for me, but it is a major pain uploading at 300 Kbps.

Billy Ewell 1931 Joined: 22 Oct 10 Posts: 37 Credit: 968,401,174 RAC: 139,365	Message 52521 - Posted: 24 Aug 2019 | 4:04:27 UTC Last modified: 24 Aug 2019 | 4:21:31 UTC
	In the past, my equipment was at times excluded from GPUGRID because of lower-quality, low-performing cards. So I finally broke down a few days back and bought an EVGA RTX 2080 in anticipation of crunching along with the "Big Boys." And of course, quite naturally, over the last couple of days I was able to download a dozen tasks that require 8-12 hours on the fastest cards. And if failure is success, then I succeeded perfectly: every task errored out, with the shortest run time being 8.11 seconds and the longest 14.71 seconds before failure. I started crunching with driver 436.02 and changed to 431.60 before the last task failed, doing a clean install of the second driver. Equipment is visible. I looked on the performance page and do not see a performance record for the RTX 2080, and my cursory look at the task results did not show a wingman having processed a task with a 2080. So what do I do? These tasks come few and far between. BTW, my other machine with a GTX 1060 has processed all available tasks without a failure.

PDW Joined: 7 Mar 14 Posts: 15 Credit: 1,192,827,525 RAC: 13,330,730	Message 52522 - Posted: 24 Aug 2019 | 7:59:41 UTC - in response to Message 52521.
	Your 2080 isn't supported yet, see here for more details... http://gpugrid.org/forum_thread.php?id=4952

ServicEnginIC Joined: 24 Sep 10 Posts: 566 Credit: 6,305,202,024 RAC: 16,625,334	Message 52523 - Posted: 24 Aug 2019 | 8:24:08 UTC
	> Your 2080 isn't supported yet, see here for more details... http://gpugrid.org/forum_thread.php?id=4952
	Here are the new Nvidia Turing series GPUs:
	- NVIDIA TITAN RTX
	- RTX 2080 TI
	- RTX 2080 SUPER
	- RTX 2080
	- RTX 2070 SUPER
	- RTX 2070
	- RTX 2060 SUPER
	- RTX 2060
	- GTX 1660 TI
	- GTX 1660
	- GTX 1650
	They will all fail every WU of the current ACEMD version within a few seconds of starting. The reason: Turing GPUs are not supported so far. The GPUGrid team is developing a new ACEMD3 version that is likely to support Turing GPUs. I have a GTX 1660 TI and a GTX 1650 (impatiently) waiting for this ;-)

Billy Ewell 1931 Joined: 22 Oct 10 Posts: 37 Credit: 968,401,174 RAC: 139,365	Message 52524 - Posted: 24 Aug 2019 | 14:19:39 UTC - in response to Message 52522.
	PDW: Thank you! Bill
