Message boards : Graphics cards (GPUs) : Opinions please: ignoring suspend requests
Author | Message |
---|---|
As I understand things, if "the use at most CPU %" is set at anything less than 100%, the BOINC client repeatedly suspends the app. Judging by the number of complaints of slow performance I see here that are attributable to this, it's a problem that seems to catch out many users. | |
ID: 33184 | Rating: 0 | rate: / Reply Quote | |
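To make the cost of that throttling concrete, here is a rough sketch in Python. It assumes the simplest possible duty-cycle model, in which a throttled task makes progress for the configured fraction of wall-clock time and sits suspended for the rest. The function name and the model are illustrative assumptions, not a description of how the BOINC client is actually implemented.

```python
def throttled_wall_time(compute_hours: float, cpu_time_percent: float) -> float:
    """Estimate wall-clock hours for a task under a simple duty-cycle model.

    Assumption: the task only makes progress for `cpu_time_percent` of the
    elapsed time and is suspended otherwise. Suspend/resume overhead is ignored.
    """
    if not 0 < cpu_time_percent <= 100:
        raise ValueError("cpu_time_percent must be in (0, 100]")
    return compute_hours / (cpu_time_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical 8-hour task at three different "use at most % CPU time" values.
    for percent in (100, 80, 50):
        hours = throttled_wall_time(8.0, percent)
        print(f"{percent:3d}% CPU time -> about {hours:.1f} h wall clock")
```

Under this model anything below 100% stretches the run time in inverse proportion, which would be consistent with the slow-performance complaints; real suspend/resume cycles on a GPU app are likely to cost more than this idealised figure.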
Hi Matt, | |
ID: 33185 | Rating: 0 | rate: / Reply Quote | |
I would, reluctantly, support that approach as a temporary stop-gap only. | |
ID: 33186 | Rating: 0 | rate: / Reply Quote | |
Even the down clocking of the GPU clock as a WU has failed or terminated to prevent, if this is something in your power to do so of course. Downclocking is a safety feature built into both hardware and drivers since about the Fermi launch. I don't think anyone should try to circumvent that - it would be like welding shut the safety valve on a steam boiler. I'm sure downclocking a GPU is a source of irritation when it happens, but better to address the cause than to close your eyes to the symptom. Unlike a steam boiler, a molten GPU is hardly likely to kill anybody - but it could be expensive to replace. | |
ID: 33188 | Rating: 0 | rate: / Reply Quote | |
I agree with you Richard, however this down clocking has occurred frequently on my rig with the 660 since the introduction of the "termination to prevent hang up". I am not sure that it has to do with that, but it seems so. Now I have to reboot my rig every other day to solve this, and I am not always there to do so. But we are going off topic, so I'll stop about it. | |
ID: 33190 | Rating: 0 | rate: / Reply Quote | |
[quote]I would, reluctantly, support that approach as a temporary stop-gap only[/quote] | |
ID: 33194 | Rating: 0 | rate: / Reply Quote | |
I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit. | |
ID: 33195 | Rating: 0 | rate: / Reply Quote | |
With the current application this should no longer be occurring. The application defers suspension and termination until a safe point. The BOINC library is no longer able to asynchronously suspend the process, since that was a significant cause of instability. Matt | |
ID: 33197 | Rating: 0 | rate: / Reply Quote | |
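As an illustration of what "deferring suspension until a safe point" can look like, here is a minimal Python sketch. It is not the GPUGRID application's code; the class `DeferredSuspend`, the function `run`, and the idea of one "step" being an indivisible unit of work are all hypothetical names and assumptions chosen for the example.

```python
import threading


class DeferredSuspend:
    """Records suspend requests and honours them only at safe points."""

    def __init__(self) -> None:
        self._requested = threading.Event()
        self.log = []

    def request_suspend(self) -> None:
        # May be called asynchronously (for example from a client-facing
        # thread). It only sets a flag and never interrupts the worker.
        self._requested.set()

    def request_resume(self) -> None:
        self._requested.clear()

    def safe_point(self, step: int) -> bool:
        """Call between work units. Returns True if the worker should pause."""
        if self._requested.is_set():
            self.log.append(f"suspended after step {step}")
            return True
        return False


def run(steps: int, controller: DeferredSuspend, suspend_at: int) -> int:
    """Run up to `steps` work units; return how many completed before pausing."""
    done = 0
    for step in range(steps):
        if step == suspend_at:
            controller.request_suspend()  # simulate a request arriving mid-step
        # ... one indivisible unit of work would happen here ...
        done += 1
        if controller.safe_point(step):
            break
    return done


if __name__ == "__main__":
    controller = DeferredSuspend()
    completed = run(10, controller, suspend_at=3)
    print(completed, controller.log)
```

The point of the pattern is that the step in progress when the request arrives always finishes, so the worker is never stopped halfway through an operation.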
Well it happened yesterday twice on my rig with the 660, using app 8.14 (cuda55). ____________ Greetings from TJ | |
ID: 33199 | Rating: 0 | rate: / Reply Quote | |
OK TJ, could you please elaborate on the circumstances, in a private message to me. | |
ID: 33200 | Rating: 0 | rate: / Reply Quote | |
You should not ignore the request to suspend/resume, when benchmarking or when being CPU-Throttled. | |
ID: 33208 | Rating: 0 | rate: / Reply Quote | |
You should not ignore the request to suspend/resume, when benchmarking or when being CPU-Throttled. I'm going to err on the side of caution here, and agree with Jacob. It's up to the user to get their settings straight, and not the devs to override potentially "bad" settings without user consent. That is if I do understand what we're talking about correctly. :) P.S. No matter how much trouble it's causing, in this case, it's up to the user to fix. | |
ID: 33211 | Rating: 0 | rate: / Reply Quote | |
I'd also like to add another opinion: | |
ID: 33227 | Rating: 0 | rate: / Reply Quote | |
My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here. | |
ID: 33229 | Rating: 0 | rate: / Reply Quote | |
My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here. I can't see your computers - they are hidden - but could you expand further? Does your Titan actually use 82.7% CPU - i.e., is CPU time recorded for your results 82.7% of total runtime? The average figure for my Kepler is nearer 98.5%, but for my Fermi more like 7.9%. BOINC Manager displays a status line like 'Running (X.XX CPUs + Y.YY NVIDIA GPUs)'. This, like many of BOINC's numbers, is a server-generated estimate, and bears no relationship to the actual observed behaviour of your own CPU/GPU combination. But it is this estimate which will in the end be compared with David's arbitrary 0.5 CPU threshold "to throttle, or not to throttle". It would be best to devote our efforts to educating David Anderson about the behaviour and requirements of real GPUs in the real world, and get him to program BOINC accordingly - remembering not to pretend that the best settings for any one single project are necessarily correct across the board. | |
ID: 33230 | Rating: 0 | rate: / Reply Quote | |
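The distinction between the server's estimate and the observed CPU use can be sketched in a few lines of Python. The function names are hypothetical, and the comparison rule (a task is throttled when its estimate exceeds the 0.5 threshold) is an assumption read off this thread, not taken from the BOINC source.

```python
CPU_THRESHOLD = 0.5  # the threshold quoted in this thread, taken as given


def observed_cpu_fraction(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Fraction of a core actually used: recorded CPU time over run time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


def subject_to_throttling(estimated_cpus: float,
                          threshold: float = CPU_THRESHOLD) -> bool:
    """Assumed rule: the decision uses the server estimate, not observed use."""
    return estimated_cpus > threshold


if __name__ == "__main__":
    estimate = 0.827  # the "X.XX CPUs" figure reported for the TITAN above
    # Made-up timings chosen to match the 98.5% and 7.9% figures quoted above.
    for label, cpu_s, run_s in (("Kepler", 985.0, 1000.0),
                                ("Fermi", 79.0, 1000.0)):
        used = observed_cpu_fraction(cpu_s, run_s)
        print(f"{label}: observed {used:.1%}, "
              f"throttled on estimate: {subject_to_throttling(estimate)}")
```

If that assumed rule is right, two cards with very different real CPU use would be treated identically as long as they share the same estimate.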
I expect the 82.7% is just what's being reported in Boinc Task Properties (and what is used by the scheduler).
Use GPU while computer is in use (when not ticked); While processor usage is less than (anything other than zero); Use at most % CPU time (anything other than 100%)
| |
ID: 33231 | Rating: 0 | rate: / Reply Quote | |
I expect the 82.7% is just what's being reported in Boinc Task Properties Reply: That is correct. When the task is suspended it will initially read as "Scheduler wait: access violation" and then later as "waiting to run" while other GPUGRID tasks will now run. If a non-Noelia task uses significantly less than a full CPU (as in <95%) then it suggests there might be a problem - the CPU is being overused (by other apps), the CPU is not capable of supporting the GPU fully (architecture or setup), or some other hardware is limiting the GPU (PCIE, RAM speeds, clogged up drive). Reply: This may be true, but when app 8.13 with the 326.80 beta driver was being used this never happened. Should I use an app info file and force it to carve out 1 CPU for the video card? There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently: While computer is in use (when not ticked) - Reply: ticked. Use GPU while computer is in use (when not ticked) - Reply: ticked. While processor usage is less than (anything other than zero) - Reply: is zero. Use at most % CPU time (anything other than 100%) - Reply: 100%. Maybe using some UPS's and having 'While computer is on batteries' unselected could cause issues, or if a laptop battery is misreporting its connection state (OS or OS driver bugs). Reply: This is ticked or selected. In my experience some default settings and recommended settings at other projects are unsuitable for crunching at GPUGrid. It would be useful if there was some sort of way to automatically select recommended settings from your online account (or a Project Button in BM). System information: Boinc version 7.0.28; Genuine Intel(R) Xeon(R) CPU 1366 @ 3.20GHz [Family 6 Model 44 Stepping 2] (12 processors); [2] NVIDIA GeForce GTX TITAN (4095MB) driver: 327.23; Microsoft Windows 8 x64 Edition, home edition; 12 gigs ram. This behavior occurs when Asteroids@home, PrimaBoinca or FightMalaria@home are the CPU projects. This is a dedicated cruncher with no other work being processed. 
Any ideas are appreciated. Thank you. | |
ID: 33233 | Rating: 0 | rate: / Reply Quote | |
There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently: Anybody reading and following that advice needs to be reminded that there are two different ways of setting computing preferences. 1) Via the Computing preferences link on your account page 2) Via the Tools|Computing preferences... menu in BOINC Manager. Some web-based preference setting tools have the opposite sense to the list you quote: for example, the first one reads "Suspend work while computer is in use?" when using the web-based preference setting. To avoid suspending tasks while you are using the computer, uncheck (untick) the web setting, or check (tick) the local setting. And so on. Whichever set you use, read the wording carefully. Users should try to be clear in their own mind whether they are using web settings or local settings, and use one technique exclusively. In particular, even simply viewing the settings locally, and closing the dialog by clicking the OK button, creates a complete snapshot of the current active settings and uses it exclusively from that point onwards: future web-based preference changes will be ignored. | |
ID: 33234 | Rating: 0 | rate: / Reply Quote | |
Raymond: not sure if you understood it correctly. Whatever CPU usage number the BOINC manager displays has absolutely no effect on how GPU-Grid runs. The app takes its core, no matter what. | |
ID: 33236 | Rating: 0 | rate: / Reply Quote | |
@MJH: obeying the suspend requests is surely the safe way. I'm not sure what unwanted side effects an override may have now and in the future. | |
ID: 33237 | Rating: 0 | rate: / Reply Quote | |
Raymond, I agree with the suggestions made by MrS and reducing the power target is certainly worth a go. This behavior occurs when Asteroids@home, PrimaBoinca or FightMalaria@home are the CPU projects. This is a dedicated cruncher with no other work being processed. Does this 'start-stop' behavior occur when you are not using the CPU, or when you run different CPU projects (than those you mention)? A 12 thread processor that's maxed out could conceivably struggle to feed a Titan. Have you tried setting Boinc to use 10 threads (83.4%)? When your CPU is fully used, the draw on the PSU is probably another 100W or so. If it's not the power draw itself then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler? To test if power/overuse of the CPU is the issue, stop running CPU tasks, and just run GPU tasks. To test if heat is the issue, take the side of the case off and run a WU or two. Might Boinc 7.0.28 be an issue? ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help | |
ID: 33251 | Rating: 0 | rate: / Reply Quote | |
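The 83.4% figure for 10 of 12 threads can be checked with a little arithmetic. The Python sketch below assumes the client rounds the resulting thread count down, which would explain why 83.4% is suggested rather than 83.3%; that rounding rule and both function names are assumptions made for illustration.

```python
import math


def threads_allowed(total_threads: int, percent: float) -> int:
    """Threads usable at a given percentage, assuming the count is rounded down."""
    # The tiny epsilon guards against floating-point results such as 5.999999.
    return math.floor(total_threads * percent / 100.0 + 1e-9)


def min_percent_for_threads(wanted: int, total_threads: int,
                            decimals: int = 1) -> float:
    """Smallest percentage, rounded up to `decimals` places, giving `wanted` threads."""
    exact = 100.0 * wanted / total_threads
    scale = 10 ** decimals
    return math.ceil(exact * scale - 1e-9) / scale


if __name__ == "__main__":
    percent = min_percent_for_threads(10, 12)
    print(percent, threads_allowed(12, percent), threads_allowed(12, 83.3))
```

Since 10/12 is 83.33...%, a setting of 83.3% would fall just short of 10 threads under round-down behaviour, while 83.4% clears it.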
Might Boinc 7.0.28 be an issue? BOINC v7.0.28 doesn't suffer from the "suspend GPU apps when applying thermal throttling to CPUs" bug that started this thread off. It has other weaknesses, but I don't think they would be relevant in this case. I'd leave BOINC unchanged for the time being - work on the other possible causes, one at a time, until you find out what it is. | |
ID: 33252 | Rating: 0 | rate: / Reply Quote | |
Raymond, I agree with the suggestions made by MrS and reducing the power target is certainly worth a go. I reduced the power and the temperature target to 69C, and at 69C and 71C the "stop and go" occurred. When your CPU is fully used, the draw on the PSU is probably another 100W or so. If it's not the power draw itself then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler? I have a 1250W PSU attached to this computer. The CPU is internally water cooled. I have had the side of the case removed since June 2013 to generate maximum air flow and the computer is near an open window. I will add a small desktop fan and position it to direct air flow towards the cards. I have reduced the CPU usage so that only GPUGRID tasks are running and no other CPU tasks are being run. The "stop and go" event still occurred with zero CPU tasks being run or being listed in the BOINC client. The BOINC local and project web settings for any projects are always the same as I dislike potential conflicts. Yes, I have learned my lesson on that issue the hard way a while back. The one possibility that was raised is that the CPU is being maxed out and could conceivably struggle to feed a Titan. I will run the CPU without hyper threading as that is the only other possibility that has not been explored. | |
ID: 33266 | Rating: 0 | rate: / Reply Quote | |
Sounds like you should be pretty safe, hardware-wise. But, reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing that "client suspended (user request)" message often in your tasks and they take quite long, about twice as long as the CPU time they required? | |
ID: 33315 | Rating: 0 | rate: / Reply Quote | |
Sounds like you should be pretty safe, hardware-wise. But, reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing that "client suspended (user request)" message often in your tasks and they take quite long, about twice as long as the CPU time they required? When the task is suspended by the computer I see "Scheduler wait: access violation" and then later "waiting to run" while other GPUGRID tasks run. So the BOINC client window will show one or two GPU tasks in "waiting" mode, partially run, while the other two tasks are being crunched. I do not see any "client suspended (user request)" description. I have recently seen a message in the BOINC message window to the effect that if this keeps occurring (the "stop and start" process) I should reset the project, and that is my next step. The actual tasks take approximately the same amount of GPU/CPU crunch time whether they are crunched continuously or with this "stop and start" issue. Because tasks are stopped by "Scheduler wait: access violation" and then sit as "waiting to run", the "wall clock" time becomes much longer: I am now crunching three or four tasks at once and switching back and forth between them, instead of crunching two tasks continuously at the same time. | |
ID: 33348 | Rating: 0 | rate: / Reply Quote | |
It's been discovered in a news thread that 331.40 beta fixes the access violations. Give it a try! | |
ID: 33382 | Rating: 0 | rate: / Reply Quote | |
It's been discovered in a news thread that 331.40 beta fixes the access violations. Give it a try! Installed 331.40 beta and no issues to report. It would appear this situation may be resolved. Thank you for posting this information here. | |
ID: 33387 | Rating: 0 | rate: / Reply Quote | |
Regarding the downclocking TJ and others experienced, | |
ID: 33389 | Rating: 0 | rate: / Reply Quote | |
If you find nVidia bugs (like cannot use nVidia Control Panel when no monitor is connected, or performance degradation occurs on Adaptive Performance even when under full 3d load from a CUDA task)... Please report your issues in the nVidia driver feedback thread on their forums! Sometimes they are responsive. | |
ID: 33390 | Rating: 0 | rate: / Reply Quote | |
Skgiven, I think that what we see in the stderr file is the boost value of the GPU and not the actual value or the value it ran at. | |
ID: 33395 | Rating: 0 | rate: / Reply Quote | |
The clock Matt is displaying seems to be the "Boost Clock" property of a GPU. It may be what nVidia think a typical boost clock in games will be for the card. For my GTX660Ti I've got: | |
ID: 33402 | Rating: 0 | rate: / Reply Quote | |
The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt. | |
ID: 33403 | Rating: 0 | rate: / Reply Quote | |
It always seems to be what GPU-Z displays as the Default Clock Boost. Not any over/underclock done with EVGA Precision or actual boost. The GPU-Z pop up description simply states: "Shows the default turbo boost clock frequency of the GPU without any overclocking." | |
ID: 33406 | Rating: 0 | rate: / Reply Quote | |
Yes, if you use GPUZ (alas only available on Windows) under the Graphics Card tab it will show what I think is the 'maximum boost clock'; an estimate of what the boost would be if the power target, Voltage and GPU utilization were 100% (I think). | |
ID: 33409 | Rating: 0 | rate: / Reply Quote | |
For my eVGA GTX 660 Ti 3GB FTW (factory overclocked), I use a custom GPU Fan profile (in Precision-X), so while working on GPUGrid, the temperatures almost always drive the fan to its maximum speed (80%). The clock is usually at 1241 MHz, and the Power % is usually around 90-100%. But there are times where the temperature still gets above 72*C, in which case the clock goes down to 1228 MHz. Also, there are times when a task really stresses the GPU (such that it would normally downclock to not exceed the 100% power target, even if temps are below 70*C), so in Precision-X, I have set the power target to its maximum, 140%. So, in times of considerable stress, I do sometimes see it running at 1241 MHz, at ~105% Power, while under 72*C. Regarding reported clock speeds, even though the GPU is usually at maximum boost (1241 MHz), I believe GPUGrid task results always report 1124 MHz (which I believe may be the "default boost" for a "regular" GTX 660 Ti), which is fine by me. | |
ID: 33411 | Rating: 0 | rate: / Reply Quote | |
Below outputs from GPU-Z as compared to the stderr output. | |
ID: 33412 | Rating: 0 | rate: / Reply Quote | |
The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt. For Keplers it's the "Boost clock" defined in their BIOS (hence it's not affected by the actual boost or downclocking), whereas for older cards the shader clock is reported (by the runtime). Not sure how GPU-Z and others are doing it, but I suspect the runtime can also report the actual instantaneous clock which we're interested in. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 33425 | Rating: 0 | rate: / Reply Quote | |
This thread has digressed some way from the original topic. | |
ID: 33440 | Rating: 0 | rate: / Reply Quote | |