Hi there!
When crunching for GPUGRID, my computer reboots, sometimes after 30 min, sometimes after 2-5 hours.
The computer turns off with all components still running (except the NVIDIA card I assume) and starts again on its own after ~30sec.
-Since this does not happen when crunching Seti-Cuda WUs, it might be GPUGRID's fault.
-On BOINC 6.4.x this didn't happen often. Back then I associated the reboots with the times when 4 CPU + 1 GPU (+0.25 CPU) tasks were running, but I'm not sure.
# Now I'm running 3 WU on the CPU + 1 GPUGRID CUDA. -> 1hour no reboot so far.
My question: If the above does not help, can I expect a solution from any of the following?
- Right now 181.20 CUDA driver seems to be installed. Seems to be CUDA 2.1 huh?
# I might try the CUDA 2.0 driver 178.08
# I always wondered why I can't use the 182.50 WHQL driver, I might try that too.
# I'll try to complete the p700000-GIANNI WU (lost 3 hours of progress due to one of those reboots). Maybe the syP9764-SH2_US WU runs better?
# My 5 month old NVIDIA might be broken so it has problems running GPUGRID CUDAs. Seti will be happy if this is the case. ^^
# I won't update NVIDIA bios - solution must be something different.
You can see my system specs right? In addition to that -
CPU is not overclocked (60°C)
GPU overclocked by default. Reboots also happened after downclocking to NVIDIA's GTX280 reference specs. (68-78°C)
And if you can't find my system specs: -_-
Vista Ultimate 64
AMD Phenom X4 9950BE
4GB DDR2-Ram
The reboots happened during browsing the internet and listening to mp3. No Games, no 3D-Programs, no CPU intensive processes besides Boinc.
If comp reboots again while running only 3 CPU + 1 CUDA WU, I'll post it.
Thanks for reading and thanks for your help in advance. |
|
|
|
Good, it seems I'm the only one with this problem :) I feel so special. hehe
Now, 3 hours later, the p700000-GIANNI_ WU completed, fortunately without error in the outcome.
For the record: that makes 4 hours without a reboot.
I intend to crunch the syP9764-SH2_US_ WU together with 3 CPU WUs. If it works I hope to get another syP... WU to run it along with 4 CPU WUs to see if there is a difference to p700000.. WUs.
- If it reboots even with only 3 CPU WUs, it might be a compatibility issue between GPUGRID and Rosetta, and/or Seti and/or World Community Grid.
# Just checked the PSU - If it is not broken, it should give just enough Power for my system (650 Watt, no USB coffee machine in use)
# No Windows update was performed before reboots happened.
# All Windows updates installed.
# Screensaver is deactivated.
My Gigabyte Mainboard MA-790X-DS4 was cheap, so if it can't cope with the bandwidth needed for GPUGRID to run on a GTX280 card, that would be the most 'appreciated' issue (besides a driver problem). |
|
|
|
The syP9764.. WU completed without any reboot with 3 CPU WUs running.
Sadly, while crunching p1040000-GIANNI_ the computer rebooted at 97.xx% progress and resumed at 95.xx%.
# I noticed that Windows Update had just installed an update for Windows Defender.
# System Restore Points were created.
# Almost at the same time, iTunes downloaded mp3s.
I'm not sure if this causes problems. Likely not, is my guess.
A kh29119-Jan2 WU is next - testing along with 3 CPU WUs also. |
|
|
|
A reboot in the first 20 min of the kh29119-Jan2 WU, running along with 3 CPU tasks, prompted me to switch the driver to 182.50. Alongside that, I changed from NVIDIA System Tools 6.02 to 6.03 (may or may not be part of the problem).
Running GPUGRID + 4 CPU Tasks now.
If the computer remains stable, it would be the first time a new driver has fixed a 'Windows issue' in my 9 years of having my own desktop.
I just wonder why using the newest WHQL driver would be better than the 'NVIDIA Driver with CUDA Support' from the link provided in the manual on the Get CUDA website.
Btw.: The WHQL Driver seems to need more time to complete a WU IMHO. Not CUDA optimized but more stable?
Since nobody replied to this topic so far, thanks for providing a public GPUGRID diary for me. lol |
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Sorry nobody replied. I have never heard of reboots though!
gdf |
|
|
|
Dear GDF,
no problem. Did I use a wrong term? By reboot I mean the computer restarts by itself.
The new driver did not fix it. Computer just restarted again.
I no longer see a connection between running GPUGRID CUDA tasks and running CPU WUs from other projects alongside them.
Just downloaded 2 more GIANNI WUs and I'm expecting reboots while running them.
I don't have proof that syP9764 WUs will never cause reboots, so I've decided NOT to abort every WU I get until a syP9764 WU comes along again.
After completing the GIANNI tasks, I'm off, and I may come back when I get a new video card (my card is from MSI; I hope those who have the same one don't have such problems)
and/or a new CPU/mainboard/RAM. Maybe even when new WHQL drivers are released.. =)
|
|
|
|
Dear Diary,
I just had a sudden restart, when I tried to tell you that I called my hardware dealer.
They say my card is broken and they want to give me a new one. ( 4.5 years guarantee left O_o ).
They didn't even ask about drivers.
If they don't have the same card they give me the money back. This happened 4 years ago too and I came back with a much faster card.
Back then, and even when I bought this card 9 months ago, I would never have thought about having a video card replaced because of problems doing scientific research.
Sorry@all who don't think this is the right place for a diary. lol Couldn't resist.
later,
Moabiter
____________
Relying on Boinc to satisfy your need for happiness might result in temporary unhappiness. This can be applied to almost everything. |
|
|
uBronan
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
Well, it's hard to find a solution in these matters; that's why I was reading along but have no solution at hand.
Most of the time these problems are either hardware related or driver related; that's why I think people were not reacting. |
|
|
|
Hi uBronan!
Well, a new video card might be the only solution left if I want to run GPUGRID ;) Still, it would be great if I didn't have to wait for a new video card. But now I can buy a cheap card without CUDA support and wait 3 days until a new high-end GPU is sent to me. I'll try to do that next week.
Just to clear things up:
Sorry in advance for my poor English. I read better than I type or spell.
outside 19°C/day
videocard 71°C - fanspeed 75%
outside 5°C/night
videocard 78°C - fanspeed 55%
This GPU temperature I read with Speedfan 4.37. Same with Nvidia System Monitor.
So far I have not seen any spikes within the 500ms display update time.
Before I started running CUDA, I never saw GPU temp go above 69°C. Since I run CUDA, either Seti or GPUGRID (managed to never have WUs of both apps available), I am eager to keep it below 80°C.
What I wrote about 68-78°C is the fluctuation between day (68) and night (78).
CPU - I haven't come across a good and free tool to monitor CPU temps. Speedfan shows me max 62°C on a hot day but doesn't show the different temps between each core. No spikes I can see.
I have not tried memtest86 yet. Vista's own memory test reports no errors.
My problem installing a new driver is no more. I had to unplug the TV.
When Windows restarted ( when I wanted it to restart ) after uninstalling the driver I had to allow Vista to complete the installation of Vista's own driver for the video card. Without that it would say "you're running 32bit uninstaller.. no 64bit OS.." .. something like that.
I downclocked it to 602/1296/1107MHz to 'undo' the factory overclocking. To downclock it even further never came to my mind.
Ever since I've had this computer, Avira AntiVir Premium has been guarding it. No virus found in the last scan. |
|
|
|
I'm using RivaTuner to monitor GPU temps. I have a pair of GTX 295s causing me massive issues atm. It could be temp related, so I ran my fans up to 80%, keeping temps around 67°C; I have seen them in the mid 70s when it was crashing most. Under Vista I have a sidebar app that displays RivaTuner's temps; I use the NVIDIA tool for all controls.
Edit: I've had a heap of reboots too. It's usually a BSOD, but one was Windows Update. Try watching the machine to see if you get a BSOD. |
|
|
|
"good and free tool to monitor CPU temps"
RealTemp: http://www.techpowerup.com/realtemp/
CoreTemp: http://www.alcpu.com/CoreTemp/ |
|
|
|
Where I have had issues before is that if CPU temperatures get close to 70 Celsius at peak usage, my computer will shut down or reboot.
Curiously enough, this seems to happen more when GPUGRID is doing a download and preparing to start up again.
A better fan setup seems to get rid of this problem for me. In the warmer summer months, I may just run GPUGRID with no other BOINC projects running just to keep the CPU temperatures down. |
|
|
|
It can be a PSU problem. During a download you also have more disk activity. If the PSU is marginal, the heavy computing load of the GPU added to the increased load of the disk drive may be enough to signal a power-out event, leading to the MB toggling reset, and you get a reboot.
Room temp plays a part because the increased room temp ups the fan speed for the same loads ... also more draw on the PSU ... |
|
|
|
Hi there!
I should sheepishly go into a corner. I thought I had Speedfan figured out, but that was not the case.
To be honest, I have no good excuse for why I read 60°C as the CPU temperature.
I figure it was at 71°C most of the time. Now, if something is broken, it might be the CPU. Maybe the Scythe Mugen prevented some serious damage.
Sometime 1-2 months ago, the CPU fan was only able to run at 100 rpm. I was even wondering how it could keep the CPU cool without running at 500 rpm at least.
Thanks to all who kept pointing at temperatures!!!
Now, I hope the GPU did not get damaged by the CPU.
After all, I'm really happy that I did not keep my Diary private.
And if the problem still occurs with GPUGRID (still wondering why then everything else did not cause reboots) I will look out for a new AM2+ CPU instead of claiming guarantee.
Checking CPU-fan = 1200rpm now; Core=55°C If I'm not mistaken again.
Will take a good look at all the programs you mentioned.
And thanks to all who kept me on the topic with your postings too! |
|
|
|
Room temp plays a part because the increased room temp ups the fan speed for the same loads ... also more draw on the PSU ...
I think even more than the current drawn by the fans, high temperatures cause more power to be used simply as a physical byproduct of temperature on the electronics. As temperature goes up, electrical resistance drops, causing amperage to go up. Of course, that causes temperatures to rise even more, which causes resistance to drop more, which causes amperage to go up more, which causes temperature to go up even more, and so on.
It's really a shame that there isn't any monitoring built into either power supplies or motherboards that let you know when you have a marginal power situation. The only clue you get is intermittent failures, and there's an awful lot of things that can cause that.
Mike
|
|
|
|
Not to be too optimistic, but the issue seems to be no more. If I read correctly, even the GPU temp is 10°C lower than before at the same GPU fan speed.
I guess, I have to work for GPUGRID a few more days to make sure. =)
I hope there is no long term damage.
And maybe I'm wrong, but I think the temperatures in the case are an issue sometimes too.
My PSU gets fresh air from the outside at the bottom of the case.
A fan behind the PSU sucks air from the bottom and another fan provides fresh air at the front to cool the hard disks.
One fan takes out warm air at the back behind the CPU cooler while another fan pulls air from near the video card through the CPU cooler up to the top.
At the top, a fan exhausts air.
All of these fans are 120mm.
The last fan is a side fan to intake air (230mm).
I'm not sure whether 2 more optional fans at the top would improve the airflow much.
____________
Relying on Boinc to satisfy your need for happiness might result in temporary unhappiness. This can be applied to almost everything. |
|
|
|
Hi Moabiter,
you're very probably seeing CPU temperature related problems. Let me tell you a bit:
- the Phenom 9950 is a very hot and power consuming beast. It's got a TDP of 140W and actually needs that amount of power.
- it also doesn't like high temperatures, and/or its temperature is measured at cooler spots compared to Intel.
- at work I had a X2 6000+ (90nm, 125W) and *cough* under sustained load the machine would freeze or restart about once a week
- I switched to a Phenom 9850 (125W) and the situation remained (also switched the board for a rather similar one)
- during fall the freeze interval extended to 1 - 2 months, during winter it disappeared
- in spring it reappeared and I mounted an Arctic Cooling tower cooler, but surprisingly neither temps nor noise went down significantly
- when running prime95 I saw: when temps reported by Everest went to 60 - 65°C I started to get computation errors (max specified: 62°C)
- problem was that the fan was already running at full blast!
- this story was meant to show you how critical these hot headed chips can be regarding temperature. Now on to your case: GPU-Grid works the GPU harder than SETI and thus heats the interior of your case even more. This increases cpu temps and this likely pulled you above the stability threshold. That's why running 3 instead of 4 cpu tasks helped: now the cpu generated less heat itself.
- what I did: using "K10Stat" I can set power profiles for my cpu. I downclock it and concurrently lower its voltage. This way I can choose how much power it is allowed to consume, so that it stays at ~55°C and the noise stays tolerable. If you like I could tell you how to use it; you'd just have to invest a few hours in stability testing to avoid buying a new cpu.
- some values:
2.50 GHz | 1.30 V | 125 W (stock setting, much too much power draw!)
2.40 GHz | 1.175V | 98 W (barely manageable)
2.30 GHz | 1.150V | 90 W (works)
2.20 GHz | 1.125V | 82 W
2.10 GHz | 1.10 V | 75 W (so far didn't have to go lower than this)
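The wattages in this list are consistent with the usual rule of thumb that dynamic CPU power scales with clock frequency times voltage squared (P ∝ f·V²). A minimal Python sketch, assuming that rule holds and taking the 2.50 GHz / 1.30 V / 125 W stock setting as the reference point. The helper name `scaled_power` is made up for illustration, and the results are estimates, not measurements:

```python
# Rule-of-thumb estimate: dynamic CPU power ~ frequency * voltage^2.
# Reference point is the stock setting from the list (assumed to be exact).
REF_GHZ, REF_VOLT, REF_WATT = 2.50, 1.30, 125.0


def scaled_power(ghz: float, volt: float) -> float:
    """Estimate power draw in watts relative to the stock setting (hypothetical helper)."""
    return REF_WATT * (ghz / REF_GHZ) * (volt / REF_VOLT) ** 2


# Settings from the list: (clock in GHz, core voltage in V)
settings = [(2.50, 1.30), (2.40, 1.175), (2.30, 1.150), (2.20, 1.125), (2.10, 1.10)]

for ghz, volt in settings:
    print(f"{ghz:.2f} GHz | {volt:.3f} V | {scaled_power(ghz, volt):.0f} W")
```

Because voltage enters squared, lowering the voltage saves much more power than lowering the clock alone.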
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
Hi MrS!
Thanks for sharing your experience! Very interesting!
According to PCGamesHardware, a German computer magazine, the temperature of a Phenom 9950BE should not go above 64°C.
I also lied about my PSU. I checked it as I said, but I looked at a wishlist that was not up to date.
I planned on buying a 650W power supply, but when I had the opportunity to get 'whatever I want' as a gift I chose a BeQuiet Straight Power 700W.
Expecting 8-10 more degrees in summer, I prefer to add more fans to the CPU cooler instead of lowering its clock. Up to 4x120mm fans can be mounted.
Having K10Stat as an option if nothing else helps is great, I am sure.
On the other hand, the possibility of doing more damage using this program means it is not really an option.
For my Diary:
outside: 22°C;
CPU - 55°C fan 1257rpm
GPU - 68°C fan 65%
If I am not mistaken. GPUGRID + 4 CPU WUs running since I fixed the CPU fan. |
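Readings like these can be checked against the limits mentioned in this thread: 64°C for the Phenom 9950BE (the PCGamesHardware figure) and the 80°C GPU ceiling from the earlier post. A minimal Python sketch. The names `LIMITS_C` and `over_limit` are hypothetical, the limits come from forum posts rather than a datasheet, and the readings are typed in by hand rather than read from a sensor:

```python
# Limits taken from this thread, not from a datasheet (assumption):
# CPU 64 °C (PCGamesHardware figure for the Phenom 9950BE),
# GPU 80 °C (the self-imposed ceiling mentioned earlier).
LIMITS_C = {"CPU": 64.0, "GPU": 80.0}


def over_limit(readings: dict[str, float], margin: float = 0.0) -> list[str]:
    """Return components whose temperature is within `margin` °C of the limit, or above it."""
    return [
        name
        for name, temp in readings.items()
        if name in LIMITS_C and temp >= LIMITS_C[name] - margin
    ]


# Diary entry: CPU 55 °C, GPU 68 °C
diary = {"CPU": 55.0, "GPU": 68.0}
print(over_limit(diary))             # components at or above their limit
print(over_limit(diary, margin=10))  # components within 10 °C of their limit
```

A margin is useful here because summer is expected to add another 8-10 degrees.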
|
|
|
It's quite possible that fixing the cpu fan speed will be enough to keep you stable. And I'd like to point out that what damages your hardware is voltage and temperature. That's why overclocking with increased voltages shortens the chip's life span, whereas just increasing the clock speed doesn't change much.
You can choose the reverse way and instead of overclocking you can lower the cpu voltage, be it underclocked or at stock clock speeds. This is actually better for your hardware and your power bill :)
It doesn't matter to me if you choose to do so or not.. I just wanted to make your options clear and that there's no danger involved here (except failing a stress test + reboot / freeze if you set the voltage too low).
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
Interesting. I haven't bought an AMD CPU for nearly two years, ever since the Core2 line came out. The big reason I've been only buying Intel since then is the lower heat/power on the Core2 chips. Especially on laptops, this translates into much longer battery life.
It would be interesting to find out if there's a correlation between the assorted problems some people are having and whether their CPUs are Intel or AMD.
It could indeed be that these hard to track down problems with running GPUGRID are related to CPU temperature. Even the double width GPUs which vent most of their heat out the back still exhaust some of the cooling air back into the computer case, where it will contribute to CPU heating. |
|
|
|
It's not AMD cpus in general, it's only the ones where AMD desperately pushed clock speed a little too much (in my opinion). Most of their lineup is fine, though. The troublemakers: Phenom I 9850, 9950, Athlon 64 X2 6000+ and 6400+ in 90 nm. The new 45 nm Phenom II are surprisingly competitive with Core 2 Quads (price, performance and power), as long as no OCs are involved. With OC Intel still handily wins.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
The new 45 nm Phenom II are surprisingly competitive with Core 2 Quads (price, performance and power), as long as no OCs are involved. With OC Intel still handily wins.
It's good to hear that AMD is becoming competitive again. I like being the beneficiary of having two CPU manufacturers slugging it out. It's good for the wallet and good for the GFLOPS.
The Phenom Is, which came out shortly after Core2, were no match for the Intel chips in either speed or power consumption. I haven't had a chance to play with the Phenom IIs yet, so thanks for that information.
Mike |
|
|
|
In general, after buying a new computer I don't take much care about comparing components I have with new ones.
Every time you buy new hardware, 1 month later you get it for less and/or you get something much better.
Every few years in the past I've had the opportunity to get new hardware. The Phenom X4 9950BE I chose because the price was right and I didn't want to go below 2.6GHz, which was the clock speed of my previous CPU.
Well, I keep an eye on new hardware to know what to buy if something is broken, but until then I am happy to be able to use what I got.
For now I have fun (again) with this somewhat factory overclocked and hungry central processing unit.
Whenever I become tired of playing games and bored of running BOINC, I will be more than willing to have a small 'GREEN' barebone just for internet, iTunes, TV and watching Blu-rays or whatever I want to do in the future. =) |
|
|
|
I'm very confident now that the defective CPU fan caused the reboots. No restarts for almost 2 days.
I plan to get one of those Noctua fans to have at least one backup fan to jump in if one fails.
They also seem to have more air pressure than the current Scythe fan which came with the Mugen.
If you know of any silent but good CPU fans, I would be glad if you told me.
Hopefully, one day graphics card coolers/fans get a boost in performance/dB too. (Larrabee?) |
|
|
|
I haven't bought an AMD CPU for nearly two years, ever since the Core2 line came out. The big reason I've been only buying Intel since then is the lower heat/power on the Core2 chips.
Agreed. I bought one Core2Duo and three Core2Quads for the same reason in the last two years, but two months ago one X4 810: it was cheap, shows lower heat and power consumption and can be cooled with the included boxed cooler, great. Why spend a lot of money on a high-clocked or overclockable CPU if a GPU is more valuable for science and the better investment? Overclocking CPUs increases the consumption disproportionately; however, Intel quads as well as the new AMD quads can be undervolted with good prospects. (For some weeks last year I thought about buying an X4 9150/9350, but luckily I decided against it.)
When I began overclocking I had some restarts and reboots while BOINC Manager was running, but since those experiences I have always run thorough tests with benchmark software before starting to crunch workunits.
If you know of any silent but good CPU-Fans, I would be glad if you tell me.
Perhaps take a look at the OCZ Vendetta 2 or the EKL Groß-Clock'ner (one of my best coolers). I would not spend a lot of money on cooling just to shave off a few degrees, but I fear (CPU) monsters can only be fought with (cooler) monsters. |
|
|
|
@Snow Crash
"good and free tool to monitor CPU temps"
RealTemp: http://www.techpowerup.com/realtemp/
CoreTemp: http://www.alcpu.com/CoreTemp/
CoreTemp doesn't show me temperatures for individual cores.
My CPU is not supported by RealTemp.
Thanks anyways =)
QX1900AIW
These are some nice coolers ;)
I was just surprised by the monitor turning black, turned off mouse/keyboard LEDs but no reboot this time. I had to turn off the PC by holding the switch down for 5 seconds.
The CPU fan I thought I had fixed turned out to be causing problems again. Sometimes it goes below 800rpm without being told to do so. 24°C outside did the rest I guess. Strangely, the NVIDIA was below 68°C.
A new CPU-fan is coming soon together with some cooling paste.
For my Diary:
Never buy computers in autumn!
____________
Relying on Boinc to satisfy your need for happiness might result in temporary unhappiness. This can be applied to almost everything. |
|
|
|
Sounds like you have a BIOS setting for the cpu fan control, which you have to override manually upon every boot. You might as well fix the speed in the BIOS (or switch off the automatic control).
And that Noctua fan should be quite good!
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|