Message boards : Graphics cards (GPUs) : System crash when entering sleep while GPU task running
Author | Message |
---|---|
About once every week or two, when I put my PC to sleep at the end of the day, it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a "bug check" (crash dump) due to the kernel having crashed, as seen in the Event Log. The GPUGrid task also fails with the message "The simulation has become unstable. Terminating to avoid lock-up (1)". Unfortunately I can't tell if this message was written before or after the reboot. NOTES: no overclocking, last recorded temp seems reasonable at 67 °C. I'm not running a bleeding-edge driver (399.07), but I note this was also occurring with an older driver (391.35). Now putting my developer's hat on (although I'm not a systems developer), my hunch here would be that some kernel memory corruption has occurred, and when Windows comes to checkpoint its processes it barfs. This is also causing data corruption outside of BOINC. A number of recently-written files end up "zero-padded", i.e. their content replaced with 0x00's. (Probably something to do with SSD / write caching.) I've had to manually repair my git repo a number of times now, for example. I'm starting to worry whether other files might have been zapped that I just haven't noticed yet. I realise it's hard to separate cause & effect from the symptoms. For example, could the crash be due to something else, which then zeros out some of GPUGrid's files and that's why IT fails? From a low sample size of fewer than a dozen such crashes, I can only say that it never occurred when I didn't run BOINC (i.e. before this winter). If it would help, I can send the resulting crash dump to a developer (230 MB 7-Zipped). PM me if interested. | |
ID: 51530 | Rating: 0 | |
Quick question for you. | |
ID: 51534 | Rating: 0 | |
> Did you tell BOINC to stop processing when you exit or did you just exit BOINC and leave the task crunching? | |
ID: 51536 | Rating: 0 | |
"About once every week or two, when I put my PC to sleep at the end of the day, it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a 'bug check' (crash dump) due to the kernel having crashed, as seen in the Event Log." GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so. I always exit BOINC with the science apps stopped as well, then watch the MSI Afterburner tray monitors (GPU usage and temperature) until they go down. After that I restart or turn off my PC. You should do the same to avoid such errors (until this bug gets fixed = forever). When your PC is turned on (by a timer or by you) from a complete shutdown, BOINC will continue the tasks it holds. (You don't need to suspend them with the OS.) If you have a password-protected user account, it won't log on automatically at startup unless you set Windows not to ask for a username and password at startup. You can specify which user account to log on at startup as follows: press Windows key + R, type control userpasswords2 and press [Enter] or click [OK], then uncheck the checkbox ("Users must enter a user name and password to use this computer"), click [OK], type your username and password (twice), and click [OK]. If your PC is connected to a domain, you should specify your username as domain\username. I also recommend turning off the "fast system startup" option in power management. | |
ID: 51538 | Rating: 0 | |
"GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so." This is exactly the procedure that I use and it works. Never shut down the PC until the GPUGrid app stops. Avoid sleep and hibernate. Also turn off write caching on the BOINC drive. If you take these steps your errors should disappear. Listen to Zoltan when he gives GPUGrid advice. :-) | |
ID: 51540 | Rating: 0 | |
I 3rd that. Shutting down the OS with BOINC running can result in computation errors, especially for GPU tasks. | |
ID: 51541 | Rating: 0 | |
Thanks for the tips, folks. "GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so." Right, that's good to know. I think my next step will be to try a middle-ground approach: I'll use "Snooze GPU" and wait a few minutes for the GPU tasks to idle properly before putting my PC to sleep. I can watch for the GPU RAM dropping in Process Explorer to confirm. I'd be willing to bet (at least a few fillér) that this will improve it. I'm guessing the OS is too impatient when suspending the apps and doesn't let GPUGrid checkpoint safely. Worth a shot anyway. "Also turn off write caching on the BOINC drive." That would be safer, yes, but it would hurt my code compile times, and even if I segregated BOINC to a separate SSD I'm not sure the main drive would be saved from a hard kernel crash, so I think I'd need to do it on both of my SSDs. Fingers crossed suspending GPUGrid will work! I'll report back. (FYI: you know I even went with an SSD specifically because they claimed its super-capacitors would protect data "in flight" during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back. Creating more git-repo repair jobs for me. Fun times.) | |
ID: 51544 | Rating: 0 | |
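The "Snooze GPU, then wait until the card is idle" plan in the post above could be scripted roughly as follows. This is only a sketch under several assumptions: that `boinccmd` and `nvidia-smi` are on the PATH and `boinccmd` can authenticate to the local client, that `boinccmd --set_gpu_mode never <seconds>` behaves like the Manager's "Snooze GPU", and that the thresholds (5 % utilisation, 500 MiB, three quiet polls in a row) are placeholders to tune for your own desktop. The function names are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Tuple


def read_gpu_load() -> Tuple[int, int]:
    """Return (utilisation %, memory used in MiB) of the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    util, mem = out.strip().splitlines()[0].split(",")
    return int(util), int(mem)


def snooze_gpu(seconds: int = 3600) -> None:
    """Ask the local BOINC client to stop GPU work for `seconds` (assumed CLI usage)."""
    subprocess.run(["boinccmd", "--set_gpu_mode", "never", str(seconds)], check=True)


def wait_until_idle(
    read: Callable[[], Tuple[int, int]] = read_gpu_load,
    max_util: int = 5,          # placeholder threshold, percent
    max_mem_mib: int = 500,     # placeholder threshold, MiB
    quiet_polls: int = 3,       # consecutive idle readings required
    poll_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the GPU until it looks idle; return False if the timeout expires first."""
    quiet = 0
    waited = 0.0
    while waited <= timeout_s:
        util, mem = read()
        idle = util <= max_util and mem <= max_mem_mib
        quiet = quiet + 1 if idle else 0   # any busy reading resets the streak
        if quiet >= quiet_polls:
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

Typical use would be to call `snooze_gpu()`, then only put the PC to sleep if `wait_until_idle()` returns `True`. Requiring several idle readings in a row is a deliberate choice: a single low reading can occur between kernels while the task is still shutting down.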
Clarification: I say 'checkpoint' in the sense of Windows recording the state of the processes rather than the BOINC term for tasks saving their progress. I've just realised that could be ambiguous. (Hmm, now do processes need to be checkpointed for sleep, or is that just for hibernate I wonder...) | |
ID: 51545 | Rating: 0 | |
"(FYI: you know I even went with an SSD specifically because they claimed its super-capacitors would protect data 'in flight' during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back. Creating more git-repo repair jobs for me. Fun times.)" That feature can't save the data from the write cache of the OS. That's why disabling write caching, or a good UPS, is recommended. (Is it a Samsung PRO SSD?) These SSDs have a relatively large DDR RAM cache, so you should try disabling write caching in the OS and check how much it hurts code compilation times. Perhaps you lose less time than the time spent repairing things. | |
ID: 51552 | Rating: 0 | |
Yeah, had a UPS. It died, so I was hoping this SSD would suffice instead of more lead-acid batteries. "That feature can't save the data from the write cache of the OS." Indeed. I guess I'd hoped that the "safer" level of write caching would be good enough with the SSD's capacitors. But I suppose my hopes for this complex pipeline involving multiple vendors were too high. :) "so you should try disabling write caching in the OS" Promising result! After disabling all write caching, a full Java compile goes from 48 s to 52 s. But then: I've got System/Data partitions on my Samsung 960 EVO (non-PRO, I think) and a small Intel SSD as a temp drive. I've already configured it so the bulk of writes during a compile go to the Intel, so I only need to disable write caching on the Samsung. Looks like this will: protect my data better, keep my 48 s compile times, and hopefully still avoid a new UPS. Win win win. (I've got a similar amount of C++, which takes ~9 min to build, but I think that would turn out similarly.) "Perhaps you lose less time than the time spent repairing things." True. I guess my bigger concern is that as compile times increase I become more likely to lose focus, flick over and check the news, and if something catches my interest, "Poof!", 15 minutes are gone. :-D | |
ID: 51571 | Rating: 0 | |
A quick update: I've since had two such hard shutdowns out of 8 attempts at sleeping. | |
ID: 51598 | Rating: 0 | |
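On the worry from the opening post that other recently written files may have been silently replaced with 0x00 bytes: a small scan along these lines might help find them after a crash. It is a sketch, not a recovery tool. It assumes the damage shows up as files that are entirely zero bytes (a file that is only partly zeroed would not be flagged), and the 7-day window and function names are arbitrary choices for illustration.

```python
import os
import time
from typing import Iterator


def is_all_zero(path: str, chunk_size: int = 1 << 20) -> bool:
    """True if the file is non-empty and every byte in it is 0x00."""
    seen_data = False
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            seen_data = True
            if chunk.count(0) != len(chunk):   # found a non-zero byte
                return False
    return seen_data   # empty files are not reported


def find_zeroed_files(root: str, max_age_days: float = 7.0) -> Iterator[str]:
    """Yield recently modified files under `root` that consist only of zero bytes."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) >= cutoff and is_all_zero(path):
                    yield path
            except OSError:
                continue   # unreadable or vanished file: skip it
```

For example, `list(find_zeroed_files(r"C:\projects", max_age_days=2))` would list suspects modified in the last two days (the path is a placeholder). For the git repository specifically, `git fsck --full` is the more thorough check, since it verifies object contents rather than just looking for zeros.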