Message boards : Graphics cards (GPUs) : Posted on BOINC Alpha; re: 6.6.20
Author | Message |
---|---|
I posted this on the Alpha mailing list; those of you with 4- and 8-core systems may want to consider how your systems handle workloads in light of these observations ... if you are only running one or two projects these notes may not apply ... comments, as always, welcome ... One of my long-standing complaints about BOINC is with the CPU scheduler: the more CPUs you have, the less optimal its scheduling choices seem to be. The CPU scheduler seems to be modeled and tested primarily on single-CPU systems, with occasional duals. Most of the problems I have noted are only readily apparent when you look for them on systems with 4 or more CPUs. My first trip down this rabbit hole was over 4 years ago, when I had one of the first 4-CPU systems and JM VII was lead on the development of the CPU scheduler. | |
ID: 8226 | Rating: 0 | rate: / Reply Quote | |
I think you are being a bit vague about the problems you are experiencing due to these scheduling issues. It's only a guess that the recent GPU-Grid issue is caused by the BOINC scheduler. | |
ID: 8228 | Rating: 0 | rate: / Reply Quote | |
I think you are being a bit vague about the problems you are experiencing due to these scheduling issues. It's only a guess that the recent GPU-Grid issue is caused by the BOINC scheduler. You are correct, it is only a guess ... sans more information from the GPU Grid project, which I have repeatedly asked for in the other thread where I discuss that issue, I can only go on my instincts and observations. Using 6.5.0 (and a couple of other versions) with GPU Grid I have never seen a GPU Grid task suspended while other tasks are then run ahead of previously running tasks. Since most tasks running on this system only take about 6 hours, there is little need for this type of suspension and switching; on average I complete a task roughly once every 90 minutes... And, as to it NOT being 6.6.20 ... well, then I would have expected to have seen it ALREADY since I downgraded ... and I have not ... With 6.6.20 I was seeing it almost continuously with at least one task, and there was no correlation between the task name (id) and the task in trouble. As to the other, you misread what I said ... long running tasks would still switch out at the normal switch interval as determined by the participant (default 60 minutes). I do have "leave tasks in memory" enabled because historically several projects' tasks do not take well to suspension and removal from memory. But this does lead to the issue of a big memory footprint if the CPU Scheduler misbehaves and starts more tasks than needed. But the core issue is that the developers seem to have a chunk of code that is operationally optimal for systems with fewer resources. As I stated, I first started observing issues nearly 4 years ago when I got my first 4 core system. And the internal model does not seem to have been changed.
Sadly, I suspect that one of the reasons this occurs is that the "best" system the developers use for testing likely has 4 cores or fewer, because we all know that the developers are almost always resource starved. But the second problem is that I doubt that they spend hours staring at the execution patterns as I do ... I have two monitors and work on one, and the other is always logged onto a BOINC instance; I watch it out of the corner of my eye and note changes. As to your last point, sadly, you may be correct ... the BOINC Development team has a long history of ignoring suggestions as to how to improve BOINC ... up to and including developed and tested code changes ... | |
ID: 8237 | Rating: 0 | rate: / Reply Quote | |
Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present). | |
ID: 8250 | Rating: 0 | rate: / Reply Quote | |
Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present). I'm running v6.6.20 on 7 machines (5 of which are quads) and have seen no problem with suspending tasks excessively. I have seen a GPUGRID task suspended in favor of running a new one but that's because the newly downloaded task had an earlier due date than the one already running. It seems due dates have been shortened. That's a choice by the GPUGRID admins and doesn't reflect a BOINC problem. | |
ID: 8271 | Rating: 0 | rate: / Reply Quote | |
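The preemption described in the post above (a newly downloaded task with an earlier due date displacing one that is already running) is what an earliest-deadline-first rule would produce. The following is only a minimal sketch of that rule, not BOINC's actual scheduler code; the `Task` class, the `pick_task_for_gpu` function, the task names, and the dates are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Task:
    # Hypothetical task record; not a BOINC data structure.
    name: str
    deadline: datetime
    running: bool = False


def pick_task_for_gpu(tasks):
    """Return the task with the earliest deadline (earliest-deadline-first)."""
    return min(tasks, key=lambda t: t.deadline)


# Illustrative values only: an already-running task and a newly
# downloaded task whose due date is two days earlier.
old = Task("task_a", datetime(2009, 4, 12), running=True)
new = Task("task_b", datetime(2009, 4, 10))

chosen = pick_task_for_gpu([old, new])
# Any running task that was not chosen would be suspended.
preempted = [t for t in (old, new) if t.running and t is not chosen]
print(chosen.name, [t.name for t in preempted])
```

Under this rule, shortening due dates on the project side is enough to change which task runs, with no change in the client.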
Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present). It does if the resumed task or other tasks never suspended change from taking 6 hours to complete to 24 hours ... I am not sure if the issue is with the tasks or with 6.6.20 or what. But, when the only thing changed is 6.6.20 to 6.5.0 and the problems go with the change ... well ... In my case, the quad is the number of GPUs I have running tasks ... on an i7 ... Anyway, for the nonce, I am back operational. I reported the problem which I suspect is nothing more than an exaggeration of a long standing issue that the developers love to ignore because it is inconvenient for them ... :) But, if what I suspect is true this is going to bite them pretty big time real soon when more people and projects start doing GPU work ... Well, I suspect I will quit asking soon as they keep feeding me the Oz line about the man behind the curtain ... | |
ID: 8276 | Rating: 0 | rate: / Reply Quote | |
It may not be the perfect thread for this, but since the discussion got to that point already: | |
ID: 8282 | Rating: 0 | rate: / Reply Quote | |
6.6.20 has other issues. | |
ID: 8294 | Rating: 0 | rate: / Reply Quote | |
Paul, could you switch the i7 back to 6.6.20 and see if you can get the hanging tasks again? If you could observe this we'd know there's a serious problem. Otherwise we can only speculate so much. I was afraid someone was going to ask me to do that ... sigh ... Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run, noting that on that system I have seen it change its mind in less than 30 seconds as to what desperately needs to be done ... If I am well enough tomorrow I will give it a shot for a couple of hours (I had a REAL bad day today and am not in good shape, sorry). I think I can tell if it is behaving badly, and for the heck of it I will turn on logging and see if I can capture something that will indicate what is going on and making those tasks run badly. The real pits is I still need to finish my taxes ... | |
ID: 8296 | Rating: 0 | rate: / Reply Quote | |
Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run, noting that on that system I have seen it change its mind in less than 30 seconds as to what desperately needs to be done ... I don't think that's limited to 6.6.20. When I downgraded to 6.5.0 my CPU switched projects about once a second for a couple of minutes, until I suspended everything but one CPU project. Turning them back on one at a time let everything resume normally. A brief excerpt from the log:
07-Apr-2009 18:59:05 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:06 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:06 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:07 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:08 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:10 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:11 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:12 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:13 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:14 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:15 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:16 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:17 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:19 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:20 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:21 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:22 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:23 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:25 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:26 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:27 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:28 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:29 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:30 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:31 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:32 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:34 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:35 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:36 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:37 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:38 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:39 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:40 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:42 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:43 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:44 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:45 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:46 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501 | |
ID: 8298 | Rating: 0 | rate: / Reply Quote | |
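A log like the excerpt in the post above is easier to reason about once the restarts are counted per task. The sketch below assumes only the line layout visible in that excerpt (`DD-Mon-YYYY HH:MM:SS [project] Restarting task <name> ...`); the `restart_stats` function and `LINE_RE` pattern are hypothetical helpers, not part of BOINC, and the sample text is the first four entries of the excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the "Restarting task" entries in the layout shown in the excerpt.
LINE_RE = re.compile(
    r"(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] Restarting task (?P<task>\S+)"
)


def restart_stats(log_text):
    """Count restarts per (project, task) and return the time span in seconds."""
    counts = Counter()
    stamps = []
    for m in LINE_RE.finditer(log_text):
        counts[(m["project"], m["task"])] += 1
        # Month abbreviations are parsed with the default (English) locale.
        stamps.append(datetime.strptime(m["stamp"], "%d-%b-%Y %H:%M:%S"))
    span = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return counts, span


sample = """\
07-Apr-2009 18:59:05 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:06 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:06 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:07 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
"""

counts, span = restart_stats(sample)
for (project, task), n in counts.most_common():
    print(f"{n:3d}  [{project}] {task}")
print(f"span: {span:.0f} s")
```

A task that shows many restarts within a span of seconds, as in the excerpt, is being switched far more often than any switch interval measured in minutes would explain.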
Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run, noting that on that system I have seen it change its mind in less than 30 seconds as to what desperately needs to be done ... Actually this problem was first described by a guy named Paul D. Buck about 3 or 4 years ago. He noted it with the then current application on his brand new Dell 4 Dual processor (with HT) systems. ... :) I am sorry if you got the impression that I was describing a problem that affects only 6.6.20 ... it is a long-standing problem that arises from the design of the CPU Scheduler, which has as one of its "Prime Directives" to not miss deadlines. The problem is quite simply that the most effective strategy when running on a single resource is not the best strategy when you have multiple processing resources. The most common side effect is that there will be more than the expected number of tasks in the "Waiting to Run" state (WTR). For example, I have a task switch interval (TSI) of 720 minutes (12 hours) and virtually all tasks on that system should run to completion if BOINC honors TSI. But it doesn't, and so right at this moment I have 6 tasks in WTR. The second symptom is that BOINC seemingly changes its mind on what to run on a moment to moment basis with no observable reason. Again, I have a low queue of 1 day, 8+4 processing elements, and by actual count have 1-1.4 days of work on hand; the earliest deadline is 4 days hence, and BOINC is randomly starting and stopping tasks. Some tasks are started, run for seconds to minutes, and then suspended for hours before being run again. If the task was in such need of being run, why isn't it the first task restarted after it was suspended? Anyway, JM VII and I are debating this and he is insisting that all is well ... the problem is that it is noticeable on a 4 processor system but only glaringly obvious on an 8 processor system. 
The other thing that is going on is that BOINC is running the internal model as much as 5 times a minute ... to my mind that is also madness. The processing needs and deadline issues are not going to change that much in that short a timeframe. Heck, on the system in question I counted the tasks done in 24 hours and it was 252 with FreeHAL counted and 219(?) without. Fundamentally, I was averaging a task completion once every 6 minutes. This was confirmed with another count over another 12 hour period. To my mind, that means that the next task in deadline trouble can be scheduled to the next free resource and run then ... | |
ID: 8301 | Rating: 0 | rate: / Reply Quote | |
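The "once every 6 minutes" figure in the post above follows directly from the task counts quoted there. The short calculation below reproduces it; the counts (252 and the uncertain 219) and the "as much as 5 times a minute" rate are taken from the post, while the function and variable names are made up for this illustration.

```python
MINUTES_PER_DAY = 24 * 60


def minutes_per_completion(tasks_per_day):
    """Average minutes between task completions for a given daily count."""
    return MINUTES_PER_DAY / tasks_per_day


with_freehal = minutes_per_completion(252)     # count including FreeHAL
without_freehal = minutes_per_completion(219)  # count marked "(?)" in the post

# Upper bound quoted in the post: the internal model runs up to 5 times a minute.
scheduler_runs_per_completion = 5 * with_freehal

print(f"{with_freehal:.2f} min/task with FreeHAL")
print(f"{without_freehal:.2f} min/task without FreeHAL")
print(f"up to {scheduler_runs_per_completion:.1f} model runs per completed task")
```

So a processing element frees up roughly every 6 minutes on that system, which is the basis for the argument that a task in deadline trouble could simply wait for the next free resource.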
Paul, | |
ID: 8303 | Rating: 0 | rate: / Reply Quote | |
Michael, | |
ID: 8305 | Rating: 0 | rate: / Reply Quote | |
Feel free to summarize and repost. I'm not up to leading this crusade right now. | |
ID: 8311 | Rating: 0 | rate: / Reply Quote | |
I started a 6.6.21 thread in that there is a version of that name available... | |
ID: 8317 | Rating: 0 | rate: / Reply Quote | |
6.6.20 has replaced 6.4.7 as the official BOINC release. | |
ID: 8326 | Rating: 0 | rate: / Reply Quote | |
Yes, and there are a couple potential major issues that may bite us in the butt too ... | |
ID: 8328 | Rating: 0 | rate: / Reply Quote | |