
Message boards : Graphics cards (GPUs) : no cuda work requested

HPew
Message 8686 - Posted: 21 Apr 2009 | 19:41:12 UTC

I've just installed BOINC 6.6.20 on WinXP. Whenever BOINC asks for new work, the server returns these three messages:

No work sent
Full-atom molecular dynamics on Cell processor is not available for your type of computer.
cuda app exists for Full-atom molecular dynamics but no cuda work requested.


Why would an absolutely default set-up not work properly?

ExtraTerrestrial Apes
Message 8694 - Posted: 21 Apr 2009 | 21:56:36 UTC - in response to Message 8686.

Your driver may be too old or, more likely, your GPU is not supported. Can't say for sure, though, since your computers are hidden.

MrS
____________
Scanning for our furry friends since Jan 2002

HPew
Message 8696 - Posted: 21 Apr 2009 | 22:01:07 UTC

I. Am. Not. An. Idiot.

The card is a G92, the driver is the latest from nvidia as of yesterday--182.50. The system has crunched two WUs but is being refused further units.

Paul D. Buck
Message 8699 - Posted: 21 Apr 2009 | 22:06:56 UTC - in response to Message 8696.

I. Am. Not. An. Idiot.

The card is a G92, the driver is the latest from nvidia as of yesterday--182.50. The system has crunched two WUs but is being refused further units.

No one said you were.

But when you hide your computers, ETA can't answer some of those questions with a quick peek there.

Since these are the most common problems, they are also the most commonly offered suggestions.

There are issues with the 6.6.20 and 6.6.23 versions that affect some people and not others. The next suggestion is to do a project reset on GPU Grid. If that does not work, reset all debts ...

ExtraTerrestrial Apes
Message 8700 - Posted: 21 Apr 2009 | 22:07:13 UTC - in response to Message 8696.

I. Am. Not. An. Idiot.


Well then, sorry.. but from your post there was no way to tell.
Do you still have some WUs running, or are you dry?

MrS
____________
Scanning for our furry friends since Jan 2002

Alain Maes
Message 8704 - Posted: 21 Apr 2009 | 22:18:55 UTC - in response to Message 8696.

I. Am. Not. An. Idiot.



Of course you are not, since you are capable of asking a perfectly acceptable question. But please also accept that ETA is one of the most respected people here, as anyone else is respected by definition until proven otherwise; he and all the others are just trying to help within their means and capabilities, no offence intended.
His first reaction is also pretty standard for those who have followed this forum, since the possible reasons for failure he mentioned are pretty common even for "not idiots".
So if you really want serious help, please describe your system and problem in more detail. Unhiding your computers will help a lot here, since it will allow us to see the results of the failing WUs, including any error messages.
Hope we will be able to help you to help science and humanity.

kind regards.

Alain

HPew
Message 8705 - Posted: 21 Apr 2009 | 22:27:13 UTC - in response to Message 8700.
Last modified: 21 Apr 2009 | 22:29:11 UTC

The afflicted PC will run out of WUs around 4 AM. The message tab is filled with red.

At some more reasonable hour I'll detach gpugrid and re-attach to see if that fixes it.

Maes: The WUs are not failing, the server is refusing to give me more.

Apology to ETA: Sorry for my abruptness.

Phil Klassen
Message 8707 - Posted: 22 Apr 2009 | 1:57:03 UTC - in response to Message 8705.

I had the same message a few hours ago on my i7 with 3 GTX cards. I just played a game, rebooted, and then it downloaded some work??? Not sure, because I have PS3s and GPUs running. The message came up on my i7, then it fixed itself. Maybe the reboot had something to do with it.
____________

(_KoDAk_)
Message 8712 - Posted: 22 Apr 2009 | 9:07:45 UTC

GPU results ready to send: 0

HPew
Message 8723 - Posted: 22 Apr 2009 | 14:45:51 UTC

*Sigh* The message tab shows an 'ask & refusal' every hour or so, but when this machine had completely run out of work and I manually updated, it was given a single new WU amid all the refusals.

ExtraTerrestrial Apes
Message 8739 - Posted: 22 Apr 2009 | 20:17:50 UTC - in response to Message 8705.

Apology to ETA: Sorry for my abruptness.


You're welcome :)

Let's try to solve your problem then. Today I remembered that I also got the message "no cuda work requested" when I tried 6.6.20. I quickly reverted to 6.5.0 and the box has been running fine since then. You could also try 6.6.23, which supposedly fixed some of the issues of 6.6.20.

So far I haven't seen anyone else report this behaviour with 6.6.20, so it seems to be a rare case.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 8758 - Posted: 23 Apr 2009 | 5:11:53 UTC - in response to Message 8739.
Last modified: 23 Apr 2009 | 5:13:58 UTC

Apology to ETA: Sorry for my abruptness.


You're welcome :)

Let's try to solve your problem then. Today I remembered that I also got the message "no cuda work requested" when I tried 6.6.20. I quickly reverted to 6.5.0 and the box has been running fine since then. You could also try 6.6.23, which supposedly fixed some of the issues of 6.6.20.

So far I haven't seen anyone else report this behaviour with 6.6.20, so it seems to be a rare case.

Um, no...

I am becoming less and less convinced that it is isolated. Sorry ... I thought I was being clear.

It looks like both 6.6.20 and 6.6.23 have a problem with debt accumulating in one direction and never being properly updated. The eventual result on GPU Grid is that you get fewer and fewer tasks pending until you start to run dry. Version 6.6.20 had some other problems with suspending tasks, and something else that could really mess things up, which I think was the source of the tasks that took exceptionally long to run. 6.6.23 seems to have fixed that. This 6.6.20 problem may mostly affect people running multiple GPU setups.

But that means that 6.6.20 and 6.6.23 are not, in my opinion, ready for prime time.

I *DO* like the new time accounting, which lets you see more accurately what is happening with the GPU Grid tasks, so for the moment I am personally sticking with 6.6.20 on one system and 6.6.23 on my main one. That is also because I am trying to call attention to these issues, and the only way to collect the logs is to run the application. Sadly, as usual, the developers don't seem to be that responsive to feedback ...

To put it another way, they are very good at ignoring answers to questions they don't want asked.

{edit}

For those having any kind of problem with 6.6.x, try 6.5.0, and if the problem goes away, stay there. Sadly, I will stay on point and will be sending reports from the front as I get them. Failing that, you can always ask directly ...

ExtraTerrestrial Apes
Message 8797 - Posted: 23 Apr 2009 | 19:27:15 UTC - in response to Message 8758.

No work sent
Full-atom molecular dynamics on Cell processor is not available for your type of computer.
cuda app exists for Full-atom molecular dynamics but no cuda work requested.


I understand this message to mean that BOINC does request work from GPU-Grid, but it does not request CUDA work (which would be extremely strange / stupid), and hence the server is not sending CUDA work.
Am I totally wrong here?

MrS
____________
Scanning for our furry friends since Jan 2002

Stefan Ledwina
Message 8799 - Posted: 23 Apr 2009 | 19:36:55 UTC - in response to Message 8797.

I also understand it that way...
____________

pixelicious.at - my little photoblog

Bymark
Message 8801 - Posted: 23 Apr 2009 | 19:57:32 UTC - in response to Message 8799.
Last modified: 23 Apr 2009 | 20:52:35 UTC

I think this is normal if your cache is set to 0.
My one computer with a 250:

Network usage:
Computer is connected to the Internet about every (leave blank or 0 if always connected; BOINC will try to maintain at least this much work): 0 days
Maintain enough work for an additional (enforced by version 5.10+): 0 days

Seti is more polite:
Running GPUGRID and Seti at 20 / 1, and I always have GPU work on both.


23.4.2009 03:11:27 GPUGRID Sending scheduler request: To fetch work.
23.4.2009 03:11:27 GPUGRID Requesting new tasks
23.4.2009 03:11:32 GPUGRID Scheduler request completed: got 0 new tasks
23.4.2009 03:11:32 GPUGRID Message from server: No work sent
23.4.2009 03:11:32 GPUGRID Message from server: Full-atom molecular dynamics on Cell processor is not available for your type of computer.
23.4.2009 03:11:32 GPUGRID Message from server: CUDA app exists for Full-atom molecular dynamics but no CUDA work requested
23.4.2009 03:44:08 malariacontrol.net Sending scheduler request: To fetch work.
23.4.2009 03:44:08 malariacontrol.net Requesting new tasks
23.4.2009 03:44:13 malariacontrol.net Scheduler request completed: got 0 new tasks
23.4.2009 05:30:30 malariacontrol.net Sending scheduler request: To fetch work.
23.4.2009 05:30:30 malariacontrol.net Requesting new tasks
23.4.2009 05:30:35 malariacontrol.net Scheduler request completed: got 0 new tasks
23.4.2009 05:30:35 malariacontrol.net Message from server: No work sent
23.4.2009 05:30:35 malariacontrol.net Message from server: No work is available for malariacontrol.net
23.4.2009 05:30:35 malariacontrol.net Message from server: No work is available for Prediction of Malaria Prevalence
23.4.2009 06:07:50 SETI@home Sending scheduler request: To fetch work.
23.4.2009 06:07:50 SETI@home Requesting new tasks
23.4.2009 06:07:55 SETI@home Scheduler request completed: got 0 new tasks
23.4.2009 06:07:55 SETI@home Message from server: No work sent
23.4.2009 06:07:55 SETI@home Message from server: No work available for the applications you have selected. Please check your settings on the web site.
23.4.2009 06:07:55 SETI@home Message from server: CPU jobs are available, but your preferences are set to not accept them
____________
"Silakka"
Hello from Turku > Åbo.

Paul D. Buck
Message 8803 - Posted: 23 Apr 2009 | 21:16:20 UTC - in response to Message 8797.

No work sent
Full-atom molecular dynamics on Cell processor is not available for your type of computer.
cuda app exists for Full-atom molecular dynamics but no cuda work requested.


I understand this message to mean that BOINC does request work from GPU-Grid, but it does not request CUDA work (which would be extremely strange / stupid), and hence the server is not sending CUDA work.
Am I totally wrong here?

No, but the BOINC client is.

We may be chasing two bugs here. I am seeing unconstrained growth of GPU debt, which essentially causes BOINC to stop asking for work from GPU Grid (another guy on Rosetta has it stop asking for work from Rosetta, so it is not simply a GPU-side issue) ... Richard Haselgrove has been demonstrating that the client may be dry of GPU work but insists on asking for CPU work, the inverse of what it is supposed to be doing.

I am running 6.6.23, where the problem seems to be more acute than with 6.6.20, which I am running on my Q9300, where I don't seem to be seeing the same issue yet.

Sorry ETA I am not doing well and the brain is slightly mushy so I may not be as clear as usual... I keep thinking I have explained this ...

I am going to PM you my e-mail address so you can send me wake-up idiot calls (we can also Skype if you like ... hey that rhymes) ...

My bigger point is that AT THE MOMENT ... I cannot recommend either 6.6.20 or 6.6.23 wholeheartedly. 6.6.20 I am pretty sure has a bug that really causes issues on multi-GPU systems and may cause improper suspensions and long-running tasks (though it does not seem to be doing that on the Q9300 at the moment, which has a single GPU). 6.6.23 has fixes for a couple of GPU things but seems to have a broken debt issue (which MAY also exist in 6.6.20; perhaps the bug fix for one thing exposed the bug ... or the bug fix is buggy ... or the bug fix broke something else ... you get the idea) ...

Which is why I suggest that if anyone is having work fetch issues, fall back to 6.5.0, and if they go away, then stay ... or get used to resetting the debts every day or so ... (which causes other problems) ...

jrobbio
Message 8809 - Posted: 23 Apr 2009 | 23:44:33 UTC - in response to Message 8803.


We may be chasing two bugs here. I am seeing unconstrained growth of GPU debt, which essentially causes BOINC to stop asking for work from GPU Grid (another guy on Rosetta has it stop asking for work from Rosetta, so it is not simply a GPU-side issue) ... Richard Haselgrove has been demonstrating that the client may be dry of GPU work but insists on asking for CPU work, the inverse of what it is supposed to be doing.

My bigger point is that AT THE MOMENT ... I cannot recommend either 6.6.20 or 6.6.23 wholeheartedly. 6.6.20 I am pretty sure has a bug that really causes issues on multi-GPU systems and may cause improper suspensions and long-running tasks (though it does not seem to be doing that on the Q9300 at the moment, which has a single GPU). 6.6.23 has fixes for a couple of GPU things but seems to have a broken debt issue (which MAY also exist in 6.6.20; perhaps the bug fix for one thing exposed the bug ... or the bug fix is buggy ... or the bug fix broke something else ... you get the idea) ...

Which is why I suggest that if anyone is having work fetch issues, fall back to 6.5.0, and if they go away, then stay ... or get used to resetting the debts every day or so ... (which causes other problems) ...


Have you read this about GPU work fetch in 6.6.*, and also GpuSched from 6.3.*?

On the face of it, it looks to me as though this design harms those who dedicate 100% of their effort to an individual project, as the long-term debt (LTD) will eventually become too small.

If something has happened between 6.6.20 and 6.6.23, it's probably worth looking at the changesets from 17770 to 17812. I didn't see anything that struck me as obvious.

An earlier commit, 17544, which came out in 6.6.14, looks potentially interesting.

Rob

JAMC
Message 8813 - Posted: 24 Apr 2009 | 1:58:21 UTC
Last modified: 24 Apr 2009 | 2:49:23 UTC

One of my quads running XP Home, 6.5.0, and two GTX 260s has stopped requesting new work and is also spitting out these messages... manual updates with other projects suspended still request 0 new tasks... this rig has been running for many weeks without problems and is now down to 1 task running. What's up??


4/23/2009 8:50:44 PM|GPUGRID|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks
4/23/2009 8:50:49 PM|GPUGRID|Scheduler request completed: got 0 new tasks

4/23/2009 8:51:41 PM|GPUGRID|Started download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3
4/23/2009 8:51:42 PM|GPUGRID|Temporarily failed download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3: HTTP error
4/23/2009 8:51:42 PM|GPUGRID|Backing off 42 min 59 sec on download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3

4/23/2009 8:52:07 PM|GPUGRID|Started download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_2
4/23/2009 8:52:08 PM|GPUGRID|Temporarily failed download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_2: HTTP error
4/23/2009 8:52:08 PM|GPUGRID|Backing off 3 hr 42 min 3 sec on download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_2
4/23/2009 8:52:09 PM|GPUGRID|Started download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_1

4/23/2009 8:52:10 PM|GPUGRID|Temporarily failed download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_1: HTTP error
4/23/2009 8:52:10 PM|GPUGRID|Backing off 3 hr 35 min 18 sec on download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_1

4/23/2009 8:52:14 PM|GPUGRID|Started download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3
4/23/2009 8:52:15 PM|GPUGRID|Temporarily failed download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3: HTTP error
4/23/2009 8:52:15 PM|GPUGRID|Backing off 2 hr 47 min 53 sec on download of m110000-GIANNI_pYIpYV1604-7-m110000-GIANNI_pYIpYV1604-6-10-RND_3

Well, I guess it is not requesting any new work because there are 3 tasks stuck in Transfers, repeatedly retrying their downloads with 'HTTP error'??

Never mind... aborted all 3 downloads and got 3 new ones...

Paul D. Buck
Message 8831 - Posted: 24 Apr 2009 | 12:51:41 UTC

Posted this this morning:

Ok, I have a glimmer, not sure if I got it right ... but let me try to put my limited understanding down on paper and see if one of you chrome domes can straighten me out.

In the design intent (GpuWorkFetch) we have the following:

A project P is "debt eligible" for a resource R if:

• P is not backed off for R, and the backoff interval is not at the max.
• P is not suspended via GUI, and "no more tasks" is not set
Debt is adjusted as follows:

• For each debt-eligible project P, the debt is increased by the amount it's owed (delta T times its resource share relative to other debt-eligible projects) minus the amount it got (the number of instance-seconds).
• An offset is added to debt-eligible projects so that the net change is zero. This prevents debt-eligible projects from drifting away from other projects.
• An offset is added so that the maximum debt across all projects is zero (this ensures that when a new project is attached, it starts out debt-free).
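In rough pseudo-code, those three bullets amount to something like the following. This is only a sketch of the design as quoted above, not the actual BOINC implementation; the field names (debt_eligible, resource_share, instance_seconds, debt) are placeholders:

    # Sketch of the debt update described above; not the real BOINC code.
    def update_debts(projects, delta_t):
        eligible = [p for p in projects if p.debt_eligible]
        if not eligible:
            return
        share_sum = sum(p.resource_share for p in eligible)
        deltas = {}
        for p in eligible:
            owed = delta_t * p.resource_share / share_sum  # what P was owed
            deltas[p] = owed - p.instance_seconds          # minus what it got
        # offset so the net change across debt-eligible projects is zero
        offset = -sum(deltas.values()) / len(eligible)
        for p in eligible:
            p.debt += deltas[p] + offset
        # shift everything so the maximum debt across all projects is zero
        max_debt = max(p.debt for p in projects)
        for p in projects:
            p.debt -= max_debt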

What I am seeing, and my friend on GPU Grid/Rosetta is seeing, is a slow but inexorable growth of debt that eventually "chokes off" one project or another. I THINK I can explain why we are seeing different effects. His is easier.

He is dual project, Rosetta and GPU Grid. His ability to get Rosetta work is choking off.

The problem is that his debt is growing on Rosetta because of GPU Grid's lack of CPU work. So, BOINC "thinks" that GPU Grid is "owed" CPU time and is vainly trying to get work from that project. Eventually, because RS (resource share) is now biased by compute capability, the multiplier drives his debt into the dirt pretty fast, and soon he has trouble getting a queue of CPU work from Rosetta, because the client wants to get CPU work from GPU Grid to restore "balance".

I have the opposite problem for the same reason. But mine is because I have 4 GPUs in an 8-core system, so my bias is in the other direction ... eventually driving my GPU debt out of whack, because I am accumulating GPU debt against all 30 other projects ...

My Q9300 sees less of this because the quad core is likely fairly balanced against the GTX280 card, so the debt driver is acting more slowly because the GPU is fast enough that the debts stay sort of in balance (best guess). Or, to put it another way, the 30 projects are building up GPU debt at about the same rate that GPU Grid is running up CPU debt in the other direction ... sooner or later, though, I do hit walls there and have had to hit debt reset to get back in balance.

This may ALSO partly explain Richard's observation on nil calls to projects (which I also see) where the system is trying manfully to get work from a project that cannot supply it. In my case it is often a call to GPU Grid to get CPU work. Not going to happen.

Not sure how to cure this, in that I think there are at LEAST two problems buried in there, if not three.

In effect, we really, really need to track which projects supply CPU work, which supply GPU work, and which supply both ... and by that I mean the ones that the participant has allowed. So, the GPU debt for me should only reflect activity on GPU Grid, my sole attached project with GPU work, and GPU Grid should never be accumulating CPU debt.

Sleepy_63
Message 8841 - Posted: 24 Apr 2009 | 14:50:33 UTC
Last modified: 24 Apr 2009 | 15:01:08 UTC

I've been getting the 'No cuda work requested' messages too. It has been days since I got a GPUGrid WU, but SETI CUDA is running fine, so I knew my hardware and drivers were okay.

I reset the GPUGrid project and immediately got workunits.

FWIW.

Edit: but all is not well. With a quad-core CPU and a dual-GPU card (Nvidia 9800GX2), I should be running 6 tasks: 4 CPU and 2 CUDA. It just paused a SETI CUDA task to run the GPUgrid one, leaving only 5 tasks active... <sigh>

uBronan
Message 8847 - Posted: 24 Apr 2009 | 16:07:01 UTC
Last modified: 24 Apr 2009 | 16:07:58 UTC

This 6.6.20 problem may mostly affect people running multiple GPU setups

Sadly no; on my single-GPU system I have the same, although my system just finishes a unit, sends it, and then receives a new one, or sometimes I receive 4 new ones which probably get cancelled sooner or later by the server ;)
So the issue is more widespread and seems to affect more projects, but those projects send many more units and/or allow longer deadlines or run much longer.
That makes them have fewer problems than the GPUGRID project, which is time critical.

Paul D. Buck
Message 8862 - Posted: 24 Apr 2009 | 19:35:54 UTC - in response to Message 8841.

I've been getting the 'No cuda work requested' messages too. It has been days since I got a GPUGrid WU, but SETI CUDA is running fine, so I knew my hardware and drivers were okay.

I reset the GPUGrid project and immediately got workunits.

FWIW.

Edit: but all is not well. With a quad-core CPU and a dual-GPU card (Nvidia 9800GX2), I should be running 6 tasks: 4 CPU and 2 CUDA. It just paused a SETI CUDA task to run the GPUgrid one, leaving only 5 tasks active... <sigh>

6.6.20 and above are still works in progress. I did not, and do not think 6.6.20 was ready for prime time. It works, mostly, but it actually does not work as well as 6.5.0 IMO ... especially when you have more than one GPU in the system.

When you did a project reset you reset the debt on the one project. The problem is that you did not reset the debts on the other projects. To clear up most of the scheduling problems when you have anomalies like this you need to use the debt reset flag in the cc_config file, stop and restart the client (reading config will not reset debts). Be sure to change the flag back to 0 after you stop and restart.
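For reference, a minimal cc_config.xml with the debt-reset flag set might look like the sketch below. The file goes in the BOINC data directory and is read at client startup, so stop and restart the client, then set the flag back to 0 (or remove the line) afterwards:

    <cc_config>
      <options>
        <zero_debts>1</zero_debts>
      </options>
    </cc_config>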

6.6.23 actually seems to be worse on the debt management. 6.6.24 seems to insist that the number 2 GPU is not like all the others and refuses to use it regardless of how identical it is ...

The fix in 6.6.24 to address excessive task switching also did not clear up the problem though it may have addressed a bug that exaggerated the problem (or may have been inconsequential).

Waiting on 6.6.25 ...

Seriously, if you are having problems with work fetch drop back to 6.5.0 ... the only thing you lose is some debug message improvements and the change to time tracking (you can't see how long the task has to run correctly).

ExtraTerrestrial Apes
Message 8894 - Posted: 25 Apr 2009 | 12:33:40 UTC - in response to Message 8862.

Thanks for your effort to put this bug report together. If the developers are not totally blind and have not put you on their ignore lists, they should be able to see that this is not just chatting, it's a real problem.

And I guess sometimes you hate to be right.. you said many times 6.6.20 was not ready (and from the various reports it clearly wasn't) and has serious debt issues. Well, that's what we're seeing now.

/me is sticking to 6.5.0 a while longer.

MrS
____________
Scanning for our furry friends since Jan 2002

MarkJ
Message 8902 - Posted: 25 Apr 2009 | 13:10:34 UTC - in response to Message 8797.
Last modified: 25 Apr 2009 | 13:11:37 UTC

No work sent
Full-atom molecular dynamics on Cell processor is not available for your type of computer.
cuda app exists for Full-atom molecular dynamics but no cuda work requested.


I understand this message to mean that BOINC does request work from GPU-Grid, but it does not request CUDA work (which would be extremely strange / stupid), and hence the server is not sending CUDA work.
Am I totally wrong here?

MrS


If you turn on the cc_config flag <sched_op_debug> you will see what it's requesting. It is not a bug.

The BOINC 6.6 series makes 2 requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should make another request for CUDA work, which the project can provide.

If you have recently upgraded to the 6.6 client, I would suggest you reset the debts. This can be done using the cc_config flag <zero_debts>.
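For reference, <sched_op_debug> is a logging flag, so it lives in the <log_flags> section of cc_config.xml, while <zero_debts> goes under <options> as sketched earlier in the thread. A minimal example for the logging side:

    <cc_config>
      <log_flags>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
    </cc_config>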
____________
BOINC blog

Michael Goetz
Message 8905 - Posted: 25 Apr 2009 | 13:20:06 UTC - in response to Message 8902.

If you have recently upgraded to the 6.6 client I would suggest you reset the debts. This can be done using the cc_config flag <zero_debts>


And if you do this, don't forget to remove it after you restart BOINC. If left in there, the debts will be reset every time BOINC starts.

ExtraTerrestrial Apes
Message 8917 - Posted: 25 Apr 2009 | 15:00:16 UTC - in response to Message 8905.
Last modified: 25 Apr 2009 | 15:00:45 UTC

If left in there, the debts will be reset every time BOINC starts.


Which actually might be a good idea. Ignore the debts and treat the resource share as an approximation.. don't stick to the code, they're more like guidelines anyway ;)

The BOINC 6.6 series makes 2 requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should make another request for CUDA work, which the project can provide.


Wouldn't that completely screw up the scheduling? BOINC would quickly assign a massive CPU debt to GPU-Grid, which can never be reduced as there is no CPU client? Which would in turn screw up the scheduling of all the other CPU projects?

This assumes there are separate debts for cpus and coprocessors. If this is not the case.. well, the entire debt system is screwed anyway and can by definition not work.

(.. please don't take this as a personal offense, I'm just thinking a little further ahead ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 8934 - Posted: 25 Apr 2009 | 18:12:25 UTC - in response to Message 8917.

This assumes there are separate debts for cpus and coprocessors. If this is not the case.. well, the entire debt system is screwed anyway and can by definition not work.

The new debt system is supposed to track two debt levels for each project. The problem is that if you have only one project of one resource class you can and will get unconstrained growth of the debt for that resource.

I get it for GPU Grid on my i7 (running 6.6.23 *NOT RECOMMENDED*; I do not recommend running 6.6.23 or 6.6.24, NOTE I AM TESTING ... and 6.6.23 is PAINFUL for GPU Grid ... YMMV) ...

Another participant, running only GPU Grid and Rosetta@Home, gets it for Rosetta ...

See the fuller discussion in the thread BOINC v6.6.20 scheduler issues (most specifically Message 60808), or the BOINC Alpha and Dev mailing lists for an even fuller discussion.

The net effect is that you stop getting a full queue of tasks for the one resource.

Sadly, even in the face of providing them lots of logs and other data I am not sure they have even started looking at this problem.

The good news is that they are finally starting to take seriously a problem I pointed out when I first saw it around 2005, when I bought my first dual Xeons with HT (the first quad-CPU systems); it is now a killer on 8-CPU systems ... especially if you also add in multiple GPUs ...

The system I am considering building this summer will be at least an i7 and I hope to put at least 3 GTX295 cards into it ... making it have 8 CPUs and 6 GPUs ... 14 processors ... an alternative is a dual Xeon again ... that would be 16 CPUs and 6 GPUs (or 8 with 4 PCI-e slots) ... that will make the problem I noted a real killer ...

MarkJ
Message 8940 - Posted: 26 Apr 2009 | 1:16:44 UTC - in response to Message 8917.

If left in there, the debts will be reset every time BOINC starts.


Which actually might be a good idea. Ignore the debts and treat the resource share as an approximation.. don't stick to the code, they're more like guidelines anyway ;)

The BOINC 6.6 series makes 2 requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should make another request for CUDA work, which the project can provide.


Wouldn't that completely screw up the scheduling? BOINC would quickly assign a massive CPU debt to GPU-Grid, which can never be reduced as there is no CPU client? Which would in turn screw up the scheduling of all the other CPU projects?

This assumes there are separate debts for cpus and coprocessors. If this is not the case.. well, the entire debt system is screwed anyway and can by definition not work.

(.. please don't take this as a personal offense, I'm just thinking a little further ahead ;)

MrS


It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like Seti, which use both types of resource, it is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be some check on a resource's debt growing too much, but it doesn't seem to work.

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20, though not substantially. Now if they can fix these issues it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.
____________
BOINC blog

Paul D. Buck
Message 8942 - Posted: 26 Apr 2009 | 1:28:20 UTC - in response to Message 8940.

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20, though not substantially. Now if they can fix these issues it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.

Um, did not think I was being pessimistic ... I thought rational was more like it ... but Ok ... :)

If 6.6.23 or .24 works for you ... cool ... .23 *IS* better than .20 in my opinion, though if you run single project as I do it seems to have the debt problem. If you don't mind resetting debts on occasion then go for it.

The main improvements in .23 had to do with initialization crashes and CUDA task switches which were not handled properly. What I saw on .20 was that at times the tasks took twice as long to run. I have not seen that at all on .23 ... and I have been running the heck out of .23 on the i7 ... but, 24-48 hours later, I can't get 4 queued tasks from GPU Grid ... reset debts and I am good to go ...


In .24 there is a huge mistake of some kind and my second of 4 GPUs is suddenly not the same as the others ... in that it is always the second of the GPUs, which sounds like a bug to me ... not sure where ... I suggested a change to print out the exact error; let's see if they pick that up ... and/or find the real problem (I looked and saw nothing that leaped out at me ... but I am not a C programmer).

For me to notice someone trying to offend me you have to be at Dick Cheney level of effort to get me to even notice you are trying ... so, I don't do offended... :)

And so, one of the reasons I don't understand why others do ... thankfully you don't ... :)

Now if others would be so reasonable ...

MarkJ
Message 8943 - Posted: 26 Apr 2009 | 2:26:58 UTC - in response to Message 8942.

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20 but not substantially. Now if they can fix them it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.

Um, did not think I was being pessimistic ... I thought rational was more like it ... but Ok ... :)

If 6.6.23 or .24 works for you ... cool ... .23 *IS* better than .20 in my opinion, though if you run single project as I do it seems to have the debt problem. If you don't mind resetting debts on occasion then go for it.

The main improvements in .23 had to do with initialization crashes and CUDA task switches which were not handled properly. What I saw on .20 was at times the tasks took twice as long to run. Have not seen that at all on .23 ... and I have been running the heck out of .23 on the i7 ... but, 24-48 hours later, I can't get 4 queued tasks from GPU Grid ... reset debts and I am good to go ...


In .24 there is a huge mistake of some kind and my second of 4 GPUs is suddenly not the same as the others ... in that it is always the second of the GPUs, which sounds like a bug to me ... not sure where ... I suggested a change to print out the exact error; let's see if they pick that up ... and/or find the real problem (I looked and saw nothing that leaped out at me ... but I am not a C programmer).

For me to notice someone trying to offend me you have to be at Dick Cheney level of effort to get me to even notice you are trying ... so, I don't do offended... :)

And so, one of the reasons I don't understand why others do ... thankfully you don't ... :)

Now if others would be so reasonable ...


I haven't had to reset debts on any of my machines, but I don't run a single project. I usually have 3 (or when Einstein went off last week 4) running.

.23 seemed to have fixed the never-ending GPUgrid wu bug.

Apart from the debugging messages, .24 doesn't seem to correct anything. But then I've only got it installed on a single-GPU machine because of the "can't find 2nd GPU" bug.
____________
BOINC blog

Paul D. Buck
Message 8963 - Posted: 26 Apr 2009 | 21:22:36 UTC - in response to Message 8943.

I haven't had to reset debts on any of my machines, but I don't run a single project. I usually have 3 (or when Einstein went off last week 4) running.

.23 seemed to have fixed the never-ending GPUgrid wu bug.

Apart from the debugging messages, .24 doesn't seem to correct anything. But then I've only got it installed on a single-GPU machine because of the "can't find 2nd GPU" bug.

The problem is not running a single project; it is running only a single project of a particular resource class. And I also think that the speed of the system plays a part in how fast the debts get out of whack.

I run 6.6.20 on the Q9300 and it has a single GPU and does not seem to get into trouble that fast. The i7 on the other hand only lasts a day or so before the GPU Grid debt is so out of whack that I have to reset it so that I can keep 4 tasks in the queue. If I don't reset the debts, well, pretty soon all I have is the tasks running on the 4 GPUs. It is possible if I had only one or two GPUs in the system that it would not get out of whack so fast ... but ...

The change in .24 was in response to some discussions on the lists about the asymmetry of GPUs ... I think the decision was wrong and hope we can get some reasonableness going ... but so far there has been no acknowledgement that this is a bad choice ... hopefully Dr. Korpela at SaH will speak up, and the PM types here too ... if they don't, the chances of getting the change backed out are lower (note they can also send silent e-mails directly to Dr. A) ...

ExtraTerrestrial Apes
Message 9020 - Posted: 27 Apr 2009 | 21:21:35 UTC - in response to Message 8940.

It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like Seti, which use both types of resource, it is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be some check on a resource's debt growing too much, but it doesn't seem to work.


Thanks for explaining. Still looks stupid: if someone has a CUDA device and is attached to 50 cpu projects, then 6.6.2x will continue to request GPU work from all of them? I really hope the new versions of the server software feature some flag to tell the clients which work they can expect from them..

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9034 - Posted: 27 Apr 2009 | 21:51:25 UTC - in response to Message 9020.

It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like Seti, which use both types of resource, it is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be some check on a resource's debt growing too much, but it doesn't seem to work.


Thanks for explaining. Still looks stupid: if someone has a CUDA device and is attached to 50 cpu projects, then 6.6.2x will continue to request GPU work from all of them? I really hope the new versions of the server software feature some flag to tell the clients which work they can expect from them..

And that is exactly what happens.

In the last 26 hours I have hit the 50-some projects I am attached to with 800-some requests; most of them are probably asking for CUDA work, because my GPU debt is high and climbing given that I am only attached to GPU Grid for GPU work.

What they are relying on is the "back-off" mechanism, with the assumption that the number of requests is nominal. The problem is that a DoS attack is also made of very small requests, just made lots of times. Multiply my 800 requests by 250,000 participants and pretty soon you are talking some real numbers.
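(To put a rough number on that last point, assuming every host behaved like mine: 800 requests in about 26 hours times 250,000 hosts is 200,000,000 scheduler requests, on the order of 2,000 requests per second hitting the projects.)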

I have TWO threads going on the alpha list right now about this type of lunacy, where, surprisingly, John McLeod VII is arguing for policies that are a waste of time and lead to system instability because the cost of doing the policy is "low". The trouble is that, in reality, the numbers are not as low as he insists they are ...

Worse, the side effect of this obsessive checking (as often as once every 10 seconds or faster) is that the logs get so full of noise that you cannot find meaningful instances of the problems you are trying to cure.

Just because you can do something does not mean that you should. A lesson the BOINC developers have chosen not to learn yet. I would point out that the latest spate of server outages took place shortly after 6.6.20 was made the standard... coincidence? Maybe, maybe not ...

But why they are so blasé about adding to the load on the schedulers is beyond me ...

My latest post on DoS and 6.6.20+:
Ok,

Related to the debt issue of these later clients with GPU work shortfall, there is a side issue.

I turned on sched_op_debug and have been watching my i7 mount a DoS attack on most projects, trying to get CUDA work from projects that don't have any and are not likely to have any anytime soon.

So, my GPU debt is climbing for AI ... but the AI project does not have a CUDA application, so I ping the server, it backs me off, I ping again 7 seconds later, and slowly back off ... the problem is that with sufficient 6.6.x clients all doing this ... well ... DoS attack ...

This is another case of too much of a good thing being bad for the system as a whole.

The assumption that the rates are low and the cost is low ignores the fact that it is not necessary ... and things that are not necessary should not be done regardless of how low we think the cost might be ...

I suggested earlier this week, well, last calendar week, that we add a flag from project preferences that would block the request of CUDA work unless it was explicitly set by the project.

For the moment that keeps the number of projects that have to make a server-side change low (SaH, SaH Beta, GPU Grid and The Lattice Project). These would add a project "preference" indicating that GPU work is allowed, much like the preference setting on SaH ... GPU Grid would not SHOW the setting, as it is meaningless to do so ... but the client would NOT issue GPU work requests to projects without this flag set.

This will stop the mounting DoS attacks on the servers, and lower the frequency of mindless CPU scheduling events ...



And my prior:

Perhaps we should make the flags explicit in the system side revision where:

<cpu>1
<gpu>1

Have to be set specifically. (assume CPU=1)

Then these flags could be used to control debt allocation. GPU Grid would be of course:

<cpu>0
<gpu>1

Prime Grid (at the moment)

<cpu>1
<gpu>0

and so on ...

If not explicitly set by the project the assumption would be:

<cpu>1
<gpu>0


Of course the most depressing thing is that as John explicitly said, if I keep saying things that "they" don't want to hear, "they" are going to keep ignoring me ... my reply was, of course, just because I am saying things that he, and others, might not want to hear does not make me wrong ... nor will ignoring problems make them go away ...

ExtraTerrestrial Apes
Message 9082 - Posted: 28 Apr 2009 | 20:50:05 UTC - in response to Message 9034.
Last modified: 28 Apr 2009 | 20:53:23 UTC

Wow, now that makes me want to scream.

Many of our far too many software issues happen because people don't plan properly. They didn't plan to include features which later on become necessary and everything becomes a mess when these features are "hacked in". In our case the BOINC devs actually have the benefit of knowing what they will need: the ability to handle a heterogeneous landscape with different coprocessors.

Do any of their current changes factor ATIs and Larrabees in, even remotely? Or the different CUDA hardware capabilities? It doesn't look like it, judging by the way the term "the GPU" is used, as if there was only one kind.

Just imagine what 10 possible coprocessors will do to these DoS attacks: each host issuing 10 requests every ~10 s to each project it's attached to? How can one even remotely like this idea? Sure, currently the requests can be handled, but why take the risk of letting this grow out of hand and invest time struggling to fix the side effects this has on the local scheduler?!

I'm not a real software developer, but I know that if you do things the quick & dirty way, many of them are going to bite you in the a**...

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9092 - Posted: 28 Apr 2009 | 23:35:32 UTC - in response to Message 9082.

Wow, now that makes me want to scream.

It does not help me much either. Being suicidally depressed as a normal state with medication not being effective, I really don't need the aggravation.

When they were first asking about the 6.6.x to 6.8.x versions, we (Richard Haselgrove, Nicholas Alveres (sp?), and a few others; sorry guys, I forgot the full list) made a lot of suggestions ... as I said before, none of them were considered.

Now we see issues with work fetch and resource scheduling, to the point where my system is bordering on chaos ... I cannot imagine what a 16-CPU system with 6 GPU cores will look like. Though there is some glimmer that they see there is an issue, it is the same "let's tinker with the rules and not make any big changes" approach.

Sadly, I know from experience that this will not work. Yes, they may be able to fake it for some more time, but it would be better and cleaner to start anew.

Theory says that they left room for future GPU and other co-processor types in the mix. Nick does not think that they virtualized enough, and though I cannot read the code well enough to know for sure (I don't do C well, and hacked-up C++ even less well), it sure does not look like he is wrong.

The issue is that none of them are systems engineers (I was); they don't really consider, or know, the issues, and they charge on with the courage of their convictions that because they can hack together code they know what they are doing. The courage and skill of amateurs.

At one point I specialized in database design, and most people don't know that there are three types of DBAs or database specialists ... the logical designer is the one interested in the data life-cycle and data models (that's what I did) and is generally not interested in or concerned about speed or efficiency (what I mean is that this is not a primary concern, though you do know what will make the system fast or slow).

Completed database models are implemented and tuned by a Systems DBA (a class of DBA most people have never met; there just are not that many of them around). This guy tunes the hardware and system software (he may even select and buy it specifically for the data model to be implemented) and creates things like table spaces and lays the data out on the physical media. Backups and all the system stuff are designed by this guy.

The third guy is the type of DBA most people know about. He knows a lot of stuff but is mostly concerned with the day-to-day operation of the database. Though he may know about making tables and putting them on disks ... well ... it is an art and few do it well ...

What is the point of all this? BOINC's database was put together by the third kind of DBA and amateurs ... it is one of the reasons that the databases are so fragile ... and crash so often ... I was doing BOINC while I was still working, and I showed the data model to a systems DBA I knew; he thought it was as poor a design as I did ...

Anyway, the study of logical database design for relational databases has a point to it ... ignore the "rules" at your peril ... and we can see the result of the choices made ...

Anyway, I sent in a pseudo-code outline of what I think should be done for Resource Scheduling so we can solve that problem, which is coming up on 5 years old now ... I will tackle the work fetch and DoS issues 5 years from now when they (finally) agree that it is an issue ... if history is a guide ... RH and I, though, are trying to bring it up along with other work fetch issues in 6.6.23, .24, and now .25 ...


