Message boards : Graphics cards (GPUs) : Client detached by itself?

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 15837 - Posted: 19 Mar 2010 | 19:49:58 UTC

2 WU:
2017815
2016525

But I got a record from BoincTasks that they were completed successfully. What's going on?


____________

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16133 - Posted: 2 Apr 2010 | 22:27:24 UTC
Last modified: 2 Apr 2010 | 22:41:38 UTC

It's happening again!



There are no error messages in the logs.

edit:
Something is wrong with the server.

Here's what happened on my side after a WU finished successfully and uploaded fine:
04/02/2010 6:33:05 PM GPUGRID Computation for task p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1 finished
04/02/2010 6:33:06 PM GPUGRID Starting a158-TONI_HERG77a-17-100-RND0346_3
04/02/2010 6:33:06 PM GPUGRID Starting task a158-TONI_HERG77a-17-100-RND0346_3 using acemd2 version 603
04/02/2010 6:33:06 PM GPUGRID Sending scheduler request: To fetch work.
04/02/2010 6:33:06 PM GPUGRID Requesting new tasks for GPU
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
04/02/2010 6:33:11 PM GPUGRID Scheduler request completed: got 0 new tasks
04/02/2010 6:33:11 PM GPUGRID Message from server: No work sent
04/02/2010 6:33:11 PM GPUGRID Message from server: No work is available for ACEMD - GPU molecular dynamics
04/02/2010 6:33:11 PM GPUGRID Message from server: No work is available for ACEMD beta version
04/02/2010 6:33:11 PM GPUGRID Message from server: (reached limit of 4 GPU tasks in progress)
04/02/2010 6:33:11 PM GPUGRID Message from server: Project has no jobs available
04/02/2010 6:33:26 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:26 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_2
04/02/2010 6:33:29 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
04/02/2010 6:33:29 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_3
04/02/2010 6:33:43 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_2
04/02/2010 6:33:43 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_7
04/02/2010 6:33:44 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_7
04/02/2010 6:33:45 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_3
04/02/2010 6:33:46 PM GPUGRID Sending scheduler request: To report completed tasks.
04/02/2010 6:33:46 PM GPUGRID Reporting 1 completed tasks, requesting new tasks for CPU and GPU
04/02/2010 6:33:51 PM GPUGRID Scheduler request completed: got 1 new tasks
04/02/2010 6:33:53 PM GPUGRID Started download of h200-TONI_CAPBIND1-13-LICENSE

Can someone take a look at what's happening?

I also had this issue on the other machine yesterday.
WUs:
2079037
2078443
2077898
2076113
____________

Profile Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 186
Message 16135 - Posted: 3 Apr 2010 | 0:47:40 UTC - in response to Message 16133.

Just had to do a manual update to get more tasks. Something is not quite right!

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 16136 - Posted: 3 Apr 2010 | 1:10:27 UTC - in response to Message 16135.

Boinc Advanced View,
Activity,
Activity is always available???

I think this changed by itself on one of my systems a couple of days ago; I don't know why. The other systems were OK. Network issues, or a freak event?
You may want to check all your network settings, if you have not already done so.

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16137 - Posted: 3 Apr 2010 | 1:30:48 UTC
Last modified: 3 Apr 2010 | 1:56:59 UTC

All my systems are fine. Like I said, something is happening on the server.

To make it clear:
Server (as shown on the website): says the client is detached
Client (using BOINC Manager or BoincTasks): says it's running and ready to run

-Do a manual update to sync with the server: nothing happens
-Client uploads a completed WU: the server accepts it, but the website still says the client is detached

Now, where does my completed WU go? Oblivion?

I think it's similar to a "ghost WU" (listed on the website but not on the client), but the opposite (listed on the client but not on the website).
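The two mismatch cases described above can be told apart by comparing the task list shown on the website with the one held by the client. Below is a minimal Python sketch of that comparison. The function name, the label "orphan" for the client-only case, and the task names are all made up for illustration; nothing here talks to BOINC or to the project server.

```python
def compare_task_lists(server_tasks, client_tasks):
    """Sort task names into three groups by where they are listed."""
    server, client = set(server_tasks), set(client_tasks)
    return {
        # Listed on the website but not on the client ("ghost WU").
        "ghost": sorted(server - client),
        # Listed on the client but not on the website (the case in this thread).
        # "orphan" is a hypothetical label, not BOINC terminology.
        "orphan": sorted(client - server),
        # Listed on both sides.
        "in_sync": sorted(server & client),
    }


# Hypothetical task names, for illustration only.
server_side = ["task_A", "task_B"]
client_side = ["task_B", "task_C"]

result = compare_task_lists(server_side, client_side)
for group, names in result.items():
    print(group, names)
```

With the sample data, task_A comes out as a ghost and task_C as client-only.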

Edit:
This is the only error I've found during the time window (times are UTC-4):

02-Apr-2010 15:51:33 [GPUGRID] Sending scheduler request: To fetch work.
02-Apr-2010 15:51:33 [GPUGRID] Requesting new tasks for GPU
02-Apr-2010 15:56:49 [---] Project communication failed: attempting access to reference site
02-Apr-2010 15:56:50 [---] Internet access OK - project servers may be temporarily down.
02-Apr-2010 15:56:52 [GPUGRID] Scheduler request failed: Timeout was reached
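For anyone who wants to search a longer log for the same failure, here is a rough Python sketch that scans text in the format of the excerpt above and lists the "Scheduler request failed" lines. The function name is hypothetical, the regular expression is only fitted to the five lines shown here, and the month abbreviation is parsed assuming an English locale.

```python
import re
from datetime import datetime

# Fitted to lines like: 02-Apr-2010 15:56:52 [GPUGRID] Scheduler request failed: ...
LINE_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$"
)


def find_failed_requests(log_text):
    """Return (timestamp, project, message) for each failed scheduler request."""
    failures = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        stamp, project, message = match.groups()
        if message.startswith("Scheduler request failed"):
            when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
            failures.append((when, project, message))
    return failures


sample_log = """\
02-Apr-2010 15:51:33 [GPUGRID] Sending scheduler request: To fetch work.
02-Apr-2010 15:51:33 [GPUGRID] Requesting new tasks for GPU
02-Apr-2010 15:56:49 [---] Project communication failed: attempting access to reference site
02-Apr-2010 15:56:50 [---] Internet access OK - project servers may be temporarily down.
02-Apr-2010 15:56:52 [GPUGRID] Scheduler request failed: Timeout was reached
"""

failures = find_failed_requests(sample_log)
for when, project, message in failures:
    print(when.isoformat(), project, message)
```

The timestamps it returns are in the client's local time, so they still need to be shifted before comparing them with server-side logs.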

____________

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 16138 - Posted: 3 Apr 2010 | 8:21:35 UTC

Do you use an account manager?

There was a similar complaint from someone recently, but it turned out to be their account manager settings.
____________
BOINC blog

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16140 - Posted: 3 Apr 2010 | 8:46:30 UTC - in response to Message 16138.

> Do you use an account manager?
>
> There was a similar complaint from someone recently, but it turned out to be their account manager settings.

Nope, not using one.
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 16149 - Posted: 3 Apr 2010 | 17:32:48 UTC - in response to Message 16140.
Last modified: 3 Apr 2010 | 17:46:53 UTC

I'm trying to figure out what's wrong.

May have to do with http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?f=3&t=853&start=15

The server logs say that your CPID changed at some point. Are you running multiple projects and, if so, do the e-mail addresses match?

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16150 - Posted: 3 Apr 2010 | 18:17:50 UTC - in response to Message 16149.

> The server logs say that your CPID changed at some point. Are you running multiple projects and, if so, do the e-mail addresses match?

I'm running WCG for CPU and SETI as backup for GPU. The same email address is used for all 3 projects. And I think the CPID only changes if I change my email, which I didn't. It's mind-boggling! Now I have to read the link you posted to gain some insight.
____________

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16151 - Posted: 3 Apr 2010 | 19:00:54 UTC

Can you confirm that the time of my "Scheduler request failed: Timeout was reached" error matches the time the server logs say my CPID changed? It could be that the server incremented my rpcno while the client did not.

That is based on the 2 posts I read at the link you provided:

p.s.: Scheduler request failed: Timeout was reached might cause it (even though I do not think so), the server increments the RPC seqno, the client does not because it assumes that the server didn't receive the request. On a fetch request this causes "ghost WUs", WUs that you have in your list in your server side profile but never had them on your BOINC host.

We see a 'hanging' TCP connection that somehow stays open, or is reopened by chance, and that will then sport a different 'server contacts #' flag, which is why the server discards the host.
So, if you are a candidate for a hanging connection, you will need to contact the server in the meantime to 'qualify' for a detach;
The detach itself will take place on an update to the server, ... ok... now it fits...
but only in a very rare among the rare conditions the detach will be executed on the result-report-update send to the server.
Meaning, the pick-up of the hanging connection actually takes place in the connection designated to tell the server about the finished results. Very, very, very rare.

____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 16152 - Posted: 3 Apr 2010 | 20:01:12 UTC - in response to Message 16151.

So it was an odd networking problem in which a network service may have failed (probably on the server, but perhaps on a router). If it had been client-side, a simple restart would have resolved it for subsequent tasks. Perhaps this resulted in a partial database entry or a corruption.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 16160 - Posted: 4 Apr 2010 | 18:52:56 UTC - in response to Message 16152.

Did you try to stop and restart the client?

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16161 - Posted: 4 Apr 2010 | 20:27:45 UTC - in response to Message 16160.

Yes, I did.
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 16167 - Posted: 5 Apr 2010 | 10:11:56 UTC - in response to Message 16161.

> Can you confirm the time I got from "Scheduler request failed: Timeout was
> reached" and the server logs that says my CPID has changed?

Yes.

Is the problem solved now?

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16169 - Posted: 5 Apr 2010 | 10:45:55 UTC - in response to Message 16167.

> > Can you confirm the time I got from "Scheduler request failed: Timeout was
> > reached" and the server logs that says my CPID has changed?
>
> Yes.
>
> Is the problem solved now?

Actually, it resolved itself. The client crunched the 4 WUs that had been detached and then got a new one (like normal crunching). The only thing lost is the crunching time. Luckily, on the 2nd rig I caught it, so I aborted the ghost WU and started anew.

So the cause was the rpcno being out of sync? Maybe it's a good idea to enable re-sends?
____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 16175 - Posted: 5 Apr 2010 | 17:17:39 UTC - in response to Message 16169.

Wouldn't a project reset do that?
