2 WUs:
2017815
2016525
But I got a record from BoincTasks that they were completed successfully. What's going on?
It's happening again!
There are no error messages in the logs.
Edit:
Something is wrong with the server.
Here's what happened on my side after a WU finished successfully and uploaded fine:
04/02/2010 6:33:05 PM GPUGRID Computation for task p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1 finished
04/02/2010 6:33:06 PM GPUGRID Starting a158-TONI_HERG77a-17-100-RND0346_3
04/02/2010 6:33:06 PM GPUGRID Starting task a158-TONI_HERG77a-17-100-RND0346_3 using acemd2 version 603
04/02/2010 6:33:06 PM GPUGRID Sending scheduler request: To fetch work.
04/02/2010 6:33:06 PM GPUGRID Requesting new tasks for GPU
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
04/02/2010 6:33:11 PM GPUGRID Scheduler request completed: got 0 new tasks
04/02/2010 6:33:11 PM GPUGRID Message from server: No work sent
04/02/2010 6:33:11 PM GPUGRID Message from server: No work is available for ACEMD - GPU molecular dynamics
04/02/2010 6:33:11 PM GPUGRID Message from server: No work is available for ACEMD beta version
04/02/2010 6:33:11 PM GPUGRID Message from server: (reached limit of 4 GPU tasks in progress)
04/02/2010 6:33:11 PM GPUGRID Message from server: Project has no jobs available
04/02/2010 6:33:26 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:26 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_2
04/02/2010 6:33:29 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
04/02/2010 6:33:29 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_3
04/02/2010 6:33:43 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_2
04/02/2010 6:33:43 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_7
04/02/2010 6:33:44 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_7
04/02/2010 6:33:45 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_3
04/02/2010 6:33:46 PM GPUGRID Sending scheduler request: To report completed tasks.
04/02/2010 6:33:46 PM GPUGRID Reporting 1 completed tasks, requesting new tasks for CPU and GPU
04/02/2010 6:33:51 PM GPUGRID Scheduler request completed: got 1 new tasks
04/02/2010 6:33:53 PM GPUGRID Started download of h200-TONI_CAPBIND1-13-LICENSE
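To check a log like the one above for uploads that started but never finished before the task was reported, the "Started upload" and "Finished upload" lines can be paired up. This is a minimal sketch that assumes the event-log line format shown above (date, time, AM/PM, project name, message); `pending_uploads` is a hypothetical helper, not part of BOINC or BoincTasks.

```python
import re
from typing import List

# Assumed line format, taken from the log above:
# "MM/DD/YYYY H:MM:SS AM|PM <project> <message>"
LINE = re.compile(r"^(\d\d/\d\d/\d{4} \d{1,2}:\d\d:\d\d [AP]M) (\S+) (.*)$")
STARTED = "Started upload of "
FINISHED = "Finished upload of "


def pending_uploads(log_text: str) -> List[str]:
    """Return files whose upload started but never finished, in start order."""
    started: List[str] = []
    finished = set()
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that don't look like event-log entries
        msg = m.group(3)
        if msg.startswith(STARTED):
            started.append(msg[len(STARTED):])
        elif msg.startswith(FINISHED):
            finished.add(msg[len(FINISHED):])
    return [name for name in started if name not in finished]


# A few lines copied from the log above.
SAMPLE = """\
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:07 PM GPUGRID Started upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
04/02/2010 6:33:26 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_0
04/02/2010 6:33:29 PM GPUGRID Finished upload of p12-IBUCH_05012_pYEEI_100319-3-80-RND5598_1_1
"""

print(pending_uploads(SAMPLE))
```

Run over the full log above, the list comes back empty: every file (_0, _1, _2, _3 and _7) finishes uploading before the "Reporting 1 completed tasks" line, which fits the report that the WU uploaded fine.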
Can someone take a look at what's happening?
I also had this issue on the other machine yesterday.
WUs:
2079037
2078443
2077898
2076113
Just had to do a manual update to get more tasks. Something is not quite right!
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
BOINC Manager, Advanced View, Activity menu:
is network activity set to "always available"?
I think this changed by itself on one of my systems a couple of days ago; I don't know why. My other systems were OK. Network issues, or a freak event?
You may want to check all your network settings, if you have not already done so.
All my systems are fine. Like I said, something is happening on the server.
To make it clear:
Server (as shown on the website): says the client is detached.
Client (BOINC Manager or BoincTasks): says the tasks are running and ready to run.
- I do a manual update to sync with the server: nothing happens.
- The client uploads a completed WU and the server accepts it, but the website still says the client is detached.
Now, where does my completed WU go? Oblivion?
I think it's similar to a "ghost WU" (listed on the website but not on the client), but the opposite (listed on the client but not on the website).
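The mismatch described here comes down to comparing two task lists: what the website has recorded for the host and what the client actually holds. This is a minimal sketch with hypothetical task names; it only classifies the two lists and makes no claim about how the GPUGRID server reconciles them.

```python
from typing import Dict, Iterable, List


def compare_task_lists(server_tasks: Iterable[str],
                       client_tasks: Iterable[str]) -> Dict[str, List[str]]:
    """Split tasks by where they are listed: website only, client only, or both."""
    server, client = set(server_tasks), set(client_tasks)
    return {
        # Listed on the website but not on the client: the classic "ghost WU".
        "ghost": sorted(server - client),
        # Held by the client but unknown to the website: the opposite case.
        "client_only": sorted(client - server),
        # Known to both sides.
        "in_sync": sorted(server & client),
    }


# Hypothetical task names, for illustration only.
print(compare_task_lists(["wu_a", "wu_b"], ["wu_b", "wu_c"]))
```

In the situation described above, the tasks still crunching on the client would land in `client_only`, while an aborted ghost WU on the other rig would have shown up under `ghost`.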
Edit:
This is the only error I've found in that time window (UTC-4):
02-Apr-2010 15:51:33 [GPUGRID] Sending scheduler request: To fetch work.
02-Apr-2010 15:51:33 [GPUGRID] Requesting new tasks for GPU
02-Apr-2010 15:56:49 [---] Project communication failed: attempting access to reference site
02-Apr-2010 15:56:50 [---] Internet access OK - project servers may be temporarily down.
02-Apr-2010 15:56:52 [GPUGRID] Scheduler request failed: Timeout was reached
MarkJ Volunteer moderator Volunteer tester
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0
Do you use an account manager?
There was a similar complaint from someone recently, but it turned out to be their account manager settings.
> Do you use an account manager?
> There was a similar complaint from someone recently, but it turned out to be their account manager settings.
Nope, not using one.
Toni Volunteer moderator Project administrator Project developer Project tester Project scientist
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
I'm trying to figure out what's wrong.
It may have to do with http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?f=3&t=853&start=15
The server logs say that your CPID changed at some point. Are you running multiple projects and, if so, do the e-mail addresses match?
> The server logs say that your CPID changed at some point. Are you running multiple projects and, if so, do the e-mail addresses match?
I'm running WCG for CPU and SETI as a backup for GPU. The same e-mail address is used for all 3 projects. And I think the CPID only changes if I change my e-mail, which I didn't. It's mind-boggling! Now I have to read the link you've posted to gain some insight.
Can you confirm that the time I got "Scheduler request failed: Timeout was reached" matches the time the server logs say my CPID changed? It could be that the server incremented my RPC seqno while the client did not.
Based on the 2 posts I've read at the link you provided:
> p.s.: "Scheduler request failed: Timeout was reached" might cause it (even though I do not think so): the server increments the RPC seqno, but the client does not, because it assumes the server didn't receive the request. On a fetch request this causes "ghost WUs": WUs that are listed in your server-side profile but that you never had on your BOINC host.
> We see a 'hanging' TCP connection that somehow stays open, or is reopened by chance, and will then sport a different 'server contacts #' flag, which is why the server discards the host.
> So, if you are a candidate for a hanging connection, you will need to contact the server in the meantime to 'qualify' for a detach;
> The detach itself will take place on an update to the server, ... ok... now it fits...
> but only under very rare conditions, even among the rare ones, will the detach be executed on the result-report update sent to the server.
> Meaning, the pick-up of the hanging connection actually takes place in the connection designated to tell the server about the finished results. Very, very, very rare.
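The hanging-connection scenario in those quotes can be sketched with a toy model. It assumes, as a simplification and not as BOINC's actual scheduler code, that the server keeps the last RPC sequence number it saw for each host and registers a fresh host record whenever a request arrives with a lower number than that (which is what a state file copied to another machine would look like). `Server`, `handle_request` and `simulate` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    """Toy scheduler that remembers the last RPC seqno seen per host (hypothetical model)."""
    last_seqno: Dict[int, int] = field(default_factory=dict)
    next_host_id: int = 1

    def _new_host(self) -> int:
        host_id = self.next_host_id
        self.next_host_id += 1
        return host_id

    def handle_request(self, host_id: Optional[int], seqno: int) -> int:
        """Return the host id the server attributes this request to."""
        if host_id is None:
            # First contact: register the host.
            host_id = self._new_host()
        elif seqno < self.last_seqno.get(host_id, 0):
            # Stale seqno: assume the state file was copied to another
            # machine and register a new host record for it.
            host_id = self._new_host()
        self.last_seqno[host_id] = seqno
        return host_id


def simulate() -> List[int]:
    server = Server()
    host = server.handle_request(None, 1)
    seen = [host]
    seen.append(server.handle_request(host, 2))
    # The seqno-3 request times out on the client, which keeps its counter
    # and later retries with the same number; the retry gets through.
    seen.append(server.handle_request(host, 3))
    seen.append(server.handle_request(host, 4))
    # The original seqno-3 request finally arrives on the hanging connection.
    seen.append(server.handle_request(host, 3))
    return seen


print(simulate())
```

This prints `[1, 1, 1, 1, 2]`: the retried request is accepted normally, but the stale original that arrives late is attributed to a new host, which from the website's point of view looks like the old host was detached.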
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
So it was an odd networking problem in which a network service may have failed (probably on the server, but perhaps on a router). If it had been client-side, a simple restart would have resolved it for subsequent tasks. Perhaps this resulted in a partial or corrupted database entry.
Toni Volunteer moderator Project administrator Project developer Project tester Project scientist
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
Did you try to stop and restart the client?
Yes, I did.
Toni Volunteer moderator Project administrator Project developer Project tester Project scientist
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
> Can you confirm the time I got from "Scheduler request failed: Timeout was
> reached" and the server logs that says my CPID has changed?
Yes.
Is the problem solved now?
>> Can you confirm the time I got from "Scheduler request failed: Timeout was
>> reached" and the server logs that says my CPID has changed?
> Yes.
> Is the problem solved now?
Actually, it resolved itself. The client crunched the 4 WUs that had been detached, then got a new one (like normal crunching). The only thing lost is the crunching time. Luckily, I caught it on the 2nd rig, so I aborted the ghost WU there and started anew.
So the cause was the RPC seqno being out of sync? Maybe it's a good idea to enable re-sends?
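A re-send policy boils down to a comparison: the scheduler looks at the tasks it believes are in progress on the host, checks them against the list the client reports holding, and issues again the ones the client has lost. BOINC's server documentation describes a project option along these lines (`resend_lost_results`); whether GPUGRID enables it isn't stated in this thread. This is a minimal sketch, with an added assumption that work which can no longer be reported before its deadline is not resent; all names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def tasks_to_resend(in_progress: Dict[str, datetime],
                    client_reported: Iterable[str],
                    now: datetime) -> List[str]:
    """in_progress maps task name -> report deadline, as the server sees it
    for this host; client_reported lists the tasks the client says it holds."""
    held = set(client_reported)
    return sorted(
        name for name, deadline in in_progress.items()
        # Lost by the client and (assumption) still reportable in time.
        if name not in held and deadline > now
    )


now = datetime(2010, 4, 3, 12, 0)
in_progress = {
    "wu_a": now + timedelta(days=2),   # lost, deadline ahead: resend
    "wu_b": now - timedelta(hours=1),  # lost, deadline passed: skip
    "wu_c": now + timedelta(days=1),   # still on the client: nothing to do
}
print(tasks_to_resend(in_progress, ["wu_c"], now))
```

Note that this only rescues ghost WUs (listed by the server, missing on the client). It would not by itself recover work that sits on the client while the website shows the host as detached.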
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
Would resetting the project not do that?