Message boards : News : Experimental Python tasks (beta) - task description

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 56977 - Posted: 17 Jun 2021 | 10:40:32 UTC

Hello everyone, I just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.

What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. The idea is inspired by the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Also, every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGrid machines and report their discoveries and findings. Information from multiple agents can then be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge, this is the first time something of this sort has been attempted on a GPUGrid-like platform, and it has the potential to scale to solve problems unattainable in smaller scale settings.

Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems in the previous weeks. Now, almost all agents train successfully.

Why are GPUs being underutilized, and what are the CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid.

More information:
We mainly use PyTorch to train our neural networks. We use TensorBoard only because it is convenient for logging. We might remove that dependency in the future.
____________

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 114,775,136
RAC: 15,420
Message 56978 - Posted: 17 Jun 2021 | 11:46:18 UTC
Last modified: 17 Jun 2021 | 12:08:24 UTC

Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now.

A couple of questions though:

1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the learning data set look like? What AF do you use? Is any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing its accuracy? How did you determine the model's base parameters to get you started? How can you be sure that the initial model setup is getting you anywhere or is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Will runtimes be similar to the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or, rephrased, what main use cases or fields of research are currently imagined for the final model? What do you envision to be "problems [so far] unattainable in smaller scale settings"?
5. What is the ultimate goal of this ML project? To have only one latest-generation group of trained agents at the end, resulting from the continuous reinforcement learning iterations? Or to have several and test/benchmark them against each other?

Thx! Keep up the great work!

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 56979 - Posted: 17 Jun 2021 | 13:26:58 UTC - in response to Message 56977.

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.
____________

phi1258
Joined: 30 Jul 16
Posts: 4
Credit: 1,555,158,536
RAC: 0
Message 56989 - Posted: 18 Jun 2021 | 11:21:31 UTC - in response to Message 56977.

This is a welcome advance. Looking forward to contributing.



ServicEnginIC
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Message 56990 - Posted: 18 Jun 2021 | 12:04:08 UTC - in response to Message 56977.

Thank you very much for this advance.
I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of them not being singular any more...
Best wishes.

_heinz
Joined: 20 Sep 13
Posts: 16
Credit: 3,433,447
RAC: 0
Message 56994 - Posted: 20 Jun 2021 | 5:39:42 UTC
Last modified: 20 Jun 2021 | 5:43:47 UTC

Wish you success.
regards _heinz
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 56996 - Posted: 21 Jun 2021 | 11:28:16 UTC - in response to Message 56979.

Ian&Steve C. wrote on June 17th:

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.

I am curious what the answer will be.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 57000 - Posted: 22 Jun 2021 | 12:17:47 UTC

also, can the team comment on not just GPU "under"utilization. these have NO GPU utilization.

when will you start releasing tasks that do more than just CPU calculation? are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks to use the GPU, ever. even the tasks that take ~1hr to crunch. it all happens on the single CPU thread allocated for the WU. 0% GPU utilization and no gpugrid processes reported in nvidia-smi
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 57009 - Posted: 23 Jun 2021 | 20:09:29 UTC

I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.

MrS
____________
Scanning for our furry friends since Jan 2002

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 114,775,136
RAC: 15,420
Message 57014 - Posted: 24 Jun 2021 | 10:32:37 UTC

This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But in my opinion and understanding, this is an ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. I would highly appreciate feedback on the loads of great questions in this thread so far.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 57020 - Posted: 26 Jun 2021 | 7:53:10 UTC

https://www.youtube.com/watch?v=yhJWAdZl-Ck

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Message 58044 - Posted: 10 Dec 2021 | 11:32:51 UTC

I noticed some python tasks in my task history. All failed for me and failed so far for everyone else. Has anyone completed any?

Example:
https://www.gpugrid.net/workunit.php?wuid=27100605

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58045 - Posted: 10 Dec 2021 | 11:56:26 UTC - in response to Message 58044.

Host 132158 is getting some. The first failed with:

File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/

Looks like a typo.
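
For readers wondering how a message like this arises, here is a minimal sketch of the bug pattern. It is not the actual atari-py setup.py: the function names `run_build_buggy` and `run_build_fixed` and the parameter name `command` are hypothetical. The point is only that the error handler refers to a name (`cmd`) that was never defined, so the intended hint about `make` is replaced by a NameError.

```python
import subprocess
import sys


def run_build_buggy(command):
    """Hypothetical build step whose error handler uses an undefined name."""
    try:
        subprocess.check_call(command)
    except OSError:
        # Bug: the parameter is called `command`, but the handler uses `cmd`.
        # Python raises NameError here, hiding the real cause of the failure.
        sys.stderr.write("Unable to execute '{}'.\n".format(" ".join(cmd)))
        raise


def run_build_fixed(command):
    """Same hypothetical step with the handler referring to the right name."""
    try:
        subprocess.check_call(command)
    except OSError:
        sys.stderr.write(
            "Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(
                " ".join(command)
            )
        )
        raise
```

With the fixed handler, a missing build tool would surface as the original OSError plus the hint, which is presumably what the package author intended.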

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58058 - Posted: 11 Dec 2021 | 0:23:09 UTC

Shame the tasks are misconfigured. I ran through a dozen of them on a host with errors. With the scarcity of work, every little bit is appreciated and can be used.

We just got put back in good graces with a whitelist at Gridcoin too.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58061 - Posted: 11 Dec 2021 | 2:16:29 UTC

@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is not normally installed into the PATH environment.
It probably needs to be exported into the userland environment.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58104 - Posted: 14 Dec 2021 | 16:55:30 UTC - in response to Message 58045.

Hello everyone, sorry for the late reply.

We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.

The error was related to atari_py, as some users reported. More specifically, it occurred when installing this Python package from GitHub (https://github.com/openai/atari-py), which allows some Atari 2600 games to be used as a test bench for reinforcement learning (RL) agents.

Sorry for the inconvenience. Even though the AI agent part of the code has been tested and works, every time we need to test our agents in a new environment we need to modify the environment initialisation part of the code to include the new environment, in this case atari_py.

I just sent another batch of 5 test jobs. 3 have already finished; the others seem to be working without problems but have not yet finished.

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762


____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58112 - Posted: 15 Dec 2021 | 15:31:49 UTC - in response to Message 58104.

Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58114 - Posted: 15 Dec 2021 | 16:12:09 UTC - in response to Message 58112.

The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing bouts.

It is the one that talks about a dependency called "pinocchio" and detects conflicts with it.

It seems the conda misconfiguration persisted on some machines. To solve this error, it should be enough to click "reset" to reset the app.



____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58115 - Posted: 15 Dec 2021 | 16:56:36 UTC - in response to Message 58114.

OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58116 - Posted: 15 Dec 2021 | 19:29:54 UTC
Last modified: 15 Dec 2021 | 19:48:28 UTC

Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too.

Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine.

This host also has 16 GB system RAM: GPU is GTX 1660 Ti.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58117 - Posted: 15 Dec 2021 | 19:40:45 UTC - in response to Message 58114.
Last modified: 15 Dec 2021 | 19:43:12 UTC

I reset the project on my host. still failed.

WU: http://gpugrid.net/workunit.php?wuid=27102456

I see that ServicEnginIC and I both had the same error. we also both only have 16GB system memory on our host.

Aurum previously reported very high system memory use, but didn't elaborate on if it was real or virtual.

However, I can elaborate further to confirm that it's real.

https://i.imgur.com/XwAj4s3.png

a lot of it seems to stem from the ~4GB used by the python run.py process and then +184M for each of 32x multiproc spawns that appear to be running. not sure if these are intended to run, or if these are an artifact of setup that never got cleaned up?

I'm not certain, but it's possible that the task ultimately failed due to lack of resources, having both RAM and Swap maxed out. maybe the next system that has it will succeed with its 64GB TR system?

abouh, is it intended to keep this much system memory used during these tasks? or is this just something leftover that was supposed to be cleaned up? It might be helpful to know the exact system requirements so people with unsupported hardware do not try to run these tasks. if these tasks are going to use so much memory and all of the CPU cores, we should be prepared for that ahead of time.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58118 - Posted: 15 Dec 2021 | 23:25:46 UTC - in response to Message 58117.

I couldn't get your imgur image to load, just a spinner.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58119 - Posted: 16 Dec 2021 | 0:13:31 UTC - in response to Message 58118.

Yeah I get a message that Imgur is over capacity (first time I’ve ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later.
____________

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Message 58120 - Posted: 16 Dec 2021 | 0:26:37 UTC

I've had two tasks complete on a host that was previously erroring out:

https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116

Between 12:45:58 UTC and 19:44:33 UTC a task failed and then completed w/o any changes, resets, anything from me.

Wildly different runtime/credit ratios, I would expect something in between.

Run time (s)    Credit        Credit/sec
3,389.26        264,786.85    78/s
49,311.35       34,722.22     0.70/s

CUDA
26,635.40       420,000.00    15.77/s
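
The ratios in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the row labels are made up for illustration, and the numbers are copied from the table above.

```python
def credit_per_second(run_time_s, credit):
    """Credit awarded per second of run time."""
    return credit / run_time_s


# (label, run time in seconds, credit); labels are illustrative only.
rows = [
    ("row 1", 3389.26, 264786.85),
    ("row 2", 49311.35, 34722.22),
    ("CUDA", 26635.40, 420000.00),
]

rates = {label: credit_per_second(t, c) for label, t, c in rows}

for label, rate in rates.items():
    print(f"{label}: {rate:.2f} credits/s")
```

The first two rows differ by roughly a factor of 100 in credit per second, which is the inconsistency being pointed out.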

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58123 - Posted: 16 Dec 2021 | 9:44:51 UTC - in response to Message 58117.

Hello everyone,

The reset was only to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, related to a dependency called "pinocchio". I have checked the jobs reported to have errors after resetting, and it seems this error is not present in those jobs.

Regarding the memory usage, it is real, as you report. The ~4GB are from the main script containing the AI agent and the training process. The 32x multiproc spawns are intended; each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running 1 job. This could probably explain some job failures. Running all these environments in parallel is also more CPU intensive, as mentioned. The process to train the AI interleaves phases of data collection from interactions with the environment instances (CPU intensive) with phases of learning (GPU intensive).
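
To make the interleaving concrete, here is a toy sketch of the collect-then-learn loop. It is not the project's code: `ToyEnv`, `collect`, `learn`, and `train` are hypothetical names, the "policy" is a single number, and the environments are stepped one after another in a single process rather than in 32 separate spawns. It only illustrates how a data-collection phase over many environment instances alternates with a learning phase.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for one environment instance (CPU-bound in the real jobs)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0.0

    def step(self, action):
        self.state = 0.9 * self.state + action
        reward = -abs(self.state - 1.0)  # toy reward: keep the state near 1.0
        return self.state, reward


def collect(envs, policy_mean, steps_per_env):
    """Data-collection phase: step every environment with the current policy."""
    batch = []
    for env in envs:
        for _ in range(steps_per_env):
            action = policy_mean + env.rng.uniform(-0.1, 0.1)
            _, reward = env.step(action)
            batch.append((action, reward))
    return batch


def learn(policy_mean, batch, lr=0.05):
    """Learning phase: crude update nudging the policy toward above-average actions."""
    baseline = sum(r for _, r in batch) / len(batch)
    grad = sum((r - baseline) * (a - policy_mean) for a, r in batch) / len(batch)
    return policy_mean + lr * grad


def train(num_envs=32, iterations=10, steps_per_env=8):
    envs = [ToyEnv(seed) for seed in range(num_envs)]
    policy_mean = 0.0
    for _ in range(iterations):
        batch = collect(envs, policy_mean, steps_per_env)  # CPU-heavy phase
        policy_mean = learn(policy_mean, batch)            # GPU-heavy phase in the real jobs
    return policy_mean
```

In the real tasks the learning step would be a neural network update on the GPU, which would be consistent with the short utilisation peaks separated by longer low-utilisation stretches.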

I will test locally whether the AI agent still learns by interacting with fewer instances of the environment at the same time; that could help reduce the memory requirements a bit in future jobs. However, for now the most immediate jobs will have similar requirements.


____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58124 - Posted: 16 Dec 2021 | 10:15:12 UTC - in response to Message 58120.

Yes, I was progressively testing for how many steps the agents could be trained and I forgot to increase the credits proportionally to the training steps. I will correct that in the immediate next batch. Sorry, and thanks for bringing it to our attention.
____________

PDW
Joined: 7 Mar 14
Posts: 16
Credit: 5,861,424,525
RAC: 14,168,255
Message 58125 - Posted: 16 Dec 2021 | 10:23:45 UTC - in response to Message 58123.

On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000.
That I can manage.

However, the task also spawns a process for each of the threads (x) the machine has and then runs these; anywhere from 1 to x processes can be running at any one time. The value x is based on the machine's threads and not on what BOINC is configured for. In addition, BOINC has no idea they exist, even though they should be taken into account for scheduling purposes. The result is that the machine can at times be loading the CPU up to twice as much as expected. This I can't manage unless I run only one of these tasks while the machine is doing nothing else, which isn't going to happen.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58127 - Posted: 16 Dec 2021 | 14:18:23 UTC - in response to Message 58123.

thanks for the clarification.

I agree with PDW that running work on all CPU threads when BOINC expects that at most 1 CPU thread will be used will be problematic for most users who run CPU work from other projects.

in my case, i did notice that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. you could in theory tell BOINC how much CPU these are using by using a value over 1 in app_config for python tasks. for example, it looks like only ~10% of a thread was being used. so for my 32 thread CPU, that would equate to about 4 threads worth (round up from 3.2). so maybe something like

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>4</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
  </app>
</app_config>

you'd have to pick a cpu_usage value appropriate for your CPU use, and test to see if it works as desired.
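
The rounding used above (about 10% of a thread per spawn, 32 spawns, rounded up from 3.2 to 4) can be written as a small helper. This is just a sketch of that arithmetic; `estimate_cpu_usage` is a made-up name and the utilisation figure has to come from your own observations.

```python
import math


def estimate_cpu_usage(num_spawns, per_spawn_utilisation):
    """Threads' worth of CPU to declare in <cpu_usage>, rounded up, at least 1."""
    return max(1, math.ceil(num_spawns * per_spawn_utilisation))


# 32 spawns at roughly 10% of a thread each.
print(estimate_cpu_usage(32, 0.10))  # 4
```

If the spawns use more CPU on a given host, the same helper gives a correspondingly larger value, for example 6 at 17% per spawn.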
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58132 - Posted: 16 Dec 2021 | 16:56:20 UTC - in response to Message 58127.

I agree with PDW that running work on all CPU threads when BOINC expects that at most 1 CPU thread will be used will be problematic for most users who run CPU work from other projects.

The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work.

Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.

MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app.
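
As an illustration of the second half of that suggestion, here is a sketch of how an app could accept a cap on the number of spawned environment processes instead of always using every detected thread. The flag name `--num-envs` and the function `resolve_num_envs` are hypothetical; nothing in this thread says the project's script accepts such an option.

```python
import argparse
import os


def resolve_num_envs(argv, detected_threads=None):
    """Return how many environment processes to spawn.

    Uses the (hypothetical) --num-envs flag when given, capped at the number
    of detected CPU threads; otherwise falls back to all detected threads.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-envs", type=int, default=None)
    args = parser.parse_args(argv)

    if detected_threads is None:
        detected_threads = os.cpu_count() or 1
    if args.num_envs is None:
        return detected_threads
    return max(1, min(args.num_envs, detected_threads))
```

For example, `resolve_num_envs(["--num-envs", "8"], detected_threads=32)` returns 8, so a volunteer could match the spawn count to the `<cpu_usage>` value declared to BOINC.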

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58134 - Posted: 16 Dec 2021 | 18:20:00 UTC

given the current state of these beta tasks, I have done the following on my 7xGPU 48-thread system. allowed only 3x Python Beta tasks to run since the systems only have 64GB ram and each process is using ~20GB.

app_config.xml

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>5.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>


will see how it works out when more python beta tasks flow. and adjust as the project adjusts settings.

abouh, before you start releasing more beta tasks, could you give us a heads up as to what we should expect and/or what you changed about them?
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58135 - Posted: 16 Dec 2021 | 18:22:58 UTC

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58136 - Posted: 16 Dec 2021 | 18:52:22 UTC - in response to Message 58135.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58137 - Posted: 16 Dec 2021 | 19:14:26 UTC - in response to Message 58136.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?

Gpu utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% cpu utilization.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58138 - Posted: 16 Dec 2021 | 19:18:43 UTC - in response to Message 58137.

good to know. so what I experienced was pretty similar.

I'm sure you also had some other CPU tasks running too. I wonder if CPU utilization of the spawns would be higher if no other CPU tasks were running.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58140 - Posted: 16 Dec 2021 | 21:00:08 UTC - in response to Message 58138.

Yes primarily Universe and a few TN-Grid tasks were running also.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58141 - Posted: 17 Dec 2021 | 10:17:36 UTC - in response to Message 58134.

I will send some more tasks later today with similar requirements to the last ones, with 32 reinforcement learning environments running in parallel for the agent to interact with.

For one job, locally I get around 15GB of system memory usage, with each CPU thread at 13%-17% utilisation, as mentioned. For the GPU, the usage fluctuates between low use (5%-10%) during the phases in which the agent collects data from the environments and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).
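
Because the high-utilisation peaks only last a few seconds, a single glance at nvidia-smi can easily miss them. A minimal sketch for sampling utilisation repeatedly is shown below. It assumes nvidia-smi is on the PATH and uses its CSV query output; the function names are made up, and GPUs that report "[N/A]" for a field would need extra handling.

```python
import subprocess
import time

# Query GPU utilisation (%) and used memory (MiB) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '12, 2048' into (utilisation %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpus():
    """Return one (utilisation, memory) tuple per GPU."""
    output = subprocess.check_output(QUERY, text=True)
    return [parse_sample(line) for line in output.splitlines() if line.strip()]


def watch(samples=10, interval_s=1.0):
    """Collect several samples so that short utilisation peaks are not missed."""
    history = []
    for _ in range(samples):
        history.append(sample_gpus())
        time.sleep(interval_s)
    return history
```

Running `watch(samples=60, interval_s=1.0)` during a task and looking at the maximum utilisation per GPU should show whether the learning peaks are occurring on a given host.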

I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in the tasks.

____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58143 - Posted: 17 Dec 2021 | 16:48:28 UTC - in response to Message 58141.

I got 3 of them just now. all failed with tracebacks after several minutes of run time. seems like there's still some coding bugs in the application. all wingmen are failing similarly:

https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525


GPU (2080Ti) was loaded ~10-13% GPU utilization, but at base clocks 1350MHz and only ~65W power draw. GPU memory loaded 2-4GB. system memory reached ~25GB utilization while 2 tasks were running at the same time. CPU thread utilization ~25-30% across all 48 threads (EPYC 7402P), it didn't cap at 32 and about twice as much CPU utilization as expected, but maybe that's due to relatively low clock speed @ 3.35GHz. (I paused other CPU processing during this time).
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58144 - Posted: 17 Dec 2021 | 16:54:43 UTC - in response to Message 58143.
Last modified: 17 Dec 2021 | 16:58:05 UTC

the new one I just got seems to be doing better. less CPU use, and it looks like i'm seeing the mentioned 60-80% spikes on the GPU occasionally.

this one succeeded on the same host as the above three.

https://gpugrid.net/workunit.php?wuid=27102535
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58145 - Posted: 17 Dec 2021 | 17:21:35 UTC - in response to Message 58144.
Last modified: 17 Dec 2021 | 17:26:54 UTC

I normally test the jobs locally first, and then run a couple of small batches of tasks in GPUGrid in case some error that did not appear locally occurs. The first small batch failed, so I could fix the error in the second one. Now that the second batch succeeded, I will send a bigger batch of tasks.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58146 - Posted: 17 Dec 2021 | 18:11:26 UTC

I must be crunching one of the fixed second batch currently on this daily driver. Seems to be progressing nicely.

Using about 17GB of system memory and the gpu utilization spikes up to 97% every once in a while with periods mostly spent around 12-17% with some brief spikes around 42%.

I got one of the first batch on another host that failed fast with a similar error, as did all the wingmen.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58147 - Posted: 17 Dec 2021 | 19:29:02 UTC

these new ones must be pretty long.

been running almost 2 hours now. and a lot higher VRAM use. over 6GB per task used on the VRAM. will GPUs with less than 6GB have issues?

but it also seems that some of the system memory used can be shared. running 1 task shows ~17GB system mem use, but running 5x tasks shows about 53GB system mem use. that's as far as I'll push it on my 64GB machines.
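
As a rough back-of-the-envelope model of those two observations (about 17GB for 1 task, about 53GB for 5 tasks), one can fit a straight line and ask how many tasks would fit in a given amount of RAM. This is only a sketch: a linear fit through two data points is an assumption, the function names are made up, and it ignores memory needed by the OS and other running work.

```python
def fit_linear(n1, mem1_gb, n2, mem2_gb):
    """Fit mem(n) = fixed + per_task * n through two observations."""
    per_task = (mem2_gb - mem1_gb) / (n2 - n1)
    fixed = mem1_gb - per_task * n1
    return fixed, per_task


def max_tasks(ram_gb, fixed, per_task):
    """Largest task count whose estimated memory use fits in ram_gb."""
    return int((ram_gb - fixed) // per_task)


# Observations from the post: 1 task ~17GB, 5 tasks ~53GB.
fixed, per_task = fit_linear(1, 17, 5, 53)
print(fixed, per_task)                   # 8.0 9.0
print(max_tasks(64, fixed, per_task))    # 6
```

Under this model each additional task costs about 9GB on top of about 8GB of shared usage, so a sixth task would nominally fit in 64GB but would leave almost no headroom, which supports stopping at 5.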
____________

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,757,047,630
RAC: 1,071,191
Message 58148 - Posted: 17 Dec 2021 | 21:08:46 UTC
Last modified: 17 Dec 2021 | 21:09:41 UTC

I got the first one of the Python WUs for me, and am a little concerned. After 3.25 hours it is only 10% complete. GPU usage seems to be about what you all are saying, and same with CPU. However, I also only have 8 cores/16 threads, with 6 other CPU work units running (TN Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16 GB RAM).

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58149 - Posted: 17 Dec 2021 | 23:27:43 UTC - in response to Message 58148.

I don't think BOINC knows how to handle interpreting the estimated run_times of these Python tasks. I wouldn't worry about it.

I am over 6 1/2 hours now on this daily driver with 10% still showing. I bet they never show anything BUT 10% done until they finish.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58150 - Posted: 18 Dec 2021 | 0:09:18 UTC - in response to Message 58149.

I had the same feeling, Keith
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58151 - Posted: 18 Dec 2021 | 0:14:15 UTC
Last modified: 18 Dec 2021 | 0:15:02 UTC

also those of us running these, should probably prepare for VERY low credit reward.

This is something I have observed for a long time with beta tasks here. there seems to be some kind of anti-cheat mechanism (or bug) built into BOINC when using the default credit reward scheme (based on flops): if the calculated credit reward is over some value, the credit reward gets defaulted to some very low value. since these are so long running, and beta, I fully expect to see this happen. I've reported about this behavior in the past.

would be a nice surprise if not, but I have a strong feeling it'll happen.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58152 - Posted: 18 Dec 2021 | 1:14:41 UTC - in response to Message 58151.

I got one task early on that rewarded more than reasonable credit.
But the last one was way low but I thought I read a post from @abouh that he had made a mistake in the credit award algorithm and had corrected for that.
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#58124

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58153 - Posted: 18 Dec 2021 | 2:36:47 UTC - in response to Message 58152.
Last modified: 18 Dec 2021 | 3:02:51 UTC

That task was short though. The threshold is around 2million credit reward if I remember.

I posted about it in the team forum almost exactly a year ago. Don’t want to post some details publicly because it could encourage cheating. But for a long time credit reward of the beta tasks has been inconsistent and not calculated fairly IMO. Because the credit reward was so high, I noticed a trend that when the credit reward was supposed to be high enough (extrapolating the runtime with expected reward) it triggered a very low value. This only happened on long running (and hence potential high reward) tasks. Since these tasks are so long, I just think there’s a possibility we’ll see that again.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58154 - Posted: 18 Dec 2021 | 4:53:29 UTC - in response to Message 58151.
Last modified: 18 Dec 2021 | 5:23:09 UTC

confirmed.

Keith you just reported this one.

http://www.gpugrid.net/result.php?resultid=32731284

that value of 34,722.22 is the exact same "penalty value" i noticed before a year ago. for 11hrs worth of work (clock time). and 28hrs of "cpu time". interesting that the multithreaded nature of these tasks inflates the run time so much.

extrapolating from your successful run that did not hit a penalty, I'd guess that any task longer than about 2.5hrs is gonna hit the penalty value for these tasks. they really should just use the same credit scheme as acemd3. or assign static credit scaled for expected runtime, as long as all of the tasks are about the same size.

BOINC documentation confirms my suspicions on what's happening.

https://boinc.berkeley.edu/trac/wiki/CreditNew

Peak FLOP Count

This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.

When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.

When a client finishes a job and reports its elapsed time T, we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

The credit for a job J is typically proportional to PFC(J), but is limited and normalized in various ways.

Notes:

PFC(J) is not reliable; cheaters can falsify elapsed time or device attributes.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.


One-time cheats

For example, claiming a PFC of 1e304.

This is handled by the sanity check mechanism, which grants a default amount of credit and treats the host with suspicion for a while.
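
To make the quoted rules concrete, here is a minimal Python sketch of the PFC formula followed by a sanity cap. It is only an illustration: the 30 TFLOPS peak speed is made up, the credit scale (200 credits per GFLOPS-day) is my assumption about the cobblestone definition, and the limit is the "around 2 million" figure recalled earlier in this thread, not a value read from the server. Only the 34,722.22 fallback is the number actually seen on the task above.

```python
# Sketch of PFC(J) = T * peak_flops(J) followed by a sanity check.
# COBBLESTONE_SCALE, SANITY_LIMIT and the peak speed used below are assumptions;
# DEFAULT_CREDIT is the fallback value observed on the penalised task.

COBBLESTONE_SCALE = 200.0 / (86400.0 * 1e9)  # assumed: 200 credits per GFLOPS-day
SANITY_LIMIT = 2_000_000.0                   # "around 2 million", as recalled above
DEFAULT_CREDIT = 34_722.22                   # observed fallback credit


def peak_flop_count(elapsed_s, peak_flops):
    """PFC(J) = T * peak_flops(J)."""
    return elapsed_s * peak_flops


def granted_credit(elapsed_s, peak_flops):
    """Credit proportional to PFC, replaced by a default when it looks implausible."""
    claimed = peak_flop_count(elapsed_s, peak_flops) * COBBLESTONE_SCALE
    if claimed > SANITY_LIMIT:
        return DEFAULT_CREDIT
    return claimed


if __name__ == "__main__":
    peak = 30e12  # hypothetical 30 TFLOPS peak speed
    for hours in (0.5, 2.5, 11):
        print(hours, "h ->", round(granted_credit(hours * 3600, peak), 2))
```

Under these assumptions the claimed credit grows linearly with elapsed time until it crosses the limit, at which point a long task is suddenly granted less than a short one.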

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58155 - Posted: 18 Dec 2021 | 6:29:56 UTC

Yep, I saw that. Same credit as before and now I remember this bit of code being brought up before back in the old Seti days.

@Abouh needs to be made aware of this and assign fixed credit as they do with acemd3.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,928,481,630
RAC: 4,906,648
Level
Trp
Scientific publications
watwatwat
Message 58157 - Posted: 18 Dec 2021 | 16:30:01 UTC
Last modified: 18 Dec 2021 | 16:45:56 UTC

Awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN & TN-Grid WUs running with CPU use flat-lined at 100%. Suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20 to 40%. Then I tried a couple of scenarios.
1. Rig-44 has an i9-9980XE 18c36t 32 GB with 16 GB swap file, SSD, and 2 x 2080 Ti's. The GPU use is so low I switched GPU usage to 0.5 for both OPNG and PG and reread config files. OPNG WUs started running and have all been reported fine. PG WUs kept running. Then I started adding back in gene_pcim WUs. When I exceeded 4 gene_pcim WUs the CPU use bands changed shape in a similar way to Rig-24 with a tight band around 30% and a number of curves bouncing off 100%.

2. Rig-26 has an E5-2699 22c44t 32 GB with 16 GB swap (yet to be used), SSD, and a 2080 Ti. I've added back 24 gene_pcim WUs and the CPU use band has moved up to 40-80% with no peaks hitting 100%. Next I changed GPU usage to 0.5 for both OPNG and PG and reread config files. Both seem to be running fine.

3. Rig-24 has an i7-6980X 10c20t 32 GB with a 16 GB swap file, SSD, and a 2080 Ti. This one has been running for 17 hours so far with the last 2 hours having all other CPU work suspended. Its CPU usage graph looks different. There's a tight band oscillating about 20% with a single band oscillating from 60 to 90%. Since PG wants 32 CPUs and this CPU only has 20 there's a constant queue for hyperthreading to feed in. I'll let this one run by itself hoping it finishes soon.

Note: TN-Grid usually runs great in Resource Zero Mode where it rarely ever sends more than one extra WU. With PG running and app_config reducing the max running WUs TN-Grid just keeps sending more WUs. Up to 280 now.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58158 - Posted: 18 Dec 2021 | 17:03:32 UTC - in response to Message 58157.
Last modified: 18 Dec 2021 | 17:11:37 UTC

I did something similar with my two 7xGPU systems.

limited to 5 tasks concurrently.

and set the app_config files up such that it would run either 3x Einstein per GPU, OR 1xEinstein + 1x GPUGRID since the resources used by both are complementary.

set GPUGRID to 0.6 for GPU use (prevents two from running on the same GPU, 0.6+0.6 >1.0)
set Einstein to 0.33 for GPU use (allows three to run on a single GPU or one GPUGRID + one Einstein, 0.33+0.33+0.33<1.0, 0.6+0.33<1.0)
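
the arithmetic behind those two settings can be checked with a few lines of Python. this is a simplified model that assumes the client only co-schedules tasks whose gpu_usage fractions sum to at most 1.0 per GPU; the function names are just for illustration.

```python
from itertools import combinations_with_replacement

# gpu_usage fractions taken from the two settings above
GPU_USAGE = {"GPUGRID": 0.6, "Einstein": 0.33}


def fits_on_one_gpu(tasks):
    """True if the summed gpu_usage of the tasks does not exceed one GPU."""
    return sum(GPU_USAGE[task] for task in tasks) <= 1.0


def allowed_mixes(max_tasks=3):
    """List every mix of up to max_tasks tasks that fits on a single GPU."""
    mixes = []
    for count in range(1, max_tasks + 1):
        for combo in combinations_with_replacement(sorted(GPU_USAGE), count):
            if fits_on_one_gpu(combo):
                mixes.append(combo)
    return mixes


if __name__ == "__main__":
    for mix in allowed_mixes():
        print(mix)
```

two GPUGRID tasks never appear in the list, while three Einstein tasks or one of each do.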

but running 5 tasks on a system with 64GB system memory was too ambitious, ram use was initially OK, but grew to fill system ram and swap (default 2GB).

if these tasks become more common and plentiful, I might consider upgrading these 7xGPU systems to 128GB RAM so that they can handle running on all GPUs at the same time, but not going to bother if the project decides to reduce the system requirements or these pop up very infrequently.

the low credit reward per unit time due to the BOINC credit fail safe default value should be fixed though. not many people will have much incentive to test out the beta tasks with 10-20x less credit per unit time.

oh and these don't checkpoint properly (they checkpoint once very early on). if you pause a task that's been running for 20hrs, it restarts from that first checkpoint 20hrs ago.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58161 - Posted: 20 Dec 2021 | 10:29:54 UTC
Last modified: 20 Dec 2021 | 13:55:24 UTC

Hello everyone,

The batch I sent on Friday was successfully completed, even if some jobs failed several times initially and got reassigned.

I went through all failed jobs. Here I summarise some errors I have seen:

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.
2. Conda environment conflicts with package pinocchio. This one I talked about in a previous post. It requires resetting the app.
3. 'INTERNAL ERROR: cannot create temporary directory!' - I understand this one could be due to a full disk.

Also, based on the feedback I will work on fixing the following things before the next batch:

1. Checkpoints will be created more often during training. So jobs can be restarted and won’t go back to the beginning.
2. Credits assigned. The idea is to progressively increase the credits until the credit return becomes similar to that of the acemd jobs. However, devising a general formula to calculate them is more complex in this case. For now it is based on the total amount of data samples gathered from the environments and used to train the AI agent, but that does not take into account the size of the agent neural networks. For now we will keep them fixed, but to solve other problems it might be necessary to adjust them.
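
As a rough illustration of the first point, periodic checkpointing could look like the Python sketch below. It is not the actual task script: the state is a plain step counter saved as JSON, and the function names and file layout are hypothetical. A real job would save the agent's network weights and optimizer state instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: temp file first, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}


def train(path, total_steps, checkpoint_every):
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training update
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With this pattern a restarted job loses at most the work done since the last multiple of checkpoint_every, instead of going back to the beginning.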

Finally, I think I was a bit too ambitious regarding the total amount of training per job. I will break jobs down in two, so they don't take that long to complete.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58162 - Posted: 20 Dec 2021 | 14:55:18 UTC - in response to Message 58161.

thanks!

I did notice that all of mine failed with exceeded time limit.

might be a good idea to increase the estimated flops size of these tasks so BOINC knows that they are large and will run for a long time.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58163 - Posted: 20 Dec 2021 | 16:44:12 UTC - in response to Message 58161.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.

I've tried to set preferences on all my hosts whose GPUs have less than 6GB of RAM so that they don't receive the Python Runtime (GPU, beta) app:

Run only the selected applications
ACEMD3: yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): no

If no work for selected applications is available, accept work from other applications?: no

But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended or not...

Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0

RuntimeError: CUDA out of memory.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58164 - Posted: 20 Dec 2021 | 17:12:00 UTC - in response to Message 58163.

This makes me doubt whether GPUGRID preferences are currently working as intended or not...

my question is a different one: as long as the GPUGRID team now concentrates on Python, no more ACEMD tasks will come?

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 16
Credit: 5,861,424,525
RAC: 14,168,255
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58166 - Posted: 20 Dec 2021 | 18:21:34 UTC - in response to Message 58163.

But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended or not...


I had the same problem, you need to set the 'Run test applications' to No
It looks like having that set to Yes will override any specific application setting you set.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58167 - Posted: 20 Dec 2021 | 19:26:34 UTC - in response to Message 58166.

Thanks, I'll try

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58168 - Posted: 20 Dec 2021 | 19:53:57 UTC - in response to Message 58164.

This makes me doubt whether GPUGRID preferences are currently working as intended or not...

my question is a different one: as long as the GPUGRID team now concentrates on Python, no more ACEMD tasks will come?

Hard to say. Toni and Gianni both stated the work would be very limited and infrequent until they can fill the new PhD positions.

But there have been occasional "drive-by" drops of cryptic scout work I've noticed along with the occasional standard research acemd3 resend.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58169 - Posted: 21 Dec 2021 | 5:52:18 UTC - in response to Message 58168.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Would be great if they work on Windows, too :-)

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58170 - Posted: 21 Dec 2021 | 9:56:28 UTC - in response to Message 58168.

Today I will send a couple of batches with short tasks for some final debugging of the scripts and then later I will send a big batch of debugged tasks.

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58171 - Posted: 21 Dec 2021 | 9:57:51 UTC - in response to Message 58169.

The idea is to make it work for Windows in the future as well, once it works smoothly on linux.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58172 - Posted: 21 Dec 2021 | 15:44:20 UTC - in response to Message 58170.

Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58173 - Posted: 21 Dec 2021 | 16:47:02 UTC - in response to Message 58172.

Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.


not sure what happened to it. take a look.

https://gpugrid.net/result.php?resultid=32731651
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58174 - Posted: 21 Dec 2021 | 17:16:54 UTC - in response to Message 58173.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58175 - Posted: 21 Dec 2021 | 18:15:03 UTC - in response to Message 58174.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.


It's interesting, looking at the stderr output. it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project.

(i think the issue is that I was connected to my VPN checking something else and I left the connection active and it might have had an issue reaching the site it was trying to access)

not sure how kosher that is. I think BOINC devs don't intend/desire this kind of behavior. some people might have some security concerns of the app doing these things outside of BOINC. might be a little smoother to do all communication only between the host and the project and only via the BOINC framework. if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend.

just my .02
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,928,481,630
RAC: 4,906,648
Level
Trp
Scientific publications
watwatwat
Message 58176 - Posted: 21 Dec 2021 | 18:44:13 UTC - in response to Message 58161.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58177 - Posted: 21 Dec 2021 | 18:58:56 UTC - in response to Message 58171.

The idea is to make it work for Windows in the future as well, once it works smoothly on linux.

okay, sounds good; thanks for the information

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58178 - Posted: 21 Dec 2021 | 19:12:20 UTC

I'm running one of the new batch and at first the task was only using 2.2GB of gpu memory but now it has clocked back up to 6.6GB of gpu memory.

Much as the previous ones. I thought the memory requirements were going to be cut in half.

Consuming the same amount of system memory as before . . . maybe a couple of GB more in fact. Up to 20GB now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,928,481,630
RAC: 4,906,648
Level
Trp
Scientific publications
watwatwat
Message 58179 - Posted: 21 Dec 2021 | 21:21:09 UTC

Just had one that's listed as "aborted by user." I didn't abort it.
https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?

What's this? Something I need to do or not?
"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58180 - Posted: 21 Dec 2021 | 23:12:16 UTC

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

That error on 4 tasks right around 55 minutes on 3080Ti

The same PC/GPU has completed Python tasks before, one earlier that ran for 1900 seconds and is running one now for 9hr. Util is around 2-3% and 6.5GB memory in nvidia-smi. 6.1GB in BOINC.

3070Ti has been running for 7:45 with 8% Util and same memory usage.
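
The numbers in that error message can be pulled apart with a short Python sketch. The parser below is only a helper written for this message format, and the interpretation is an assumption: memory that is neither free nor reserved by this PyTorch process is presumably held by something else on the card (another task, another process, or driver overhead). It hints at that without proving it.

```python
import re

# Error text copied from the failed task above (first sentence only)
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB "
    "(GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; "
    "50.00 MiB free; 3.21 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}


def parse_oom(message):
    """Extract the memory figures, converted to GiB, from a PyTorch OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        figures[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return figures


def held_outside_this_process(figures):
    """Memory that is neither free nor reserved by this PyTorch process."""
    return figures["total"] - figures["free"] - figures["reserved"]


if __name__ == "__main__":
    figures = parse_oom(MESSAGE)
    print(f"{held_outside_this_process(figures):.2f} GiB held elsewhere")
```

For this message the remainder is roughly 8.5 GiB, which is more than the failing process itself had reserved.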

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58181 - Posted: 22 Dec 2021 | 1:34:01 UTC - in response to Message 58179.

The ray errors are normal and can be ignored.
I completed one of the new tasks successfully. The one I commented on before.
14 hours of compute time.

I had another one that completed successfully but the stderr.txt was truncated and does not show the normal summary and boinc finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58182 - Posted: 22 Dec 2021 | 1:40:18 UTC - in response to Message 58176.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

I'm not doing anything at all in mitigation for the Python on GPU tasks other than to only run one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58183 - Posted: 22 Dec 2021 | 9:29:54 UTC - in response to Message 58178.
Last modified: 22 Dec 2021 | 9:30:08 UTC

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58184 - Posted: 22 Dec 2021 | 9:37:48 UTC - in response to Message 58175.
Last modified: 22 Dec 2021 | 9:43:47 UTC

During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It immensely helps to understand the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way.

wandb has a python package extensively used in machine learning research, which we import in our scripts for this purpose.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58185 - Posted: 22 Dec 2021 | 9:43:04 UTC - in response to Message 58176.

Pinocchio probably only caused problems in a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58186 - Posted: 22 Dec 2021 | 10:07:45 UTC

My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660.

They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website - both the Run time and CPU time columns show the exact same value, and it's too large to be feasible: task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return.

Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run. I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time.

I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58187 - Posted: 22 Dec 2021 | 11:25:13 UTC - in response to Message 58183.

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.


Halved? I've got one at nearly 21.5 hours on a 3080Ti and still going

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58188 - Posted: 22 Dec 2021 | 15:39:07 UTC

This shows the timing discrepancy, a few minutes before task 32731655 completed.



The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58189 - Posted: 22 Dec 2021 | 15:47:48 UTC - in response to Message 58188.

i still think the 5,000,000 GFLOPs count is far too low. since these run for 12-24hrs depending on host (GPU speed does not seem to be a factor in this since GPU utilization is so low, most likely CPU/memory bound) and there seems to be a bit of a discrepancy in run time per task. I had a task run for 9hrs on my 3080Ti, while another user claims 21+ hrs on his 3080Ti. and I've had several tasks get killed around 12hrs for exceeded time limit, while others ran for longer. lots of inconsistencies here.

the low flops count is causing a lot of tasks to prematurely get killed by BOINC for exceeded time limit when they would have completed eventually. the fact that they do not proceed past 10% completion until the end probably doesn't help.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58190 - Posted: 22 Dec 2021 | 16:27:52 UTC - in response to Message 58189.

Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project.

The screen shot also shows how the 'remaining time' estimate gets screwed up when the running value reaches something like 10 hours at 10%. Roll on intermediate progress reports and checkpoints.
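
A rough Python sketch of how the DCF (duration correction factor) enters the estimate, assuming the usual relation estimate = rsc_fpops_est / device speed x DCF. The 1 TFLOPS device speed is hypothetical, and the sketch deliberately does not model how the client updates the DCF over time.

```python
# 5,000,000 GFLOPs, the task size mentioned above
RSC_FPOPS_EST = 5e15


def raw_estimate_s(fpops_est, device_flops):
    """Runtime estimate before any correction: work divided by speed."""
    return fpops_est / device_flops


def corrected_estimate_s(fpops_est, device_flops, dcf):
    """Runtime estimate after applying the duration correction factor."""
    return raw_estimate_s(fpops_est, device_flops) * dcf


def implied_dcf(actual_runtime_s, fpops_est, device_flops):
    """DCF that would make the estimate match an observed runtime."""
    return actual_runtime_s / raw_estimate_s(fpops_est, device_flops)


if __name__ == "__main__":
    speed = 1e12                 # hypothetical effective device speed (1 TFLOPS)
    observed = 10.5 * 3600       # a task that took about 10.5 hours
    dcf = implied_dcf(observed, RSC_FPOPS_EST, speed)
    print(f"raw estimate: {raw_estimate_s(RSC_FPOPS_EST, speed):.0f} s")
    print(f"implied DCF: {dcf:.2f}")
```

Raising the FLOPs estimate by the implied factor would give the same corrected estimate with a DCF close to 1.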

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58191 - Posted: 22 Dec 2021 | 17:05:06 UTC
Last modified: 22 Dec 2021 | 17:05:49 UTC

my system that completed a few tasks had a DCF of 36+

checkpointing also still isn't working. I had some tasks running for ~3hrs. restarted boinc and they restarted at 5mins.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58192 - Posted: 22 Dec 2021 | 18:52:57 UTC - in response to Message 58191.

checkpointing also still isn't working.

See my screenshot.

"CPU time since checkpoint: 16:24:44"

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58193 - Posted: 22 Dec 2021 | 18:59:00 UTC

I've checked a sched_request when reporting.

<result>
<name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
<final_cpu_time>55983.300000</final_cpu_time>
<final_elapsed_time>36202.136027</final_elapsed_time>

That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single core GPU task, but not right for a task with multithreaded CPU elements.
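
The effect can be reproduced with a small Python sketch. The rule in displayed_run_time is my reading of the server's behaviour (elapsed time never shown below CPU time), not something taken from the server code, and the fragment is closed by hand so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Values copied from the sched_request above; the closing tag is added here
SCHED_REQUEST_FRAGMENT = """
<result>
    <name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
    <final_cpu_time>55983.300000</final_cpu_time>
    <final_elapsed_time>36202.136027</final_elapsed_time>
</result>
"""


def reported_times(fragment):
    """Return (cpu_seconds, elapsed_seconds) as reported by the client."""
    node = ET.fromstring(fragment)
    cpu = float(node.findtext("final_cpu_time"))
    elapsed = float(node.findtext("final_elapsed_time"))
    return cpu, elapsed


def displayed_run_time(cpu_s, elapsed_s):
    """Assumed server rule: elapsed time is never shown below CPU time."""
    return max(cpu_s, elapsed_s)


if __name__ == "__main__":
    cpu, elapsed = reported_times(SCHED_REQUEST_FRAGMENT)
    print(f"client elapsed: {elapsed / 60:.0f} min")
    print(f"shown on site:  {displayed_run_time(cpu, elapsed) / 60:.0f} min")
    print(f"CPU / elapsed:  {cpu / elapsed:.2f}")
```

The CPU-to-elapsed ratio of about 1.5 matches the "about 50% above elapsed time" seen on tasks running locally.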

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58194 - Posted: 23 Dec 2021 | 10:07:59 UTC - in response to Message 58187.

As mentioned by Ian&Steve C., GPU speed only partially influences task completion time.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

In the last batch, I reduced the total amount of agent-environment interactions gathered and processed before ending the task with respect to the previous batch, which should have reduced the completion time.
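
The alternation described above can be sketched as a toy loop in Python. This is not the actual training script: the cost model and all the numbers are hypothetical, and they only illustrate why the total time scales with the number of interactions and why the GPU can sit idle for most of it.

```python
def run_task(total_interactions, batch_size, collect_s_per_step, learn_s_per_batch):
    """Alternate CPU-side data collection with GPU-side learning.

    Returns (cpu_seconds, gpu_seconds) under a toy cost model in which every
    environment step costs collect_s_per_step on the CPU and every learning
    phase costs learn_s_per_batch on the GPU.
    """
    gathered = 0
    cpu_seconds = 0.0
    gpu_seconds = 0.0
    while gathered < total_interactions:
        # Phase 1: the agent interacts with the environments (CPU)
        steps = min(batch_size, total_interactions - gathered)
        cpu_seconds += steps * collect_s_per_step
        gathered += steps
        # Phase 2: the collected data is processed and learned from (GPU)
        gpu_seconds += learn_s_per_batch
    return cpu_seconds, gpu_seconds


if __name__ == "__main__":
    for budget in (2000, 1000):  # full budget, then half of it
        cpu, gpu = run_task(budget, 100, 0.01, 0.5)
        print(budget, "interactions:", cpu + gpu, "s total,",
              f"{gpu / (cpu + gpu):.0%} of it on the GPU")
```

In this model halving the interaction budget halves the total time, and a faster GPU only shortens the second phase.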
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58195 - Posted: 23 Dec 2021 | 10:09:32 UTC
Last modified: 23 Dec 2021 | 10:19:03 UTC

I will look into the reported issues before sending the next batch, to see if I can find a solution for both the problem of jobs being killed due to “exceeded time limit” and the progress and checkpointing problems.

From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

Thank you very much for your feedback. Happy holidays to everyone!
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58196 - Posted: 23 Dec 2021 | 13:16:56 UTC - in response to Message 58195.

From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

The jobs reach us with a workunit description:

<workunit>
<name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
<app_name>PythonGPU</app_name>
<version_num>401</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
<open_name>run.py</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
<open_name>input.zip</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
<open_name>requirements.txt</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
<open_name>input</open_name>
<copy_file/>
</file_ref>
</workunit>

It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that.

The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running.
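
To make the arithmetic concrete, here is a minimal Python sketch of how the two values turn into an estimated runtime and a kill limit. The device speed is a made-up figure for illustration, and the real BOINC client applies further corrections that are not modelled here.

```python
# Values copied from the workunit description quoted above.
RSC_FPOPS_EST = 5e15      # <rsc_fpops_est>
RSC_FPOPS_BOUND = 2.5e17  # <rsc_fpops_bound>

# Hypothetical device speed in floating point operations per second.
# This number is an assumption, not a measurement from any host.
ASSUMED_DEVICE_FLOPS = 1e12


def seconds_for(fpops, device_flops):
    """Time needed to perform `fpops` operations at `device_flops` per second."""
    return fpops / device_flops


estimated_runtime = seconds_for(RSC_FPOPS_EST, ASSUMED_DEVICE_FLOPS)
runtime_limit = seconds_for(RSC_FPOPS_BOUND, ASSUMED_DEVICE_FLOPS)

print(f"estimated runtime: {estimated_runtime:.0f} s")
print(f"killed after:      {runtime_limit:.0f} s")
print(f"limit / estimate:  {runtime_limit / estimated_runtime:.0f}x")
```

Whatever speed is assumed, the ratio between limit and estimate stays at 50, so a task that really takes more than 50 times the estimate is the one that gets killed.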

abouh
Message 58197 - Posted: 23 Dec 2021 | 15:57:01 UTC - in response to Message 58196.
Last modified: 23 Dec 2021 | 21:34:36 UTC

I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server.

Also, I checked the progress and the checkpointing problems. They were caused by format errors.

The python scripts were logging the progress into a "progress.txt" file but apparently BOINC wants just a file "progress" without extension.

Similarly, checkpoints were being generated, but were not identified correctly since they were not called "restart.chk".

I will work on fixing these issues before the next batch of tasks.
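
A rough Python sketch of the progress side of this fix follows. The file name "progress" comes from the post above; the content format (a single fraction between 0 and 1) and the helper names are assumptions made for illustration, not the project's actual code.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write `text` to `path` via a temporary file, so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def report_progress(fraction_done, path="progress"):
    """Store the fraction done, clamped to the range [0, 1], in a file without extension."""
    fraction_done = min(max(fraction_done, 0.0), 1.0)
    write_atomically(path, f"{fraction_done:.4f}\n")
```

Calling report_progress(0.289) from the task's working directory would leave a file named "progress" containing 0.2890.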
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58198 - Posted: 23 Dec 2021 | 19:35:37 UTC - in response to Message 58197.

Thanks @abouh for working with us in debugging your application and work units.

Nice to have an attentive and easy to work with researcher.

Looking forward to the next batch.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58200 - Posted: 23 Dec 2021 | 21:20:01 UTC - in response to Message 58194.

Thank you for your kind support.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

This behavior can be seen in some tests described in my "Managing non-high-end hosts" thread.

abouh
Message 58201 - Posted: 24 Dec 2021 | 10:02:52 UTC

I just sent another batch of tasks.

I tested locally and the progress and the restart.chk files are correctly generated and updated.

rsc_fpops_est job parameter should be higher too now.

Please let us know if you think the success rate of tasks can be improved in any other way. Thanks a lot for your help.
____________

ServicEnginIC
Message 58202 - Posted: 24 Dec 2021 | 10:35:31 UTC - in response to Message 58201.

I just sent another batch of tasks.

Thank you very much for this kind of Christmas present!

Merry Christmas to all crunchers worldwide 🎄✨

Richard Haselgrove
Message 58203 - Posted: 24 Dec 2021 | 11:38:42 UTC
Last modified: 24 Dec 2021 | 12:09:40 UTC

1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough!

I'll watch this one through, but after that I'll be away for a few days - happy holidays, and we'll pick up again on the other side.

Edit: Progress %age jumps to 10% after the initial unpacking phase, then increments every 0.9%. That'll do.
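
The reported pattern is consistent with the first 10% being reserved for setup and the remaining 90% being split into 100 equal steps. That split is an inference from the numbers above, not something the project has confirmed; a Python sketch under that assumption:

```python
SETUP_FRACTION = 0.10  # progress shown once unpacking has finished (from the post above)


def overall_progress(updates_done, total_updates=100):
    """Map completed training updates onto the overall task progress.

    total_updates=100 is an assumption chosen so that each update adds 0.9%.
    """
    training_fraction = updates_done / total_updates
    return SETUP_FRACTION + (1.0 - SETUP_FRACTION) * training_fraction


for done in (0, 1, 2, 100):
    print(f"{done:3d} updates -> {overall_progress(done) * 100:.1f}%")
```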

ServicEnginIC
Message 58204 - Posted: 24 Dec 2021 | 12:51:06 UTC - in response to Message 58201.

I tested locally and the progress and the restart.chk files are correctly generated and updated.
rsc_fpops_est job parameter should be higher too now.

At a first look at one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has risen to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove.
- Checkpointing also seems to be working, with a checkpoint stored about every two minutes.
- Learning cycle period has dropped to 11 seconds, from the 21 seconds observed in the previous task (watched with sudo nvidia-smi dmon).
- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This projects to a total execution time of about 7 hours and 41 minutes on my Host #569442

Well done!
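
The projected total time quoted in the list above is a straight-line extrapolation from elapsed time and progress. A small Python sketch of that calculation follows; it assumes progress advances linearly, which may not hold for these tasks.

```python
def projected_total_seconds(elapsed_seconds, fraction_done):
    """Linear extrapolation of total runtime from the progress reported so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds / fraction_done


elapsed = 2 * 3600 + 13 * 60  # 2 hours 13 minutes, as reported above
total = projected_total_seconds(elapsed, 0.289)

hours, remainder = divmod(round(total), 3600)
print(f"projected total: {hours} h {remainder // 60} min")
```

With the figures from the post this comes out within a minute or so of the quoted estimate; the small difference is presumably rounding of the progress percentage.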

Keith Myers
Message 58208 - Posted: 24 Dec 2021 | 16:43:12 UTC

Same observed behavior. GPU memory halved, progress indicator normal and GFLOPS in line with actual usage.

Well done.

ServicEnginIC
Message 58209 - Posted: 24 Dec 2021 | 17:38:21 UTC - in response to Message 58204.

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

Answering myself: I enabled requesting Python GPU tasks on my GTX 1650 SUPER 4 GB system, and I happened to catch this previously failed task, e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1.
This task has passed the initial processing steps and has reached the learning cycle phase.
At this point, memory usage is right at the limit of the GPU's 4 GB of RAM.
Waiting to see whether this task succeeds or not.
System RAM usage remains very high:
99% of the 16 GB of RAM available on this system is currently in use.

Richard Haselgrove
Message 58210 - Posted: 24 Dec 2021 | 22:56:33 UTC - in response to Message 58204.

- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This projects to a total execution time of about 7 hours and 41 minutes on my Host #569442

That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with

<result>
<name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name>
<final_cpu_time>59637.190000</final_cpu_time>
<final_elapsed_time>39080.805144</final_elapsed_time>

That's very similar (and on the same machine) as the one I reported in message 58193. So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about).

abouh
Message 58218 - Posted: 29 Dec 2021 | 8:31:14 UTC

Hello,

Reviewing which jobs failed in the last batches, I have seen this error several times:

21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[152341] INTERNAL ERROR: cannot create temporary directory!
[152345] INTERNAL ERROR: cannot create temporary directory!
21:28:08 (152316): /usr/bin/flock exited; CPU time 0.147100
21:28:08 (152316): app exit status: 0x1
21:28:08 (152316): called boinc_finish(195)


I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
____________

ServicEnginIC
Message 58219 - Posted: 29 Dec 2021 | 9:15:02 UTC - in response to Message 58218.

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?

Right.
I gave a step-by-step solution, based on Richard Haselgrove's finding, in my Message #55986
It worked fine for all my hosts.
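
For anyone who wants to confirm which value is actually in effect on a host, a small Python sketch follows. It assumes the service is called boinc-client.service and that systemctl show prints the setting as a KEY=VALUE line; the helper names are made up for illustration.

```python
import subprocess


def parse_private_tmp(show_output):
    """Return True/False from a 'PrivateTmp=yes' style line, or None if the line is absent."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "PrivateTmp":
            return value.strip().lower() in ("yes", "true", "1")
    return None


def private_tmp_enabled(unit="boinc-client.service"):
    """Ask systemd for the effective PrivateTmp setting of the given unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=PrivateTmp"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_private_tmp(result.stdout)
```

private_tmp_enabled() has to run on the host itself; parse_private_tmp() can also be fed the output of the same command pasted from a terminal.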

abouh
Message 58220 - Posted: 29 Dec 2021 | 9:26:29 UTC - in response to Message 58219.

Thank you!
____________

Richard Haselgrove
Message 58221 - Posted: 29 Dec 2021 | 10:38:21 UTC

Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017

"During handling of the above exception, another exception occurred:"

"ValueError: probabilities are not non-negative"

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58222 - Posted: 29 Dec 2021 | 16:57:53 UTC

it seems checkpointing still isn't working correctly.

despite BOINC "claiming" that it's checkpointing X number of seconds ago, stopping BOINC and re-starting shows that it's not restarting from the checkpoint.

The task I currently have in progress was ~20% completed. stopped BOINC, and restarted and it retained the time (elapsed and CPU time) but progress reset to 10%.
____________

Keith Myers
Message 58223 - Posted: 29 Dec 2021 | 17:40:37 UTC - in response to Message 58222.

I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart.

ServicEnginIC
Message 58225 - Posted: 29 Dec 2021 | 23:05:12 UTC

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

Two of my hosts with 4 GB dedicated RAM GPUs have completed their latest Python GPU tasks successfully so far.
If the plan is to keep GPU RAM requirements this way, it opens the app to a much greater number of hosts.

Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGrid tasks in BOINC Manager... Why?
This host's system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
I roughly estimate that the environment for each Python task takes about 16 GB of system RAM.
I guess that an eventual third concurrent task might have crashed itself, or even crashed all three Python tasks, due to lack of system RAM.
I was watching the Psensor readings when the first of the two Python tasks finished, and the free system memory then increased drastically again, from 1% to 38%.

I also took an nvidia-smi screenshot, where it can be seen that the Python tasks were running on GPU 0 and GPU 1 respectively, while GPU 2 was processing a PrimeGrid CUDA GPU task.
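
As a rough way to judge how many of these tasks a host can hold at once, the Python sketch below reads the MemAvailable line in the format used by /proc/meminfo on Linux and divides by a per-task figure. The 16 GB per task is only the rough estimate given above, and the function names are made up for illustration.

```python
def mem_available_gib(meminfo_text):
    """Extract MemAvailable (reported in kB) from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable line not found")


def max_concurrent_tasks(available_gib, per_task_gib=16.0):
    """How many tasks fit, assuming each needs `per_task_gib` of system RAM (an estimate)."""
    return int(available_gib // per_task_gib)


def tasks_that_fit_now(per_task_gib=16.0):
    """Read the live value on a Linux host and apply the same rule."""
    with open("/proc/meminfo") as handle:
        return max_concurrent_tasks(mem_available_gib(handle.read()), per_task_gib)
```

Under that estimate a host with 32 GB free would fit two tasks and no more, which matches the behaviour described above.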

Ian&Steve C.
Message 58226 - Posted: 29 Dec 2021 | 23:24:23 UTC - in response to Message 58225.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
____________

abouh
Message 58227 - Posted: 30 Dec 2021 | 14:40:09 UTC - in response to Message 58222.

Regarding the checkpointing problem, the approach I follow is to check the progress file (if it exists) at the beginning of the Python script and then continue the job from there.


I have tested locally by stopping the task and executing the Python script again, and it continues from the point where it stopped. So the script seems correct.


However, I think that right after setting up the conda environment, the progress is automatically set to 10% before my script runs, so I am guessing this is what is causing the problem. I have modified my code so that it does not rely only on the progress file, since that file might be overwritten and reset to 10% after every conda setup.
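
One way to avoid depending on the progress file is to store the resume point inside the checkpoint itself. The Python sketch below illustrates the idea only: the file name restart.chk comes from this thread, but the JSON layout and function names are assumptions, and the real checkpoints presumably hold model state rather than a single counter.

```python
import json
import os


def save_checkpoint(iteration, checkpoint_path="restart.chk"):
    """Record the last completed iteration; write to a temp file first, then rename."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"iteration": iteration}, handle)
    os.replace(tmp_path, checkpoint_path)


def load_start_iteration(checkpoint_path="restart.chk"):
    """Return the iteration to resume from, or 0 if no usable checkpoint exists."""
    if not os.path.exists(checkpoint_path):
        return 0
    try:
        with open(checkpoint_path) as handle:
            return int(json.load(handle)["iteration"])
    except (ValueError, KeyError, TypeError, OSError):
        # Unreadable or truncated checkpoint: start from scratch rather than crash.
        return 0
```

With this layout, a progress file that is reset to 10% by the wrapper no longer affects where training resumes.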
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58228 - Posted: 30 Dec 2021 | 22:35:23 UTC - in response to Message 58226.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.


The last two tasks on my system with a 3080Ti ran concurrently and completed successfully.
https://www.gpugrid.net/results.php?hostid=477247

Richard Haselgrove
Message 58248 - Posted: 6 Jan 2022 | 9:01:57 UTC

Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today):

"wandb: Waiting for W&B process to finish, PID 334655... (failed 1). Press ctrl-c to abort syncing."

"ValueError: demo dir contains more than ´total_buffer_demo_capacity´"

abouh
Message 58249 - Posted: 6 Jan 2022 | 10:01:11 UTC
Last modified: 6 Jan 2022 | 10:20:07 UTC

One user mentioned that he could not solve the error

INTERNAL ERROR: cannot create temporary directory!


This is the configuration he is using:

### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

PrivateTmp=true

### Lines below this comment will be discarded

### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target


I was just wondering if there is any possible reason why it should not work.
____________

Richard Haselgrove
Message 58250 - Posted: 6 Jan 2022 | 12:01:13 UTC - in response to Message 58249.

I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17

The full, unmodified, contents of the file are

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is.

We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play.
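
On the positioning question: systemd only honours a directive that appears under a section header, so in a drop-in file PrivateTmp=true needs a [Service] line above it. The Python sketch below is only an illustrative check; the function name is made up, and it handles just the simple KEY=VALUE layout shown in this thread, not the full unit-file syntax.

```python
def directive_in_section(unit_text, section, key):
    """Return the value of `key` if it is set inside [section], else None."""
    current_section = None
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank lines and comments carry no settings
        if line.startswith("[") and line.endswith("]"):
            current_section = line[1:-1]
            continue
        name, sep, value = line.partition("=")
        if sep and current_section == section and name.strip() == key:
            return value.strip()
    return None


override_as_posted = "PrivateTmp=true\n"
override_with_header = "[Service]\nPrivateTmp=true\n"

print(directive_in_section(override_as_posted, "Service", "PrivateTmp"))
print(directive_in_section(override_with_header, "Service", "PrivateTmp"))
```

The first call finds nothing because the directive sits outside any section; the second finds it under [Service].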

Richard Haselgrove
Message 58251 - Posted: 6 Jan 2022 | 12:08:17 UTC - in response to Message 58249.

A simpler answer might be

### Lines below this comment will be discarded

so the file as posted won't do anything at all - in particular, it won't run BOINC!

abouh
Message 58253 - Posted: 7 Jan 2022 | 10:27:24 UTC - in response to Message 58248.

Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it.

I will do local tests and then send a small batch of short tasks to GPUGrid to test the fixed version of the scripts before sending the next big batch.
____________

Richard Haselgrove
Message 58254 - Posted: 7 Jan 2022 | 18:13:15 UTC

Everybody seems to be getting the same error in today's tasks:

"AttributeError: 'PPODBuffer' object has no attribute 'num_loaded_agent_demos'"

Keith Myers
Message 58255 - Posted: 7 Jan 2022 | 19:48:11 UTC

I believe I got one of the test, fixed tasks this morning based on the short crunch time and valid report.

No sign of the previous error.

https://www.gpugrid.net/result.php?resultid=32732671

Richard Haselgrove
Message 58256 - Posted: 7 Jan 2022 | 19:56:15 UTC - in response to Message 58255.

Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw.

abouh
Message 58263 - Posted: 10 Jan 2022 | 10:26:02 UTC
Last modified: 10 Jan 2022 | 10:28:12 UTC

I just sent a batch that seems to fail with

File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients
if self.iter % self.save_demos_every == 0:
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'


For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience...
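
The traceback shows a modulo against a setting that was left as None. A minimal Python sketch of one way to guard against that follows; the class is a hypothetical stand-in, not the actual PPODBuffer code, and the real fix may well differ.

```python
class DemoSaver:
    """Hypothetical stand-in for the part of the buffer that decides when to save demos."""

    def __init__(self, save_demos_every=None):
        self.iter = 0
        self.save_demos_every = save_demos_every  # None means "never save"

    def should_save(self):
        every = self.save_demos_every
        # Check for None (and non-positive values) before using the modulo operator.
        return every is not None and every > 0 and self.iter % every == 0


saver = DemoSaver()  # option left unset, as in the failing batch
print(saver.should_save())

saver = DemoSaver(save_demos_every=5)
saver.iter = 10
print(saver.should_save())
```

A local run that happens to set the option would never reach the None case, which is one plausible reason a bug like this only shows up on the grid.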

I will also send a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks.
____________

Richard Haselgrove
Message 58264 - Posted: 10 Jan 2022 | 10:38:35 UTC - in response to Message 58263.
Last modified: 10 Jan 2022 | 10:58:56 UTC

Got one of those - failed as you describe.

Also has the error message "AttributeError: 'GWorker' object has no attribute 'batches'".

Edit - had a couple more of the broken ones, but one created at 10:40:34 UTC seems to be running OK. We'll know later!

FritzB
Send message
Joined: 7 Apr 15
Posts: 12
Credit: 2,784,207,771
RAC: 52,658
Level
Phe
Scientific publications
wat
Message 58265 - Posted: 10 Jan 2022 | 14:09:55 UTC - in response to Message 58264.

I got 20 bad WU's today on this host: https://www.gpugrid.net/results.php?hostid=520456


Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/45 [00:00<?, ?it/s]

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "concurrent/futures/process.py", line 368, in _queue_management_worker
File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "entry_point.py", line 69, in <module>
File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
File "concurrent/futures/_base.py", line 611, in result_iterator
File "concurrent/futures/_base.py", line 439, in result
File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)

</stderr_txt>
]]>

Keith Myers
Message 58266 - Posted: 10 Jan 2022 | 16:33:22 UTC - in response to Message 58264.

I errored out 12 tasks created from 10:09:55 to 10:40:06.

Those all have the batch error.

But have 3 tasks created from 10:41:01 to 11:01:56 still running normally

Keith Myers
Message 58268 - Posted: 10 Jan 2022 | 19:39:01 UTC

And two of those were the batch error resends that now have failed.

Only 1 still processing that I assume is of the fixed variety. 8 hours elapsed currently.

https://www.gpugrid.net/result.php?resultid=32732855

Richard Haselgrove
Message 58269 - Posted: 10 Jan 2022 | 21:31:54 UTC - in response to Message 58268.

You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs).

abouh
Message 58270 - Posted: 11 Jan 2022 | 8:11:13 UTC - in response to Message 58265.
Last modified: 11 Jan 2022 | 8:11:37 UTC

I have seen this error a few times.

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


Do you think it could be due to a lack of resources? I think Linux starts killing processes if you are over capacity.
____________

Keith Myers
Message 58271 - Posted: 12 Jan 2022 | 1:15:57 UTC

Might be the OOM-Killer kicking in. You would need to

grep -i kill /var/log/messages*

to check if processes were killed by the OOM-Killer.

If that is the case you would have to configure /etc/sysctl.conf to let the system be less sensitive to brief out of memory conditions.
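
The same search can be done from Python when the log has to be filtered further. In the sketch below the regular expression is an assumption about the usual wording of the kernel's OOM message, and the log location varies by distribution (on Debian and Ubuntu systems it is typically /var/log/syslog or /var/log/kern.log rather than /var/log/messages).

```python
import re

# Assumed wording of the kernel message; older kernels say "Kill process".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process name) for every OOM-killer line found."""
    kills = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def find_oom_kills_in_file(path):
    """Scan one log file; reading system logs usually needs elevated permissions."""
    with open(path, errors="replace") as handle:
        return find_oom_kills(handle)
```

An empty result does not rule out memory pressure; it only means no matching line was found in that particular file.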

Richard Haselgrove
Message 58272 - Posted: 12 Jan 2022 | 8:56:21 UTC

I Googled the error message, and came up with this stackoverflow thread.

The problem seems to be specific to Python, and arises when running concurrent modules. There's a quote from the Python manual:

"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."

Other search results may provide further clues.

abouh
Message 58273 - Posted: 12 Jan 2022 | 15:11:50 UTC - in response to Message 58272.
Last modified: 12 Jan 2022 | 15:24:12 UTC

Thanks! Out of the possible explanations for the error listed in the thread, I suspect it could be the OS killing the processes due to a lack of resources. It could be that there is not enough RAM, or maybe Python raises this error if the ratio of processes to cores is high? (I have seen some machines with 4 CPUs, and the task spawns 32 reinforcement learning environments.)

All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs and this error always occurs on the same hosts, so it is something specific to those machines. I will check whether I can find a common pattern among the hosts that get this error.
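
If oversubscription did turn out to matter (which is not established here), one option would be to scale the number of environments to the host. A Python sketch of that idea follows; the function name and the limit of two environments per core are arbitrary assumptions for illustration.

```python
import os


def cap_environments(requested, cpu_count=None, max_per_core=2):
    """Limit the number of parallel environments to what the host's cores can reasonably serve."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(requested, cores * max_per_core))


# A 4-core host asked to run 32 environments, as in the case described above.
print(cap_environments(32, cpu_count=4))
# A 64-core host keeps the full 32.
print(cap_environments(32, cpu_count=64))
```

Changing the number of environments would also change how much experience is gathered per learning cycle, so this is a trade-off rather than a free fix.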
____________

Keith Myers
Message 58274 - Posted: 12 Jan 2022 | 16:46:57 UTC
Last modified: 12 Jan 2022 | 16:55:04 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

What kernel and OS?

Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS

I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear with no memory pressures.

ServicEnginIC
Message 58275 - Posted: 12 Jan 2022 | 20:47:57 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

Same Python version as mine currently.

In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts at Message #57833
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that this didn't work for him.
mmonnin kindly published an alternative way at his Message #57840

mmonnin
Message 58276 - Posted: 13 Jan 2022 | 2:31:57 UTC

I saw the prior post and was about to mention the same thing. Not sure which one works as the PC has been able to run tasks.

The recent tasks are taking a really long time
2d13h 62.2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system

2x concurrently on 3080Ti
2d12h 61.3%
2d14h 60.4%

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58277 - Posted: 13 Jan 2022 | 10:45:46 UTC - in response to Message 58274.

All jobs should use the same python version (3.8.10), I define it in the requirements.txt file of the conda environment.

Here are the specs from 3 hosts that failed with the BrokenProcessPool error:

OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

Memory:
32081.92 MB
32092.04 MB
9954.41 MB

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58278 - Posted: 13 Jan 2022 | 19:55:11 UTC

I have a failed task today involving pickle.

magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

When I was investigating the brokenprocesspool error I saw posts that involved the word pickle and the fixes for that error.

https://www.gpugrid.net/result.php?resultid=32733573

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 146,609,125
RAC: 52,399
Level
Cys
Scientific publications
wat
Message 58279 - Posted: 13 Jan 2022 | 21:18:41 UTC

The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, the minimum was raised to 3.7 within the past couple of years. The only Kepler card that has 3.7 is the Tesla K80.

From this task:


[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning:
Found GPU%d %s which is of cuda capability %d.%d.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is %d.%d.
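
The check behind that warning boils down to comparing (major, minor) pairs. A minimal sketch follows, assuming the 3.7 minimum mentioned above (the log itself does not print the numbers, so treat it as an assumption). On a machine with PyTorch installed, the pair for a real device can be read with torch.cuda.get_device_capability(); the sketch below avoids that dependency and uses hard-coded values.

```python
# Minimum taken from the post above (3.7); treat it as an assumption,
# since the error output did not print the actual numbers.
ASSUMED_MINIMUM = (3, 7)

def is_supported(capability, minimum=ASSUMED_MINIMUM):
    """Tuples compare element by element, so (3, 5) < (3, 7) < (5, 0)."""
    return tuple(capability) >= tuple(minimum)

cards = {
    "Tesla K20": (3, 5),
    "Tesla K80": (3, 7),
    "Quadro K620": (5, 0),
}
for name, cc in cards.items():
    print(name, "supported" if is_supported(cc) else "too old")
```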


While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58280 - Posted: 13 Jan 2022 | 21:51:08 UTC - in response to Message 58279.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58281 - Posted: 13 Jan 2022 | 22:58:05 UTC - in response to Message 58280.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.


It's often described as the "best" card, but it's just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58282 - Posted: 13 Jan 2022 | 23:23:11 UTC - in response to Message 58281.
Last modified: 13 Jan 2022 | 23:23:48 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 146,609,125
RAC: 52,399
Level
Cys
Scientific publications
wat
Message 58283 - Posted: 14 Jan 2022 | 2:21:35 UTC - in response to Message 58280.

Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58284 - Posted: 14 Jan 2022 | 9:41:19 UTC - in response to Message 58278.

Interesting. I had seen this error once before locally, and I assumed it was due to a corrupted input file.

I have reviewed the task and it was completed by another host, but only after multiple failed attempts with this pickle error.

Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
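
For reference, "EOFError: Ran out of input" is what the standard pickle module raises when it is handed an empty file, which fits the corrupted or incomplete input file theory. The sketch below uses plain pickle rather than the PyTorch loader from the failed task, and the helper name is hypothetical. It shows the error and one defensive way to treat such a file as "no checkpoint".

```python
import os
import pickle
import tempfile

def load_or_none(path):
    """Unpickle path, returning None for a missing, empty or truncated file."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return None
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except (EOFError, pickle.UnpicklingError):
        return None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        empty = os.path.join(folder, "empty.pkl")
        open(empty, "wb").close()            # zero-byte file
        try:
            with open(empty, "rb") as handle:
                pickle.load(handle)
        except EOFError as error:
            print("plain pickle.load:", error)
        print("guarded load:", load_or_none(empty))
```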
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58285 - Posted: 14 Jan 2022 | 20:12:28 UTC - in response to Message 58284.

This is the document I had found about fixing the BrokenProcessPool error.

https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor

I was reading it and stumbled upon the word "pickle" and the adjective "picklable", and thought it funny, as I had never heard that word associated with computing before.

When the latest failed task mentioned pickle in the output, it tied it right back to all the previous BrokenProcessPool errors.

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,755,599,773
RAC: 675,868
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58286 - Posted: 14 Jan 2022 | 20:25:49 UTC

@abouh: Thank you for PMing me twice!
The Experimental Python tasks (beta) now miraculously succeed on my two Linux computers (which previously produced only errors) after several restarts of the GPUGRID.net project and this week's distro update.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58288 - Posted: 15 Jan 2022 | 22:24:17 UTC - in response to Message 58225.

Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.
I then urgently suspended requesting for Gpugrid tasks at BOINC Manager... Why?
This host system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).

After upgrading system RAM from 32 GB to 64 GB at above mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441

More details in Message #58287

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58289 - Posted: 17 Jan 2022 | 8:36:42 UTC

Hello everyone,

I have seen a new error in some jobs:


Traceback (most recent call last):
  File "run.py", line 444, in <module>
    main()
  File "run.py", line 62, in main
    wandb.login(key=str(args.wandb_key))
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
    configured = _login(**kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
    wlogin.configure_api_key(key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
    apikey.write_key(self._settings, key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
    api.clear_setting("anonymous", globally=True, persist=True)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
    return self.api.clear_setting(*args, **kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
    self._api = InternalApi(*self._api_args, **self._api_kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
    self._settings = Settings(
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
    self._global_settings.read([Settings._global_path()])
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
    util.mkdir_exists_ok(config_dir)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
    os.makedirs(path)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)

</stderr_txt>


It seems the task is not allowed to create new directories inside its working directory. I wonder whether it could be some kind of configuration problem, like the "INTERNAL ERROR: cannot create temporary directory!" error, for which a solution was already shared.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58290 - Posted: 17 Jan 2022 | 9:36:10 UTC - in response to Message 58289.

My question would be: what is the working directory?

The individual line errors concern

/home/boinc-client/slots/1/...

but the final failure concerns

/var/lib/boinc-client

That sounds like a mixed-up installation of BOINC: 'home' sounds like a location for a user-mode installation of BOINC, but '/var/lib/' would be normal for a service mode installation. It's reasonable for the two different locations to have different write permissions.

What app is doing the writing in each case, and what account are they running under?

Could the final write location be hard-coded, but the others dependent on locations supplied by the local BOINC installation?

Profile [VENETO] sabayonino
Send message
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58291 - Posted: 17 Jan 2022 | 12:51:27 UTC

Hi

I have the same issue regarding the BOINC directory (my BOINC directory is set up at ~/boinc).

So I cleaned up the ~/.conda directory and reinstalled the GPUGRID.net project in the BOINC client.

Now flock detects the right running BOINC directory, but I get this task error:

https://www.gpugrid.net/result.php?resultid=32734225

./gpugridpy/bin/python (I think this is in the boinc/slots/<N>/ folder)

The WU is running and is 0.43% completed, but /home/<user>/boinc/slots/11/gpugridpy is still empty. No data has been written.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58292 - Posted: 17 Jan 2022 | 15:28:21 UTC - in response to Message 58290.
Last modified: 17 Jan 2022 | 15:55:31 UTC

Right, so the working directory is

/home/boinc-client/slots/1/...


to which the script has full access. The script tries to create a directory to save the logs, but I guess it should not do it in

/var/lib/boinc-client


So I think the problem is just that the package I am using to log results by default saves them outside the working directory. Should be easy to fix.
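
One possible way to keep the logging package inside the working directory, shown here only as a sketch and not necessarily the fix that will be applied to the tasks: wandb documents the environment variables WANDB_DIR, WANDB_CONFIG_DIR and WANDB_CACHE_DIR, and the traceback fails while creating the config directory. The helper name and the sub-folder names below are made up for the example.

```python
import os

def keep_wandb_inside(workdir):
    """Point wandb's directory variables at sub-folders of workdir."""
    targets = {
        "WANDB_DIR": os.path.join(workdir, "wandb_runs"),
        "WANDB_CONFIG_DIR": os.path.join(workdir, "wandb_config"),
        "WANDB_CACHE_DIR": os.path.join(workdir, "wandb_cache"),
    }
    for variable, path in targets.items():
        os.makedirs(path, exist_ok=True)   # create the folder if needed
        os.environ[variable] = path        # inherited by child processes too
    return targets

if __name__ == "__main__":
    # Set the variables before wandb is used (for example before wandb.login()).
    for variable, path in keep_wandb_inside(os.getcwd()).items():
        print(variable, "->", path)
```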
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58293 - Posted: 17 Jan 2022 | 15:55:05 UTC - in response to Message 58292.

BOINC has the concept of a "data directory". Absolutely everything that has to be written should be written somewhere in that directory or its sub-directories. Everything else must be assumed to be sandboxed and inaccessible.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58294 - Posted: 17 Jan 2022 | 16:17:56 UTC - in response to Message 58282.



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now has a 1080 and a 1080Ti, with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi, and so were the other GPUs that BOINC has displayed. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58295 - Posted: 17 Jan 2022 | 21:07:22 UTC - in response to Message 58294.
Last modified: 17 Jan 2022 | 21:52:59 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now as 1080 and 1080Ti with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi and so have the other BOINC displayed GPUs. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.


what motherboard? and what version of BOINC? your hosts are hidden so I cannot inspect them myself. PCIe enumeration and ordering can be inconsistent across consumer boards. My server boards seem to enumerate starting from the slot furthest from the CPU socket, while most consumer boards are the opposite with device0 at the slot closest to the CPU socket.

or do you perhaps run a locked coproc_info.xml file? this would prevent any GPU changes from being picked up by BOINC if it can't write to the coproc file.

edit:

also I forgot that most versions of BOINC incorrectly detect nvidia GPU memory. they will all max out at 4GB due to a bug in BOINC. So to BOINC your 1080Ti has the same amount of memory as your 1080. and since the 1080Ti is still a pascal card like the 1080, it has the same compute capability, so you're running into the same specs between them all still

to get it to sort properly, you need to fix BOINC code, or use a GPU with higher or lower compute capability. put a Turing card in the system not in the first slot and BOINC will pick it up as GPU0
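
The ordering described in this thread can be sketched as follows. This is an illustration of the description above, not BOINC's actual code: it ranks by compute capability, then by memory with the 4 GB cap mentioned above, and breaks ties by device number. The third metric that nobody could recall is left out, and all names are made up for the example.

```python
from dataclasses import dataclass
from typing import Tuple

GIB = 1024 ** 3

@dataclass(frozen=True)
class Gpu:
    index: int                           # device number as enumerated
    name: str
    compute_capability: Tuple[int, int]
    memory_bytes: int

def reported_gpu(gpus, memory_cap=4 * GIB):
    """Pick the device whose name would be shown for all same-vendor cards."""
    def rank(gpu):
        memory = gpu.memory_bytes if memory_cap is None else min(gpu.memory_bytes, memory_cap)
        # Higher capability wins, then more memory, then the lower device number.
        return (gpu.compute_capability, memory, -gpu.index)
    return max(gpus, key=rank)

host = [
    Gpu(0, "GTX 1080", (6, 1), 8 * GIB),
    Gpu(1, "GTX 1080 Ti", (6, 1), 11 * GIB),
]
print(reported_gpu(host).name)                   # GTX 1080 (memory ties under the cap)
print(reported_gpu(host, memory_cap=None).name)  # GTX 1080 Ti
```

With the cap in place both Pascal cards look identical, so device 0 wins. Without it, the Ti's extra memory would decide.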
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58296 - Posted: 18 Jan 2022 | 19:03:55 UTC

The tests continue. Just reported e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1, with final stats

<result>
<name>e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1</name>
<final_cpu_time>107668.100000</final_cpu_time>
<final_elapsed_time>46186.399529</final_elapsed_time>

That's an average CPU core count of 2.33 over the entire run - that's high for what is planned to be a GPU application. We can manage with that - I'm sure we all want to help develop and test the application for the coming research run - but I think it would be helpful to put more realistic usage values into the BOINC scheduler.
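
The 2.33 figure is simply total CPU time divided by wall-clock time, using the two values reported above:

```python
final_cpu_time = 107668.100000      # seconds, from <final_cpu_time>
final_elapsed_time = 46186.399529   # seconds, from <final_elapsed_time>

# CPU-seconds consumed per second of wall-clock time.
average_cores = final_cpu_time / final_elapsed_time
print(f"average CPU cores in use: {average_cores:.2f}")  # prints 2.33
```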

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 58297 - Posted: 19 Jan 2022 | 9:17:03 UTC - in response to Message 58296.

It's not a GPU application. It uses both CPU and GPU.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58298 - Posted: 19 Jan 2022 | 9:49:39 UTC - in response to Message 58296.

Do you mean changing some of the BOINC parameters like it was done in the case of <rsc_fpops_est>?

Is that to better define the resources required by the tasks?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58299 - Posted: 19 Jan 2022 | 11:03:54 UTC - in response to Message 58298.

It would need to be done in the plan class definition. Toni said that you define your plan classes in C++ code, so there are some examples in Specifying plan classes in C++.

Unfortunately, the BOINC developers didn't consider your use-case of mixing CPU elements and GPU elements in the same task, so none of the examples really match - your app is a mixture of MT and CUDA classes. What we need (or at least, would like to see) at this end are realistic values for <avg_ncpus> and <coproc><count>.

FritzB
Send message
Joined: 7 Apr 15
Posts: 12
Credit: 2,784,207,771
RAC: 52,658
Level
Phe
Scientific publications
wat
Message 58300 - Posted: 19 Jan 2022 | 19:00:18 UTC

it seems to work better now, but I've reached the time limit after 1800 sec
https://www.gpugrid.net/result.php?resultid=32734648


19:39:23 (6124): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58301 - Posted: 19 Jan 2022 | 20:55:08 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>5.0</cpu_usage>
  </gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58302 - Posted: 19 Jan 2022 | 22:28:41 UTC - in response to Message 58301.

I'm still running them at 1 CPU plus 1 GPU. They run fine, but when they are busy on the CPU-only sections, they steal time from the CPU tasks that are running at the same time - most obviously from CPDN.

Because these tasks are defined as GPU tasks, and GPU tasks are given a higher run priority than CPU tasks by BOINC ('below normal' against 'idle'), the real CPU project will always come off worst.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58303 - Posted: 20 Jan 2022 | 0:27:39 UTC - in response to Message 58302.
Last modified: 20 Jan 2022 | 0:28:14 UTC

You could employ ProcessLasso on the apps and up their priority I suppose.

When I ran Windows, I really utilized that utility to make the apps run the way I wanted them to, and not how BOINC sets them up on its own agenda.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58304 - Posted: 20 Jan 2022 | 6:46:45 UTC - in response to Message 58301.

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I think that the Python GPU app is very efficient at adapting to any number of CPU cores and taking advantage of available CPU resources.
This seems to be in some way independent of the ncpus parameter in the Gpugrid app_config.xml

Setup at my twin GPU system is as follows:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>0.49</cpu_usage>
  </gpu_versions>
</app>

And setup for my triple GPU system is as follows:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>0.33</cpu_usage>
  </gpu_versions>
</app>

The purpose of this is to be able to run two or three concurrent Python GPU tasks, respectively, without their combined reservation reaching a full "1" CPU core (2 x 0.49 = 0.98; 3 x 0.33 = 0.99). Then, I manually control CPU usage by setting "Use at most XX % of the CPUs" in BOINC Manager for each system, according to its number of CPU cores.
This allows me to run "N" Python GPU tasks concurrently with a fixed number of other CPU tasks, as desired.
But as said, the Gpugrid Python GPU app seems to take whatever CPU resources it needs to process its tasks successfully... at the cost of slowing down the other CPU applications.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58305 - Posted: 20 Jan 2022 | 7:44:41 UTC

Yes, I use Process Lasso on all my Windows machines, but I haven't explored its use under Linux.

Remember that ncpus and similar has no effect whatsoever on the actual running of a BOINC project app - there is no 'control' element to its operation. The only effect it has is on BOINC's scheduling - how many tasks are allowed to run concurrently.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58306 - Posted: 20 Jan 2022 | 15:58:45 UTC - in response to Message 58300.

This message

19:39:23 (6124): task /usr/bin/flock reached time limit 1800


indicates that, after 30 minutes, the installation of miniconda and the setup of the task environment had not finished.

Consequently, python is not found later on to execute the task since it is one of the requirements of the miniconda environment.

application ./gpugridpy/bin/python missing


Therefore, it is not an error in itself; it just means that the miniconda setup was too slow for some reason (in theory 30 minutes should be enough time). Maybe the machine is slower than usual, or the connection is slow and dependencies are not being downloaded.

We could extend this timeout, but normally, if 30 minutes is not enough for the miniconda setup, another underlying problem could exist.
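
The failure mode can be illustrated with a small sketch. This is not the BOINC wrapper's code, only an analogy using Python's subprocess module: a setup step is given a time limit, it is killed when the limit expires, and anything it was supposed to install is then missing for the next step. The function name and the stand-in commands are hypothetical.

```python
import subprocess
import sys

def run_setup(command, time_limit_s):
    """Run a setup command; return True only if it finishes in time with exit code 0."""
    try:
        completed = subprocess.run(command, timeout=time_limit_s)
    except subprocess.TimeoutExpired:
        # The child is killed on timeout, so whatever it was installing is incomplete.
        return False
    return completed.returncode == 0

if __name__ == "__main__":
    # Stand-in for a slow installer: sleeps 5 s against a 1 s limit.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print("setup finished:", run_setup(slow, time_limit_s=1))
```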
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58307 - Posted: 20 Jan 2022 | 16:18:58 UTC - in response to Message 58306.

it seems to be a reasonably fast system. my guess is another type of permissions issue which is blocking the python install and it hits the timeout, or the CPUs are being too heavily used and not giving enough resources to the extraction process.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58308 - Posted: 20 Jan 2022 | 22:15:20 UTC - in response to Message 58305.

There is no Linux equivalent of Process Lasso.

But there is a Linux equivalent of Windows Process-Explorer

https://github.com/wolfc01/procexp

Screenshots of the application at the old SourceForge repo.

https://sourceforge.net/projects/procexp/

Can dynamically change the nice value of the application.

There is also the command line schedtool utility that can be easily implemented in a bash file. I used to run that all the time in my gpuoverclock.sh script for Seti cpu and gpu apps.
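
For scripting this without schedtool, Python's os module exposes nice values and CPU affinity on Linux. The sketch below is an illustration only: the helper name is made up, changing another user's process normally needs elevated privileges, and both calls act on a single PID, so an application that has already spawned many worker processes would need each of them adjusted.

```python
import os

def deprioritise(pid, niceness=19, cpus=None):
    """Raise the nice value of pid and optionally pin it to a set of CPUs (Linux)."""
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
    if cpus is not None:
        os.sched_setaffinity(pid, cpus)
    # Report what the kernel now has on record for this PID.
    return os.getpriority(os.PRIO_PROCESS, pid), os.sched_getaffinity(pid)

if __name__ == "__main__":
    # Demonstrate on the current process, keeping only its first allowed CPU.
    first_cpu = {sorted(os.sched_getaffinity(0))[0]}
    print(deprioritise(os.getpid(), niceness=19, cpus=first_cpu))
```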

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58309 - Posted: 21 Jan 2022 | 12:14:55 UTC - in response to Message 58308.

Well, that got me a long way.

There are dependencies listed for Mint 18.3 - I'm running Mint 20.2

The apt-get for the older version of Mint returns

E: Unable to locate package python-qwt5-qt4
E: Unable to locate package python-configobj

Unsurprisingly, the next step returns

Traceback (most recent call last):
File "./procexp.py", line 27, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets, uic
ModuleNotFoundError: No module named 'PyQt5'

htop, however, shows about 30 multitasking processes spawned from main, each using around 2% of a CPU core (varying by the second) at nice 19. At the time of inspection, that is. I'll go away and think about that.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58310 - Posted: 21 Jan 2022 | 17:41:41 UTC - in response to Message 58300.

I've one task now that had the same timeout issue getting python. The host was running fine on these tasks before and I don't know what has changed.

I've aborted a couple tasks now that are not making any progress after 20 hours or so and are stuck at 13% completion. Similar series tasks are showing much more progress after only a few minutes. Most complete in 5-6 hours.

I reset the project thinking something got corrupted in the downloaded libraries but that has not fixed anything.

Need to figure out how to debug the tasks on this host.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58311 - Posted: 21 Jan 2022 | 17:42:23 UTC - in response to Message 58309.

You might look into schedtool as an alternative.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,928,481,630
RAC: 4,906,648
Level
Trp
Scientific publications
watwatwat
Message 58317 - Posted: 29 Jan 2022 | 21:23:39 UTC - in response to Message 58301.
Last modified: 29 Jan 2022 | 22:08:45 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>5.0</cpu_usage>
  </gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.
Very interesting. Does this actually limit PythonGPU to using at most 5 cpu threads?
Does it work better than:
<app_config>
  <!-- i9-7980XE 18c36t 32 GB L3 Cache 24.75 MB -->
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <avg_ncpus>5</avg_ncpus>
    <cmdline>--nthreads 5</cmdline>
    <fraction_done_exact/>
  </app>
</app_config>
Edit 1: To answer my own question, I changed cpu_usage to 5 and am running a single PythonGPU WU with nothing else going on. The System Monitor shows 5 CPUs running in the 60 to 80% range, with all the other CPUs running in the 10 to 40% range.
Is there any way to stop it from taking over one's entire computer?
Edit 2: I turned on WCG and the group of 5 went up to 100% and all the rest went to OPN in the 80 to 95% range.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58318 - Posted: 30 Jan 2022 | 5:24:25 UTC - in response to Message 58317.

No. Setting that value won’t change how much CPU is actually used. It just tells BOINC how much of the CPU is being used so that it can properly account for resources.

This app will use 32 threads and there’s nothing you can do in BOINC configuration to change that. This has always been the case though.
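Outside of BOINC configuration, Linux CPU affinity is one way to cap which cores a process may run on, which is one of the things a tool like schedtool can set. The sketch below is only an illustration under assumptions: pick_cpus is a hypothetical helper, and the code acts on the calling process, not on a PythonGPU task. For an already running task, each of its processes and threads would need the same treatment. os.sched_getaffinity and os.sched_setaffinity are standard-library calls that exist on Linux only.

```python
import os


def pick_cpus(available, n):
    """Return the n lowest-numbered CPU ids from `available` (all of them if fewer)."""
    return set(sorted(available)[:n])


# Hypothetical demo: restrict *this* process to at most 5 logical CPUs.
if hasattr(os, "sched_setaffinity"):  # only available on Linux
    allowed = pick_cpus(os.sched_getaffinity(0), 5)
    os.sched_setaffinity(0, allowed)  # pid 0 means the calling process
    print("now restricted to CPUs:", sorted(os.sched_getaffinity(0)))
```

Threads and child processes started afterwards inherit the mask, so the limit has to be applied before the workload spawns its workers.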

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Message 58320 - Posted: 2 Feb 2022 | 22:06:09 UTC

This morning, in a routine system update, I noticed that BOINC Client / Manager was updated from Version 7.16.17 to Version 7.18.1.
It would be interesting to know whether PrivateTmp=true is set as a default in this new version, thus in some way helping Python GPU tasks to succeed...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58321 - Posted: 2 Feb 2022 | 23:06:32 UTC - in response to Message 58320.

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good (it's been available for Android since August last year), but I don't yet know the answer to your specific question - there hasn't been any chatter about testing or new releases in the usual places.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58322 - Posted: 2 Feb 2022 | 23:47:29 UTC - in response to Message 58321.
Last modified: 2 Feb 2022 | 23:50:53 UTC

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good

It bombed out on the Rosetta pythons; they did not run at all (a VBox problem undoubtedly). And it failed all the validations on QuChemPedIA, which does not use VirtualBox on the Linux version. But it works OK on CPDN, WCG/ARP and Einstein/FGRBP (GPU). All were on Ubuntu 20.04.3.

So be prepared to bail out if you have to.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Message 58324 - Posted: 3 Feb 2022 | 6:29:43 UTC - in response to Message 58321.

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took a screenshot of the offered updates before updating.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58325 - Posted: 3 Feb 2022 | 9:25:23 UTC - in response to Message 58324.

My PPA gives slightly more information on the available update:



I know that it's auto-generated from the Debian package maintenance sources, which is probably the ultimate source of the Ubuntu LTS package as well. I've had a quick look round, but there's no sign so far that this release was originated by BOINC developers: in particular, no mention was made of it during the BOINC projects conference call on January 14th 2022. I'll keep digging.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58327 - Posted: 3 Feb 2022 | 12:13:36 UTC
Last modified: 3 Feb 2022 | 12:34:19 UTC

OK, I've taken a deep breath and enough coffee - applied all updates.

WARNING - the BOINC update appears to break things.

The new systemd file, in full, is

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection

[Install]
WantedBy=multi-user.target

Note the last commented line, #PrivateTmp=true. It starts with a # sign, marking it as a comment, so it has no effect: PrivateTmp is undefined in this file.
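To make the point about commented-out settings concrete, here is a minimal sketch in Python that reports whether a key is actually set in unit-file text. The helper name effective_setting is hypothetical, and the parser is deliberately simplified (it ignores sections and line continuations), so treat it as an illustration and not as a replacement for systemd's own parsing.

```python
def effective_setting(unit_text, key):
    """Return the last uncommented value of `key` in unit-file text, or None."""
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or whole-line comment: no effect
        if line.startswith("["):
            continue  # section header such as [Service]
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()  # later assignments win in this sketch
    return value


sample = """[Service]
ProtectHome=true
#PrivateTmp=true #Block X11 idle detection
"""
print(effective_setting(sample, "PrivateTmp"))   # None
print(effective_setting(sample, "ProtectHome"))  # true
```

Appending an uncommented PrivateTmp=true line to the sample text makes the first call return "true", which mirrors what an override file is meant to achieve.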

New work became available just as I was preparing to update, so I downloaded a task and immediately suspended it. After the updates, and enough reboots to get my NVidia drivers functional again (it took three this time), I restarted BOINC and allowed the task to run.

Task 32736884

Our old enemy "INTERNAL ERROR: cannot create temporary directory!" is back. Time for a systemd over-ride file, and to go fishing for another task.

Edit - updated the file, as described in message 58312, and got task 32736938. That seems to be running OK, having passed the 10% danger point. Result will be in sometime after midnight.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58328 - Posted: 3 Feb 2022 | 23:34:25 UTC

I see your task completed normally with the PrivateTmp=true uncommented in the service file.

But is the repeating warning:

wandb: WARNING Path /var/lib/boinc-client/slots/11/.config/wandb/wandb/ wasn't writable, using system temp directory

a normal entry for those using the standard BOINC location installation?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58329 - Posted: 4 Feb 2022 | 9:04:58 UTC - in response to Message 58328.

No, that's the first time I've seen that particular warning. The general structure is right for this machine, but it doesn't usually reach as high as 11 - GPUGrid normally gets slot 7. Whatever - there were some tasks left waiting after the updates and restarts.

I think this task must have run under a revised version of the app - the next stage in testing. The output is slightly different in other ways, and the task ran for a significantly shorter time than other recent tasks. My other machine, which hasn't been updated yet, got the same warnings in a task running at the same time.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58330 - Posted: 4 Feb 2022 | 9:14:25 UTC - in response to Message 58328.
Last modified: 4 Feb 2022 | 9:23:48 UTC

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58331 - Posted: 4 Feb 2022 | 9:25:40 UTC - in response to Message 58329.

Yes, this experiment uses a slightly modified version of the algorithm, which should be faster. It runs the same number of interactions with the reinforcement learning environment, so the credit amount is the same.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58332 - Posted: 4 Feb 2022 | 9:38:39 UTC - in response to Message 58330.

I'll take a look at the contents of the slot directory, next time I see a task running. You're right - the entire '/var/lib/boinc-client/slots/n/...' structure should be writable, to any depth, by any program running under the boinc user account.

How is the '.config/wandb/wandb/' component of the path created? The doubled '/wandb' looks unusual.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58333 - Posted: 4 Feb 2022 | 9:44:30 UTC - in response to Message 58332.
Last modified: 4 Feb 2022 | 9:55:30 UTC

The directory paths are defined as environment variables in the python script.

import os

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.path.join(os.getcwd(), ".config/wandb")


Then the directories are created by the wandb python package (which handles logging of relevant training data). I suspect the permissions are defined when the directories are created, so it is not a BOINC problem. I will change the paths in future jobs to:

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.getcwd()


Note that "os.getcwd()" is the working directory, so "/var/lib/boinc-client/slots/11/" in this case
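One possible reading of the warning, offered as an assumption and not as confirmed wandb behaviour: if wandb appends its own "wandb" folder to WANDB_DIR and falls back to the system temp directory when that location is missing or not writable, then a ".config/wandb" parent that was never created would produce exactly this message. The sketch below creates and checks the directory before exporting the variables. set_wandb_paths is a hypothetical helper, not part of the project's script.

```python
import os


def set_wandb_paths(base_dir, subdir=""):
    """Create the log directory first, then point the wandb variables at it."""
    target = os.path.join(base_dir, subdir) if subdir else base_dir
    os.makedirs(target, exist_ok=True)  # creates missing parents such as .config
    if not os.access(target, os.W_OK):
        raise PermissionError(f"{target} is not writable")
    os.environ["WANDB_CONFIG_DIR"] = base_dir
    os.environ["WANDB_DIR"] = target
    return target
```

Calling set_wandb_paths(os.getcwd()) matches the planned change (logs directly in the slot directory), while set_wandb_paths(os.getcwd(), ".config/wandb") would keep the nested layout but guarantee that it exists first.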

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58334 - Posted: 4 Feb 2022 | 13:32:42 UTC - in response to Message 58330.

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.


what happens if that directory doesn't exist? several of us run BOINC in a different location. since it's in /var/lib/ the process won't have permissions to create the directory, unless maybe if BOINC is run as root.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58335 - Posted: 4 Feb 2022 | 14:22:26 UTC - in response to Message 58334.

'/var/lib/boinc-client/' is the default BOINC data directory for Ubuntu BOINC service (systemd) installations. It most certainly exists, and is writable, on my machine, which is where Keith first noticed the error message in the report of a successful run. During that run, much will have been written to .../slots/11

Since abouh is using code to retrieve the working (i.e. BOINC slot) directory, the correct value should be returned for non-default data locations - otherwise BOINC wouldn't be able to run at all.
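A small sketch of why os.getcwd() follows the data directory wherever it lives: a child process reports whatever directory it was started in, with no path written into the script. reported_cwd is a hypothetical helper, and the temporary directory in the usage note below only stands in for a BOINC slot directory.

```python
import os
import subprocess
import sys


def reported_cwd(start_dir):
    """Start a Python child in `start_dir` and return what os.getcwd() gives it."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=start_dir,          # the launcher chooses the working directory
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, calling reported_cwd on a freshly created temporary directory returns that directory, whether it sits under /var/lib or under a home directory.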

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58336 - Posted: 4 Feb 2022 | 15:33:49 UTC - in response to Message 58335.
Last modified: 4 Feb 2022 | 15:39:39 UTC

I'm aware it's the default location on YOUR computer, and others running the standard ubuntu repository installer. but the message from abouh sounded like this directory was hard coded since he put the entire path. and for folks running BOINC in another location, this directory will not be the same. if it uses a relative file path, then it's fine, but I was seeking clarification.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58337 - Posted: 4 Feb 2022 | 15:59:00 UTC - in response to Message 58336.
Last modified: 4 Feb 2022 | 16:21:03 UTC

Hard path coding was removed before this most recent test batch.

edit - see message 58292: "Should be easy to fix".

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58338 - Posted: 4 Feb 2022 | 22:13:21 UTC - in response to Message 58336.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.


Yes. I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client


I also do these to allow monitoring by BoincTasks over the LAN on my Win10 machine:
• Copy “cc_config.xml” to /etc/boinc-client folder
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Reboot

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58339 - Posted: 5 Feb 2022 | 9:10:09 UTC - in response to Message 58334.
Last modified: 5 Feb 2022 | 11:01:11 UTC

The directory should be created wherever you run BOINC, that is not a problem.

It is created inside the /boinc-client directory, and it does not matter whether that directory is in /var/lib/ or somewhere else.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58340 - Posted: 5 Feb 2022 | 11:05:20 UTC - in response to Message 58338.
Last modified: 5 Feb 2022 | 11:05:38 UTC

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs.
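For anyone who wants to see what mode 777 changes, a minimal sketch in Python: the "other" write bit is what lets every account on the machine modify a file. is_world_writable is a hypothetical helper, and the check looks at classic permission bits only (ACLs are ignored).

```python
import os
import stat


def is_world_writable(path):
    """True if the 'other' write bit is set on `path` (as it is after chmod 777)."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)
```

A file with mode 644 gives False here, and the same file after chmod 777 gives True, which is why a recursive 777 on a data directory removes the separation between user accounts for everything beneath it.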

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58341 - Posted: 5 Feb 2022 | 11:50:02 UTC - in response to Message 58327.
Last modified: 5 Feb 2022 | 12:07:55 UTC

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58342 - Posted: 5 Feb 2022 | 13:27:24 UTC - in response to Message 58340.

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs.

I am on an isolated network behind a firewall/router. No problem at all.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58343 - Posted: 5 Feb 2022 | 13:28:42 UTC - in response to Message 58342.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58344 - Posted: 5 Feb 2022 | 13:30:13 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

All I know is that the new build does not work at all on Cosmology with VirtualBox 6.1.32. A work unit just suspends immediately on startup.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58345 - Posted: 5 Feb 2022 | 13:30:54 UTC - in response to Message 58343.
Last modified: 5 Feb 2022 | 13:33:37 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58346 - Posted: 5 Feb 2022 | 13:34:08 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58347 - Posted: 5 Feb 2022 | 13:40:51 UTC - in response to Message 58345.
Last modified: 5 Feb 2022 | 13:41:07 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58348 - Posted: 5 Feb 2022 | 13:56:12 UTC - in response to Message 58347.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58349 - Posted: 5 Feb 2022 | 14:08:17 UTC - in response to Message 58348.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58350 - Posted: 5 Feb 2022 | 14:11:10 UTC - in response to Message 58349.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

What comparable isolation do you get in Windows from one program to another?
Or what security are you talking about? Port security from external sources?

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58351 - Posted: 5 Feb 2022 | 15:28:34 UTC - in response to Message 58350.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?
What comparable isolation do you get in Windows from one program to another?
Security descriptors were introduced with the NTFS 1.2 file system, released in 1996 with Windows NT 4.0. The access control lists in NTFS are more complex in some respects than those in Linux. All modern Windows versions use NTFS by default.
User Account Control was introduced in 2007 with Windows Vista (apps don't run as administrator, even if the user has administrative privileges, until the user elevates them through an annoying popup).
Or what security are you talking about? Port security from external sources?
Windows Firewall was introduced with Windows XP SP2 in 2004.

This is my last post in this thread about (undermining) filesystem security.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58352 - Posted: 5 Feb 2022 | 16:53:05 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Updated my second machine. It appears that this re-release is NOT related to the systemd problem: the PrivateTmp=true line is still commented out.

Re-apply the fix (#1) from message 58312 after applying this update, if you wish to continue running the Python test apps.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58353 - Posted: 5 Feb 2022 | 16:54:05 UTC - in response to Message 58351.
Last modified: 5 Feb 2022 | 17:25:41 UTC

I think you are correct, except in the term "undermining", which is not appropriate for isolated crunching machines. There is a billion-dollar AV industry for Windows. Apparently someone has figured out how to undermine it there. But I agree that no more posts are necessary.

EDIT: I probably should have said that it was only for isolated crunching machines at the outset. If I were running a server, I would do it differently.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58354 - Posted: 5 Feb 2022 | 18:15:50 UTC
Last modified: 5 Feb 2022 | 18:16:08 UTC

While chmod 777-ing is bad practice in general, there’s little harm in blowing up the BOINC directory like that. The worst that can happen is you modify or delete a necessary file by accident and break BOINC. Just reinstall and learn the lesson. Not the end of the world in this instance.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58355 - Posted: 5 Feb 2022 | 19:20:07 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.


Ubuntu 20.04.3 LTS is still on the older 7.16.6 version.

apt list boinc-client
Listing... Done
boinc-client/focal 7.16.6+dfsg-1 amd64

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58356 - Posted: 5 Feb 2022 | 19:26:13 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58357 - Posted: 5 Feb 2022 | 22:22:11 UTC - in response to Message 58356.

I think they use a different PPA, not the standard Ubuntu version.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Message 58358 - Posted: 5 Feb 2022 | 22:52:53 UTC - in response to Message 58356.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version.
It's from http://ppa.launchpad.net/costamagnagianfranco/boinc/ubuntu
Sorry for the confusion.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Message 58359 - Posted: 5 Feb 2022 | 23:07:14 UTC - in response to Message 58357.

I think they use a different PPA, not the standard Ubuntu version.

You're right. I've checked, and this is my complete repository listing.
There are new pending updates for the BOINC package, but I've recently caught a new ACEMD3 ADRIA task, and I'm not updating until it is finished and reported.
My experience warns that these tasks are highly prone to fail if something is changed while processing.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58360 - Posted: 6 Feb 2022 | 8:10:43 UTC - in response to Message 58324.
Last modified: 6 Feb 2022 | 8:15:37 UTC

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating.

Ah. Your reply here gave me a different impression. Slight egg on face, but both our Linux update manager screenshots fail to give source information in their consolidated update lists. Maybe we should put in a feature request?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58361 - Posted: 6 Feb 2022 | 12:39:46 UTC
Last modified: 6 Feb 2022 | 12:40:31 UTC

ACEMD3 task finished on my original machine, so I updated BOINC from PPA 2022-01-30 to 2022-02-04.

I can confirm that if you used systemctl/edit to create a separate over-ride file, it remains in place - no need to re-edit every time. If you used a text editor to edit the raw systemd file in place, of course, it'll get over-written and will need editing again.

(final proof-of-the-pudding of that last statement awaits the release of the next test batch)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58362 - Posted: 6 Feb 2022 | 17:13:30 UTC

Got a new task (task 32738148). Running normally, confirms override to systemd is preserved.

Getting entries in stderr as before:

wandb: WARNING Path /var/lib/boinc-client/slots/7/.config/wandb/wandb/ wasn't writable, using system temp directory

(we're back in slot 7 as usual)

There are six folders created in slot 7:

agent_demos
gpugridpy
int_demos
monitor_logs
python_dependencies
ROMS

There are no hidden folders, and certainly no .config

wandb data is in:

/tmp/systemd-private-f670b90d460b4095a25c37b7348c6b93-boinc-client.service-7Jvpgh/tmp

There are 138 folders in there, including one called simply wandb

wandb contains:

debug-internal.log
debug.log
latest-run
run-20220206_163543-1wmmcgi5

The first two are files, the last two are folders. There is no subfolder called wandb - so no recursion, such as the warning message suggests. Hope that helps.
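The same check can be repeated with a short script. This is a sketch, with split_dirs as a hypothetical helper, that lists the folders in a directory and separates hidden ones (names starting with a dot) from visible ones. The slot path in the usage note is the one quoted above and will differ on other installations.

```python
import os


def split_dirs(path):
    """Return (visible, hidden) sub-directory names found directly under `path`."""
    visible, hidden = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                # Dot-prefixed names are the ones a plain "ls" would not show.
                (hidden if entry.name.startswith(".") else visible).append(entry.name)
    return sorted(visible), sorted(hidden)
```

Running split_dirs("/var/lib/boinc-client/slots/7") while a task is active would show whether a .config folder was ever created there.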

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58363 - Posted: 7 Feb 2022 | 8:13:08 UTC - in response to Message 58362.

Thanks! the content of the slot directory is correct.

The wandb directory will also be placed in the slot directory soon, in the next experiment. During the current experiment, which consists of multiple batches of tasks, the wandb directory will still be in /tmp, as reported by the warning.

That is not a problem per se, but I agree that it will be cleaner to place it in the slot directory, so all BOINC files are there.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58364 - Posted: 9 Feb 2022 | 9:56:19 UTC - in response to Message 58363.

wandb: Run data is saved locally in /var/lib/boinc-client/slots/7/wandb/run-20220209_082943-1pdoxrzo

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58365 - Posted: 10 Feb 2022 | 9:33:48 UTC - in response to Message 58364.
Last modified: 10 Feb 2022 | 9:34:28 UTC

Great, thanks a lot for the confirmation. So now it seems the directory is the appropriate one.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 146,609,125
RAC: 52,399
Level
Cys
Scientific publications
wat
Message 58367 - Posted: 17 Feb 2022 | 17:38:34 UTC

Pretty happy to see that my little Quadro K620s could actually handle one of the ABOU work units. Successfully ran one in under 31 hours. It didn't hit the memory too hard, which helps. The K620 has a DDR3 memory bus so the bandwidth is pretty limited.

http://www.gpugrid.net/result.php?resultid=32741283

Though, it did fail one of the Anaconda work units that went out. The error message doesn't mean much to me.

http://www.gpugrid.net/result.php?resultid=32741757


Traceback (most recent call last):
File "run.py", line 40, in <module>
assert os.path.exists('output.coor')
AssertionError
11:22:33 (1966061): ./gpugridpy/bin/python exited; CPU time 0.295254
11:22:33 (1966061): app exit status: 0x1
11:22:33 (1966061): called boinc_finish(195)
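
For readers puzzled by the same traceback: the assert only says that run.py expected a file named output.coor to exist at that point and it did not, so the real failure happened in an earlier step. A minimal sketch of a more talkative check follows. The helper name `require_output` and the use of a `run.log` file as a source of hints are assumptions for illustration, not part of the project's script.

```python
import os


def require_output(path: str, log_path: str = "run.log") -> None:
    """Raise a descriptive error if an expected output file is missing.

    Hypothetical replacement for a bare `assert os.path.exists(...)`.
    """
    if os.path.exists(path):
        return
    hint = ""
    if os.path.exists(log_path):
        # Show the tail of the log (assumed file name) to point at the real cause.
        with open(log_path, errors="replace") as fh:
            tail = fh.readlines()[-5:]
        hint = " Last log lines:\n" + "".join(tail)
    raise FileNotFoundError(
        f"expected output file {path!r} was not produced in {os.getcwd()!r}." + hint
    )
```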

Profile [AF] fansyl
Send message
Joined: 26 Sep 13
Posts: 20
Credit: 1,714,356,441
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58368 - Posted: 17 Feb 2022 | 20:12:35 UTC

All tasks end in errors on this machine: https://www.gpugrid.net/results.php?hostid=591484

Note that the machine does not have a GPU usable by BOINC.

Thanks for your help.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58369 - Posted: 18 Feb 2022 | 10:27:49 UTC - in response to Message 58368.

I got two of those yesterday as well. They are described as "Anaconda Python 3 Environment v4.01 (mt)" - declared to run as multi-threaded CPU tasks. I do have working GPUs (on host 508381), but I don't think these tasks actually need a GPU.

The task names refer to a different experimenter (RAIMIS) from the ones we've been discussing recently in this thread.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58370 - Posted: 18 Feb 2022 | 18:55:22 UTC

We were running that kind of task a year ago. Looks like the researcher has made an appearance again.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58371 - Posted: 18 Feb 2022 | 21:12:05 UTC
Last modified: 18 Feb 2022 | 21:47:13 UTC

I just downloaded one, but it errored out before I could even catch it starting. It ran for 3 seconds, required four cores of a Ryzen 3950X on Ubuntu 20.04.3, and had an estimated time of 2 days. I think they have some work to do.
http://www.gpugrid.net/result.php?resultid=32742752

PS
- It probably does not help that that machine is running BOINC 7.18.1. I have had problems with it before. I will try 7.16.6 later.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58372 - Posted: 18 Feb 2022 | 22:14:30 UTC - in response to Message 58371.
Last modified: 18 Feb 2022 | 22:15:49 UTC

PPS - It ran for two minutes on an equivalent Ryzen 3950X running BOINC 7.16.6, and then errored out.

Drago
Send message
Joined: 3 May 20
Posts: 18
Credit: 943,141,953
RAC: 1,799,447
Level
Glu
Scientific publications
wat
Message 58373 - Posted: 22 Feb 2022 | 19:31:41 UTC - in response to Message 58372.

I just ran 4 of the Python CPU task WUs on my Ryzen 7 5800H, Ubuntu 20.04.3 LTS, 16 GB RAM. Each ran on 4 CPU threads at the same time. The first 0.6% took over 10 minutes, then they jumped to 10%, continued a while longer until 17 minutes had passed, and then all errored out at more or less the same point in the task. Here is one example: 32743954

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58374 - Posted: 23 Feb 2022 | 6:32:16 UTC - in response to Message 58373.

A RAIMIS MT task - which accounts for the 4 threads.

And yet -

Run
CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

NVIDIA GeForce RTX 3060 Laptop GPU (4095MB)

Traceback (most recent call last):
File "/var/lib/boinc-client/slots/5/run.py", line 50, in <module>
assert os.path.exists('output.coor')
AssertionError

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58380 - Posted: 24 Feb 2022 | 22:19:06 UTC

I am running two of the Anacondas now. They each reserve four threads, but are apparently only using one of them, since BoincTasks shows 25% CPU usage.

They have been running for two hours, and should complete in 14 hours total, though the estimates are way off and show 12 days. Therefore, they are running high priority even though they should complete with no problem.

Drago
Send message
Joined: 3 May 20
Posts: 18
Credit: 943,141,953
RAC: 1,799,447
Level
Glu
Scientific publications
wat
Message 58381 - Posted: 25 Feb 2022 | 18:29:21 UTC - in response to Message 58374.

Hey Richard. To what extent is my GPU's memory involved in a CPU task?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58382 - Posted: 25 Feb 2022 | 19:21:40 UTC - in response to Message 58381.

Hey Richard. To what extent is my GPU's memory involved in a CPU task?

It shouldn't be - that's why I drew attention to it. I think both AbouH and RAIMIS are experimenting with different applications, which exploit both GPUs and multiple CPUs.

It isn't at all obvious how best to manage a combination like that under BOINC - the BOINC developers only got as far as thinking about either/or, not both together.

So far, Abou seems to have got further down the road, but I'm not sure how much further development is required. We watch and wait, and help where we can.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58383 - Posted: 26 Feb 2022 | 15:08:37 UTC
Last modified: 26 Feb 2022 | 15:13:53 UTC

My first two Anacondas ended OK after 31 hours. But they were _2 and _3.
I am not sure what the error messages mean. Some ended after a couple of minutes, while others went longer.
http://www.gpugrid.net/results.php?hostid=593715

I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58384 - Posted: 27 Feb 2022 | 15:28:01 UTC - in response to Message 58383.

I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.

It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58417 - Posted: 3 Mar 2022 | 16:29:10 UTC - in response to Message 58382.

Hello everyone! I am sorry for the late reply.

Now that most of my jobs seem to complete successfully, we decided to remove the "beta" flag from the app. I would like to thank you all for your help during the past months to reach this point. Obviously I will try to solve any further problem detected. In the future we will try to extend it for Windows, but we are not there yet.

Regarding the app requirements, from now on they will be similar to those in my last batches. In reinforcement learning, in general there is no way around the mixed CPU/GPU usage. Most reinforcement learning environments are powered by CPU, but the machine learning algorithms to teach agents to solve the environments use GPU.

RAIMIS was experimenting with a different application. But the idea is that another beta app will be created for this purpose.

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58464 - Posted: 8 Mar 2022 | 18:06:03 UTC
Last modified: 8 Mar 2022 | 18:53:42 UTC

Is this a record?



Initial runtime estimate for:

e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
Python apps for GPU hosts beta v1.00 (cuda1131) for Windows

Task 32766826

Time to lie back and enjoy the popcorn for ... 11½ years ??!!

Edit - 36 minutes to download 2.52 GB, less than a minute to crash. Ah well, back to the drawing board.

08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished


Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that.

Download from https://sourceforge.net/projects/gnuwin32/files/tar/

NO - that wasn't what it said it was. Looking again.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58465 - Posted: 8 Mar 2022 | 19:37:16 UTC

No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58466 - Posted: 8 Mar 2022 | 23:51:15 UTC
Last modified: 8 Mar 2022 | 23:55:06 UTC

Yeah, the estimates must be astronomical, as I am at over 2 months of Time left at 3/4 completion on 2 tasks.

11:37 hr:min 79.3% 61d2h
10:04 hr:min 73.9% 77d2h

Reaching 74.8% on the 2nd task dropped it down to 74d10h. Around 215d initial ETA?
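
To put those numbers in perspective, here is a small sketch of a naive linear extrapolation from elapsed time and fraction done. It assumes progress is linear in time, which is not how the BOINC client computes its own "Time left" figure; the function name is made up for the example.

```python
def linear_eta_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Remaining hours, assuming progress is strictly linear in time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# The two tasks quoted above: 11:37 at 79.3% and 10:04 at 73.9%.
first = linear_eta_hours(11 + 37 / 60, 0.793)
second = linear_eta_hours(10 + 4 / 60, 0.739)
print(f"task 1: about {first:.1f} h left, task 2: about {second:.1f} h left")
```

On that assumption both tasks have only a few hours to go, which shows how far off estimates of 61 and 77 days are.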

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58467 - Posted: 9 Mar 2022 | 8:54:44 UTC
Last modified: 9 Mar 2022 | 9:22:43 UTC

No need to go back to the drawing board, in principle. Here is what is happening:

1. The PythonGPU app should be stable now and is still only available for Linux. Jobs are being sent there and should work normally.

2. A new app, called PythonGPUbeta, has been deployed for both Linux and Windows. The idea now is to test the Python jobs on Windows, so this app should be the main source of bugs still to solve. Ultimately the idea is to have a common PythonGPU app for both OSes.

3. While PythonGPUbeta accepts Linux and Windows, I expect most errors to come from the Windows part.

Please let me know if any of the above is not correct.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58468 - Posted: 9 Mar 2022 | 9:02:10 UTC - in response to Message 58464.
Last modified: 9 Mar 2022 | 9:28:51 UTC

In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE, and unpack it on the machine. The conda environment is what weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). However, while the environment remains the same there will be no need to re-download it for every job. This is how the acemd app works.
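
The "unpack only when needed" idea can be sketched as follows. This is an illustration under stated assumptions, not the app's real logic: the marker file name `.unpacked_from` and the function `ensure_unpacked` are hypothetical, and the archive is assumed to be trusted (it comes from the project server).

```python
import os
import tarfile


def ensure_unpacked(archive: str, target_dir: str) -> bool:
    """Unpack `archive` into `target_dir` unless that was already done.

    Returns True if extraction happened, False if it was skipped.
    A marker file (hypothetical name) records which archive was unpacked.
    """
    marker = os.path.join(target_dir, ".unpacked_from")
    tag = os.path.basename(archive)
    if os.path.exists(marker):
        with open(marker) as fh:
            if fh.read().strip() == tag:
                return False  # same environment as last time, nothing to do
    os.makedirs(target_dir, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz) by itself.
    with tarfile.open(archive, mode="r:*") as tar:
        tar.extractall(target_dir)  # trusted archive assumed
    with open(marker, "w") as fh:
        fh.write(tag)
    return True
```

Because the downloaded file name already carries a hash suffix, a changed environment would show up as a different name and trigger a fresh extraction.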

We are testing which compression format is best for our purpose. We tested first with a tar.bz2 file. On Linux there was no problem decompressing it.

For Windows, I tested locally on a Windows 10 laptop. I could decompress it successfully with tar.exe.

I am not sure what is happening with the estimates, but the estimation is obviously wrong. The test jobs should download the conda environment only in the first job, decompress it and finally run a short Python program using CPU and GPU. Are the Linux estimates also so exaggerated?
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58469 - Posted: 9 Mar 2022 | 9:07:09 UTC
Last modified: 9 Mar 2022 | 9:32:49 UTC

One problem we are facing, as Richard mentioned, is that versions of Windows before W10 have no tar.exe.

Also, I have seen some jobs with the following error:

tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"


In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH environment variable (which we will test). Also, tar.gz could be a more compatible format for Windows.
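
As a side note on the two formats: the compression can be recognised from the first bytes of the file (gzip starts with 1f 8b, bzip2 with "BZh"), and Python's standard tarfile module reads both without calling an external bzip2 program. The sketch below only illustrates that; the function names are made up and nothing here claims to be what the app does.

```python
import tarfile


def sniff_compression(path: str) -> str:
    """Guess the compression of a file from its magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(3)
    if magic[:2] == b"\x1f\x8b":
        return "gzip"
    if magic == b"BZh":
        return "bzip2"
    return "unknown"


def list_archive(path: str) -> list:
    """List member names of a tar archive, whatever its compression."""
    # mode "r:*" asks tarfile to detect gzip/bzip2/xz transparently
    with tarfile.open(path, mode="r:*") as tar:
        return tar.getnames()
```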
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58470 - Posted: 9 Mar 2022 | 9:35:49 UTC

Don't worry, it's only my own personal drawing board that I'm going back to!

Microsoft has form in this area. I remember buying a commercial copy of WinZip for use with Windows 3 - it arrived by post, on a single floppy disk. Later, they bought the company and incorporated it into Windows. Microsoft tend to do this very late in the day - hence my problems yesterday. I'll have a proper look round later today, and see if I can find a version which handles the bzip2 problem too.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58471 - Posted: 9 Mar 2022 | 9:52:48 UTC - in response to Message 58470.

Thank you very much! I will send a small batch of test jobs as soon as I can to check whether the bzip2 error on Windows 10 is caused by an erroneous PATH variable. The next step will be to try tar.gz, as mentioned.

____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58472 - Posted: 9 Mar 2022 | 11:45:54 UTC

How about some checkpoints? I have a Python task that was nearly completed when an ACEMD4 task downloaded next, with an ETA of something like 8 billion days. It interrupted the Python task: 14 hours of work, and it went back to 10%. I only have a 0.05-day work queue on that client, so the Python task was at least 95% complete.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58473 - Posted: 9 Mar 2022 | 14:20:01 UTC - in response to Message 58472.
Last modified: 9 Mar 2022 | 14:43:41 UTC

Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs, and they seem to have been successful.


PythonGPU task checkpointing was working before. It was discussed previously in the forum. I tested it locally back then and it worked fine. Did checkpointing fail for anyone else? Please let me know if so.


I have sent a small batch of tasks for PythonGPUbeta, to test if some errors on Windows are now solved. Will keep iterating in small batches for the beta app.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58474 - Posted: 9 Mar 2022 | 15:29:20 UTC - in response to Message 58473.
Last modified: 9 Mar 2022 | 15:42:54 UTC

I have a python task for Linux running, recently started.

It's reported that it's checkpointing properly:

CPU time 00:33:10
CPU time since checkpoint 00:01:33
Elapsed time 00:33:27

but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted.

I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see.

Results will be reported in task 32773760 overnight, but I'll post here before that.

Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58475 - Posted: 9 Mar 2022 | 15:37:59 UTC - in response to Message 58474.
Last modified: 9 Mar 2022 | 15:40:32 UTC

Perfect, thanks! It can happen that progress takes a little while to update after a pause.

PythonGPU task progress is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.

After resuming, the script looks for these progress and checkpoint files to continue counting from there.
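
A minimal sketch of that bookkeeping, for anyone curious. The JSON layout, the key `steps_done` and the function names are assumptions made for the example; only the 25M target comes from the description above.

```python
import json
import os

TARGET_STEPS = 25_000_000  # "generally 25M interactions per job"


def load_progress(path: str) -> int:
    """Return the number of interactions already done, or 0 on a fresh start."""
    try:
        with open(path) as fh:
            return int(json.load(fh)["steps_done"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        # Missing or unreadable progress file: start counting from zero.
        return 0


def save_progress(path: str, steps_done: int) -> None:
    """Write the counter atomically so a pause cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"steps_done": steps_done}, fh)
    os.replace(tmp, path)


def fraction_done(steps_done: int, target: int = TARGET_STEPS) -> float:
    """Progress fraction to report, capped at 1.0."""
    return min(steps_done / target, 1.0)
```

On restart, a script of this shape would call `load_progress` first and continue training from the matching checkpoint rather than from zero.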

However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58476 - Posted: 9 Mar 2022 | 16:10:04 UTC - in response to Message 58475.

However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.

Well, it was the only one I had in a suitable state for testing.
And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually at just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495.

The run.log file (which we don't normally get a chance to see) has the ominous line

# WARNING: removed an old file: output.xtc

after a second set of startup details. Perhaps you could pass a message to the appropriate team?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58477 - Posted: 9 Mar 2022 | 16:18:28 UTC - in response to Message 58476.

I will. Thanks a lot for the feedback.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58478 - Posted: 9 Mar 2022 | 23:18:59 UTC - in response to Message 58475.

Perfect thanks! That it takes a little while to update progress after a pause, can happen.

The pythonGPU tasks progress is defined by a target number of interactions between the AI agent and the environment in which it is trained. Generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have been already executed.

After resuming, the script looks for these progress and checkpoint files to continue counting from there.

However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.


Yes it was linux.
The % complete I saw was 100%, then a bit later 10% per BOINCTasks.
Looking at the history on that PC it finished in 14:14 run time, just 11 minutes after the ACEMD4 tasks so it looks like it resumed properly. Thanks for checking.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58479 - Posted: 10 Mar 2022 | 10:41:18 UTC

OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with.

The first task is:

<task>
<application>C:\Windows\System32\tar.exe</application>
<command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

You don't need both a path statement and a hard-coded executable location. The hard-coded location may fail on a machine with non-standard drive assignments.
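
To illustrate the alternative to a hard-coded location, here is how a search along PATH looks in Python using the standard `shutil.which`. It is only a sketch of the idea: the function name `find_extractor` and the candidate list are made up, and the BOINC wrapper itself is not written in Python.

```python
import shutil


def find_extractor(candidates=("tar", "7z"), search_path=None):
    """Return the full path of the first candidate found on PATH, or None.

    `search_path` overrides the PATH environment variable when given.
    """
    for name in candidates:
        found = shutil.which(name, path=search_path)
        if found:
            return found
    return None
```

On Windows, `shutil.which` also tries the extensions listed in PATHEXT, so "tar" would match tar.exe wherever the system directory happens to live.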

It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet).

I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58480 - Posted: 10 Mar 2022 | 10:54:35 UTC

OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations.

ToDo: go find the command line I saw yesterday for doing that in a script.
Check the disk usage limits to ensure all that can happen in the slot directory.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58481 - Posted: 10 Mar 2022 | 11:23:07 UTC

And it's worth a try. I'm going to split that task into two:

<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

I could have piped them, but - baby steps!

I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough.
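
A rough budget shows why. The sketch below uses the sizes quoted earlier (2.48 GB download, 4.91 GB tar) and makes two assumptions that are not stated anywhere above: that those figures are binary gigabytes, and that the unpacked files take about as much space as the tar itself.

```python
GIB = 1024 ** 3

GZ_BYTES = 2.48 * GIB          # downloaded .tar.gz
TAR_BYTES = 4.91 * GIB         # intermediate .tar
EXTRACTED_BYTES = TAR_BYTES    # assumption: unpacked tree is about the tar's size
LIMIT_BYTES = 10 * 10 ** 9     # 10 decimal GB


def peak_usage(delete_gz_first: bool) -> float:
    """Peak bytes on disk during the two-step unpack."""
    # Step 1: the .gz and the .tar coexist while decompressing.
    step1 = GZ_BYTES + TAR_BYTES
    # Step 2: the .tar and the unpacked tree coexist, plus the .gz if it is kept.
    step2 = TAR_BYTES + EXTRACTED_BYTES + (0 if delete_gz_first else GZ_BYTES)
    return max(step1, step2)


for delete_gz in (False, True):
    peak = peak_usage(delete_gz)
    print(f"delete .gz first={delete_gz}: peak {peak / 1e9:.2f} GB, "
          f"fits in limit={peak <= LIMIT_BYTES}")
```

Under those assumptions the limit is exceeded even if the .gz is deleted before the second step, so the two-step route needs a larger allowance either way.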

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58483 - Posted: 10 Mar 2022 | 11:47:57 UTC
Last modified: 10 Mar 2022 | 11:49:40 UTC

I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder.
This morning I noticed a task running for 6.5 hours with no progress, no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58484 - Posted: 10 Mar 2022 | 11:50:21 UTC
Last modified: 10 Mar 2022 | 11:56:21 UTC

Damn. Where did that go wrong?

application C:\Windows\System32\tar.exe missing

Anyone else who wants to try this experiment can try https://www.7-zip.org/ - looks as if the license would even allow the project to distribute it.

Edit - I edited the job.xml file while the previous task was finishing, and then stopped BOINC to increase the disk limit. On restart, BOINC must have noticed that the file had changed, and it downloaded a fresh copy. Near miss.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58485 - Posted: 10 Mar 2022 | 13:42:43 UTC
Last modified: 10 Mar 2022 | 14:19:37 UTC

application "C:\Program Files\7-Zip\7z" missing

Make that "C:\Program Files\7-Zip\7z.exe"

Or maybe not.

application "C:\Program Files\7-Zip\7z.exe" missing

Isn't the damn wrapper clever enough to remove the quotes I put in there to protect the space in "Program Files"?
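
The error messages are consistent with the quoted string being handed to a file lookup exactly as written, quotes included. Whether the wrapper really does that is an assumption until someone reads its source, but the effect is easy to reproduce in Python with a throwaway directory:

```python
import os
import tempfile


def normalise_app_path(raw: str) -> str:
    """Strip surrounding whitespace and double quotes from a path string."""
    return raw.strip().strip('"')


with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "Program Files")
    os.makedirs(folder)
    exe = os.path.join(folder, "7z.exe")
    open(exe, "w").close()  # empty stand-in for the real executable

    quoted = f'"{exe}"'
    # The quotes become part of the file name being looked up, so this fails.
    quoted_found = os.path.exists(quoted)
    # Without the quotes the embedded space is no problem at all.
    plain_found = os.path.exists(normalise_app_path(quoted))
    print(quoted_found, plain_found)
```

Quoting protects a space from a command-line parser; a function that takes the whole path as one string needs no such protection.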

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58486 - Posted: 10 Mar 2022 | 15:02:08 UTC

Using tar.exe in W10 and W11 seems to work now.

However, it is true that:

a) some machines do not have tar.exe. My initial idea was that older versions of Windows could download tar.exe, but it seems that this does not work.

b) The C:\Windows\System32\tar.exe path is hardcoded. I understand that ideally we should add to PATH all possible paths where this executable could be found, right?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58487 - Posted: 10 Mar 2022 | 15:40:34 UTC - in response to Message 58486.

On this particular Windows 7 machine, I have:

PATH=
C:\Windows\system32;
C:\Windows;
C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;;
C:\Program Files\Process Lasso\;

- I've split that into separate lines for clarity, but it's one single environment variable that has been added to by various installers over the years.

For a native Windows system component, I wouldn't have thought a path was necessary at all - Windows should handle all that. That's what path variables are for. But maybe the wrapper app is so dumb that it just throws the exact string it parses from job.xml at a file_open function? I'll have a look at the code.
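
For reference, this is roughly what any PATH consumer has to do with a value like the one above, including the empty entry left behind by the doubled semicolon. The helper name is made up and the string is the PATH quoted in this post.

```python
def split_path(value: str, sep: str = ";") -> list:
    """Split a PATH-style string into directories, dropping empty entries."""
    return [entry.strip() for entry in value.split(sep) if entry.strip()]


WINDOWS_PATH = (
    r"C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;"
    r"C:\Windows\System32\WindowsPowerShell\v1.0\;;"
    r"C:\Program Files\Process Lasso\;"
)

for directory in split_path(WINDOWS_PATH):
    print(directory)
```

A lookup would then try each directory in order for the requested executable, which is the work a bare "tar.exe" in job.xml would rely on.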

I've got two remaining thoughts left: try Program [space] Files without any quotes; or stick a copy of 7z.exe in Windows/system32 (although mine's a 64-bit version...), and call it explicitly from there. I don't think it'll have anywhere to hide from that...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58488 - Posted: 10 Mar 2022 | 17:53:57 UTC

Yay! That's what I wanted to see:

17:49:09 (21360): wrapper: running C:\Program Files\7-Zip\7z.exe (x windows_x86_64__cuda1131.tar.gz)

7-Zip [64] 15.14 : Copyright (c) 1999-2015 Igor Pavlov : 2015-12-31

Scanning the drive for archives:
1 file, 2666937516 bytes (2544 MiB)

Extracting archive: windows_x86_64__cuda1131.tar.gz

And I've got v1.04 in my sandbox...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58489 - Posted: 10 Mar 2022 | 18:27:31 UTC

But not much more than that. After half an hour, it's got as far as:

Everything is Ok

Files: 13722
Size: 5270733721
Compressed: 5281648640
18:02:00 (21360): C:\Program Files\7-Zip\7z.exe exited; CPU time 6.567642
18:02:00 (21360): wrapper: running python.exe (run.py)
WARNING: The script shortuuid.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script normalizer.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts wandb.exe and wb.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires attrs>=17.4.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires packaging, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
aiohttp 3.7.4.post0 requires attrs>=17.3.0, which is not installed.
WARNING: The scripts pyrsa-decrypt.exe, pyrsa-encrypt.exe, pyrsa-keygen.exe, pyrsa-priv2pub.exe, pyrsa-sign.exe and pyrsa-verify.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script jsonschema.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script gpustat.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts ray-operator.exe, ray.exe, rllib.exe, serve.exe and tune.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
WARNING: The script f2py.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Appending key for api.wandb.ai to your netrc file: D:\BOINCdata\slots\5/.netrc
wandb: Currently logged in as: rl-team-upf (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in D:\BOINCdata\slots\5\wandb\run-20220310_181709-mxbeog6d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run MontezumaAgent_e1a12
wandb: View project at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta
wandb: View run at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta/runs/mxbeog6d

and doesn't seem to be getting any further. I'll see if it's moved on after dinner, and might abort it if it hasn't.

Task is 32782603

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58490 - Posted: 10 Mar 2022 | 18:54:03 UTC

Then, lots of iterations of:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\cudnn_cnn_train64_8.dll" or one of its dependencies.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "D:\BOINCdata\slots\5\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\BOINCdata\slots\5\run.py", line 23, in <module>
import torch
File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
raise err

I've increased it ten-fold, but that requires a reboot - and the task didn't survive. Trying one last time, then it's 'No new Tasks' for tonight.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58491 - Posted: 10 Mar 2022 | 19:05:20 UTC

BTW, yes - the wrapper really is that dumb.

https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L727

It just plods along, from beginning to end, copying it byte by byte. The only thing it considers is which way the slashes are pointing.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 486
Credit: 11,541,308,617
RAC: 4,382,658
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58492 - Posted: 11 Mar 2022 | 0:16:42 UTC

I managed to complete 2 of these WUs successfully. They still need a lot of work done. They have low GPU usage, and they make the BOINC Manager slow, sluggish and unresponsive.

https://www.gpugrid.net/result.php?resultid=32784274

https://www.gpugrid.net/result.php?resultid=32783598

They were a pain to finish!!!!!

And what for? Only 3000 points for 882 days' worth of work per WU!!!!!!



mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58493 - Posted: 11 Mar 2022 | 0:48:05 UTC - in response to Message 58483.

I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to the system32 folder.
This morning I noticed a task running for 6.5 hours with no progress, no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132


I'm disabling Python beta on this W10 PC: another task has 11+ hours gone.
https://www.gpugrid.net/result.php?resultid=32780319

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58494 - Posted: 11 Mar 2022 | 8:49:55 UTC - in response to Message 58490.
Last modified: 11 Mar 2022 | 8:59:43 UTC

Yes, I have seen this error on some other machines that could unpack the file with tar.exe, though only on a few of them. So it is an issue in the Python script. Today I will be looking into it. It does not happen on Linux with the same code.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58495 - Posted: 11 Mar 2022 | 8:58:52 UTC - in response to Message 58492.

Yes, regarding the workload, I have been testing the tasks with low GPU/CPU usage. I was interested in checking if the conda environment was successfully unpacked and the python script was able to complete a few iterations. It will be increased as soon as this part works, as well as the points.

As for the completely wrong duration estimate, I will look into what can be done. I am not sure how BOINC estimates it. Could someone please confirm whether it is also wrong on Linux or whether it is only a Windows issue?


____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58496 - Posted: 11 Mar 2022 | 9:15:30 UTC

Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58497 - Posted: 11 Mar 2022 | 9:28:05 UTC - in response to Message 58494.
Last modified: 11 Mar 2022 | 9:46:21 UTC

Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code.

I was a bit suspicious about the 'paging file too small' error - I didn't even think Windows applications could get information about what the current setting was. I'd suggest correlating the machines with this error, with their reported physical memory. Mine is 'only' 8 GB - small by modern standards.

It looks like there may be some useful clues in

https://discuss.pytorch.org/t/winerror-1455-the-paging-file-is-too-small-for-this-operation-to-complete/131233

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58498 - Posted: 11 Mar 2022 | 9:34:16 UTC - in response to Message 58496.

Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?

That's certainly a part of it, but it's a very long, complicated, and historical story. It will affect any and all platforms, not just Windows, and other data as well as rsc_fpops_est. And it's also related to historical decisions by both BOINC and GPUGrid.

I'll try and write up some bedtime reading for you, but don't waste time on it in the meantime - there won't be an easy 'magic bullet' to fix it.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58499 - Posted: 11 Mar 2022 | 10:21:10 UTC - in response to Message 58497.

Yes I was looking at the same link. Seems related to limited memory. I might try to run the suggested script before running the job, which seems to mitigate the problem.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58500 - Posted: 11 Mar 2022 | 13:50:18 UTC - in response to Message 58498.

Runtime estimation – and where it goes wrong

The estimates we see on our home computers are made up of three elements. They are:
The SIZE of a task – rsc_fpops_est
The SPEED of the device that’s calculating the result
One or more correction tweaks, designed to smooth off the rough edges.

The original system

In the early days, all BOINC projects ran on CPUs, and almost all the CPUs in use were single-core. The speed of that CPU was measured by a derivation of the Whetstone benchmark: this was originally designed to measure hardware speeds only, and deliberately excluded software optimisation. For scientific research, careful optimisation is a valid technique (provided it isn’t done at the expense of accuracy).

There was a general (but unspoken) assumption that projects would be running a single type of research task, using a single application. So the rough edges were smoothed by something called DCF (duration correction factor). That kept track of that single application, running on that single CPU, and gently adjusted it until the estimates were pretty good. It worked. The adjustments were calculated by, and stored on, the local computer.

The revised system

Starting in 2008, BOINC was adapted to support applications that ran on GPUs – GPUGrid and SETI@home first, others followed. There never was any attempt to benchmark GPUs, so the theoretical baseline speed of a GPU application was taken to be a figure derived from the hardware architecture, notably the number of shaders and the clock speed. This was known as “peak FLOPS”, or – to some of us – “marketing FLOPS”. No way has any programmer ever been able to write a scientific program which uses every clock cycle of every shader, with no overhead for synchronisation or data transfer. Whatever.

At the same time, projects kept their CPU apps running, and many developed multiple research streams using different apps. A single-valued DCF couldn’t smooth off all the different rough edges at the same time.

There’s nothing in principle to stop the BOINC client keeping track of multiple application+device combinations, and such a system was in fact developed by a volunteer. But it was rejected by David Anderson in Berkeley, who devised his own system of Runtime Estimation, keeping track of the necessary tweaks on the project server. This was intended to replace client-based DCFs entirely, although the old system was retained for historical compatibility.

The implications for GPUGrid

As I think we all know, GPUGrid uses rsc_fpops_est, but I don’t think it’s realised quite how fundamental it is to the whole inverted pyramid. If tasks run much faster than their declared fpops, the only conclusion that BOINC can draw is the application speed has suddenly become much faster, and it tries to adapt accordingly.

GPUGrid has kept both of the adjustment methods active. If you look at any of our computer details, you will see that it contains a link to show application details: the smoothed average of all our successful tasks with each application. The critical one here is APR, or ‘average processing rate’. That’s the device+application speed, in GFlops. But on the computer details page, you’ll also see the DCF listed. Nominally, this should be 1, replaced by APR – but here, usually it isn’t.

The implications? APR works adequately for long term, steady, production work. But it fails during periods of rapid change and testing.

1) APR is disregarded entirely when a new application version is activated on the server. It starts again from scratch, and the initial estimates are – questionable. In fact, I don’t have a clue what speed is assumed for the first few tasks allocated.

2) It kicks in in two stages. First, when 100 tasks have been completed for the whole ensemble, and again when each individual computer reaches 11 completed tasks. Note that ‘completed’ here means a normal end-of-run plus a validated result. Some app versions never achieve that!

Different GPUs run at very different speeds, and the first 100 tasks returned normally come back from the fastest cards. That skews the average speed. In the worst case, the first hundred back can set a standard which lesser cards can’t attain – so they are stopped by ‘run time exceeded’, can never achieve the necessary 11 validations to set their own, lower, bar, and are excluded for good. The same can happen if deliberately short test tasks are put through early on, without an adjusted rsc_fpops_est: again, an unfeasibly fast target is set, and no-one can complete full-length tasks.
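The two-stage switch-over described above can be sketched roughly as follows. This only illustrates the description in this post and is not taken from the BOINC server code: the function name, the argument names and the exact comparisons are assumptions; only the thresholds of 100 and 11 come from the text.

```python
# Illustrative sketch only; not copied from the BOINC server source.
PROJECT_THRESHOLD = 100  # validated tasks across the whole ensemble (from the post)
HOST_THRESHOLD = 11      # validated tasks on one individual computer (from the post)


def pick_speed_estimate(host_validated, host_apr,
                        project_validated, project_apr,
                        initial_guess):
    """Return (source, flops) used to estimate runtimes for one app version.

    All parameter names are hypothetical; speeds are in FLOPS.
    """
    if host_validated >= HOST_THRESHOLD:
        # Second stage: this host has enough validated results of its own.
        return "host APR", host_apr
    if project_validated >= PROJECT_THRESHOLD:
        # First stage: fall back on the average over all hosts, which is
        # skewed towards whichever cards reported first.
        return "project-wide average", project_apr
    # Brand-new app version: nothing measured yet.
    return "initial guess", initial_guess
```

A slow card that keeps failing with 'run time exceeded' never gets past `host_validated >= HOST_THRESHOLD`, so in this sketch it stays stuck on the project-wide figure.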

Sorry – I’ve been called out this afternoon, so I’ve dashed that off much quicker than I intended. I’ll leave it there for now, and we can all discuss the way forward later.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58501 - Posted: 11 Mar 2022 | 21:25:14 UTC - in response to Message 58500.

Thank you very much for the explanation Richard, very helpful actually.

I have been using short test tasks to catch bugs in the early stages of the job. That might have caused problems, although I guess we can adjust rsc_fpops_est and reset statistics later. The idea is to have long term, steady, production work after the tests.

However, I don't fully understand how that could cause estimates of hundreds of days. In any case, the most reliable information for the host is then the progress percentage, which should be correct.

I remember the ‘run time exceeded’ error was happening previously in the app and we had to adjust the rsc_fpops_est parameter. Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app? The idea is that PythonGPUbeta eventually becomes the sole Python app, running the same Linux jobs PythonGPU is running now plus Windows jobs.

____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58502 - Posted: 11 Mar 2022 | 21:58:48 UTC - in response to Message 58501.
Last modified: 11 Mar 2022 | 22:01:23 UTC

Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
The rsc_fpops_est should be set accordingly for the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, even though the same app processes them.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58505 - Posted: 12 Mar 2022 | 8:56:44 UTC - in response to Message 58502.

Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
The rsc_fpops_est should be set accordingly for the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, even though the same app processes them.

Correct.

Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

In the meantime, we're working through a glut of ACEMD3 tasks, and here's how they arrive:

12/03/2022 08:23:29 | GPUGRID | [sched_op] NVIDIA GPU work request: 11906.64 seconds; 0.00 devices
12/03/2022 08:23:30 | GPUGRID | Scheduler request completed: got 2 new tasks
12/03/2022 08:23:30 | GPUGRID | [sched_op] estimated total NVIDIA GPU task duration: 306007 seconds

So, I'm asking for a few hours of work, and getting several days. Or so BOINC says.

This is Windows host 45218, which is currently showing "Task duration correction factor 13.714405". (It was higher a few minutes ago, when that work was fetched - over 13.84)

I forgot to mention yesterday that in the first phase of BOINC's life, both your server and our clients took account of DCF, so the 'request' and 'estimated' figures would have been much closer. But when the APR code was added in 2010, the DCF code was removed from the servers. So your server knows what my DCF is, but it doesn't use that information.

So the server probably assessed that each task would last about 11,055 seconds. That's why it added the second task to the allocation: it thought the first one didn't quite fill my request for 11,906 seconds.

In reality, this is a short-running batch - although not marked as such - and the last one finished in 4,289 seconds. That's why DCF is falling after every task, though slowly.
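The 11,055-second figure can be recovered from the log lines above. The sketch below assumes that the client's total is simply the server-side per-task estimate multiplied by the DCF and by the number of tasks, and it uses 13.84 for the DCF at fetch time (the post only says "over 13.84"); the variable names are made up for the illustration.

```python
# Figures copied from the event log and the host page quoted above.
requested_seconds = 11906.64      # what the client asked for
client_total_estimate = 306007.0  # client's estimate for the tasks it received
tasks_received = 2
dcf_at_fetch = 13.84              # assumption: "over 13.84" when the work was fetched

# Assumption: client estimate = server-side estimate * DCF * number of tasks.
server_per_task = client_total_estimate / tasks_received / dcf_at_fetch
print(f"server-side estimate per task: {server_per_task:,.0f} s")

# One task alone falls just short of the request, so a second one is added.
print("one task fills the request:", server_per_task >= requested_seconds)
```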

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58506 - Posted: 12 Mar 2022 | 21:01:37 UTC - in response to Message 58494.

Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code.


Having tar.exe wasn't enough. I later saw a popup in W10 saying archiveint.dll was missing.

I had two python tasks in linux error out in ~30min with
15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

That PC has python 2.7.17 and 3.6.8 installed.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58508 - Posted: 13 Mar 2022 | 17:19:19 UTC - in response to Message 58505.

Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

Caught one!

Task e1a5-ABOU_pythonGPU_beta2_test16-0-1-RND7314_1

Host is 43404. Windows 7. It has two GPUs, and GPUGrid is set to run on the other one, not as shown. The important bits are

CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 472.12, CUDA version 11.4, compute capability 7.5, 4096MB, 3032MB available, 5622 GFLOPS peak)

DCF is 8.882342, and the task shows up as:



Why? This is what I got from the server, in the sched_reply file:

<app_version>
<app_name>PythonGPUbeta</app_name>
<version_num>104</version_num>
...
<flops>47361236228.648697</flops>
...
<workunit>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
...

1,000,000,000,000,000,000 fpops, at 47 GFLOPS, would take 21,114,313 seconds, or 244 days. Multiply in the DCF, and you get the 2170 days shown.
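The same working-out as a short script, for anyone who wants to plug in their own numbers. The function name is made up for this sketch; the formula (size divided by speed, multiplied by the DCF) is just the arithmetic used in the paragraph above.

```python
def estimated_runtime_seconds(rsc_fpops_est, flops, dcf=1.0):
    """Task size / assumed speed, scaled by the duration correction factor."""
    return rsc_fpops_est / flops * dcf


rsc_fpops_est = 1e18         # <rsc_fpops_est> from sched_reply
flops = 47361236228.648697   # <flops> from sched_reply
dcf = 8.882342               # DCF shown for this host

raw = estimated_runtime_seconds(rsc_fpops_est, flops)
corrected = estimated_runtime_seconds(rsc_fpops_est, flops, dcf)

SECONDS_PER_DAY = 86400
print(f"raw estimate: {raw:,.0f} s, {int(raw // SECONDS_PER_DAY)} whole days")
print(f"with DCF:     {int(corrected // SECONDS_PER_DAY)} whole days")
```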

According to the application details page, this host has completed one 'Python apps for GPU hosts beta 1.04 windows_x86_64 (cuda1131)' task (new apps always go right down to the bottom of that page). It recorded an APR of 1279539, which is bonkers the other way - these are GFlops, remember. It must have been task 32782603, which completed in 781 seconds.
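Running the arithmetic the other way round shows where that APR comes from. This sketch assumes the APR for a single task is simply the declared rsc_fpops_est divided by the elapsed time; the small gap to the recorded 1279539 would be consistent with an elapsed time slightly above the rounded 781 seconds, but that is a guess.

```python
rsc_fpops_est = 1e18    # declared task size, unchanged for the short test task
elapsed_seconds = 781   # reported runtime of task 32782603
recorded_apr = 1279539  # GFLOPS, from the application details page

# Assumption: apparent speed = declared size / elapsed time.
apparent_gflops = rsc_fpops_est / elapsed_seconds / 1e9
print(f"apparent speed: {apparent_gflops:,.0f} GFLOPS")

# Relative difference between this back-of-envelope figure and the recorded APR.
relative_gap = abs(apparent_gflops - recorded_apr) / recorded_apr
print(f"gap to recorded APR: {relative_gap:.2%}")
```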

So, lessons to be learned:

1) A shortened test task, described as running for the full-run number of fpops, will register an astronomical speed. If anyone completes 11 tasks like that, that speed will get locked into the system for that host, and will cause the 'runtime limit exceeded' error.

2) BOINC is extremely bad - stupidly bad - at generating a first guess for the speed of a 'new application, new host' combination. It's actually taken precisely one-tenth of the speed of the acemd3 application on this machine, which might be taken as a "safe working assumption" for the time being. I'll try to check that in the server code.

Oooh - I've let it run, and BOINC has remembered how I set up 7-Zip decompression last week. That's nice.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58509 - Posted: 13 Mar 2022 | 17:23:05 UTC

But it hasn't remembered the increased disk limit. Never mind - nor did I.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58510 - Posted: 14 Mar 2022 | 8:42:00 UTC - in response to Message 58506.

Right now, the PythonGPU app works by dividing the job into 2 subtasks:
1- first, installing conda and creating the conda environment.
2- second, running the python script.

The error

15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing


means that after 1800 seconds, the conda environment was not yet created for some reason. This could be because the conda dependencies could not be downloaded in time or because the machine was running the installation process more slowly than expected. We set this time limit of 30 mins because in theory it is plenty of time to create the environment.

However, in the new version (the current PythonGPUBeta), we send the whole conda environment compressed and simply unpack it on the machine. Therefore this error, which indeed still happens every now and then, should disappear.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58511 - Posted: 14 Mar 2022 | 8:55:03 UTC - in response to Message 58508.

ok, so my plan was to run at least a few more batches of test jobs. Then start the real tasks.

I understand now that if some machines have by then run several test tasks, that will create an estimation problem. Does resetting the credit statistics help? Would it be better to create a new app for real jobs once the testing is finished, so that the statistics are consistent and, in the long term, BOINC estimates the durations better?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58512 - Posted: 14 Mar 2022 | 10:29:52 UTC - in response to Message 58511.

My gut feeling is that it would be better to deploy the finished app (after all testing seems to be complete) as a new app_version. We would have to go through the training process for APR one last time, but then it should settle down.

I've seen the reference to resetting the credit statistics before, but only some years ago in scanning the documentation. I've never actually seen the console screen you use to control a BOINC server, let alone operated one for real, so I don't know whether you can control the reset to a single app_version, or whether you have to nuke the entire project - best not to find out the hard way.

You're right, of course - the whole Runtime Estimation (APR) structure is intimately bound up with the CreditNew tools, also introduced in 2010. So the credit reset is likely to include an APR reset - but I'd hold that back for now.

I see you've started sending out v1.05 betas. One has arrived on one of my Linux machines, and again, the estimated speed is exactly one-tenth of the acemd3 speed - with extreme precision, to the last decimal place:

<flops>707593666701.291382</flops>
<flops>70759366670.129135</flops>

That must be deliberate.
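A quick check of the ratio, using Decimal so that the comparison itself does not add any floating-point noise. Which value belongs to which app is taken from the sentence above (the smaller one being the new beta); the variable names are just for this sketch.

```python
from decimal import Decimal

acemd3_flops = Decimal("707593666701.291382")  # first <flops> value above
beta_flops = Decimal("70759366670.129135")     # second <flops> value above

ratio = acemd3_flops / beta_flops
print(ratio)

# Residual after scaling the beta figure back up by ten.
residual = beta_flops * 10 - acemd3_flops
print(residual)
```

The residual is a few hundred-thousandths of a FLOP on a value of about 7e11, which is below the resolution of a double-precision number of that size, so the two figures are as equal as the file format can show.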

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58513 - Posted: 14 Mar 2022 | 11:21:42 UTC - in response to Message 58511.
Last modified: 14 Mar 2022 | 11:22:20 UTC

Would it be better to create a new app for real jobs once the testing is finished?
Based on the last few days' discussion here, I've understood the purpose of the former short and long queue from GPUGrid's perspective:
By separating the tasks into two queues based on their length, the project's staff didn't have to bother setting the rsc_fpops_est value for each and every batch (note that the same app was assigned to each queue). The two queues used different rsc_fpops_est values (constant across batches), so BOINC's runtime estimate could not drift far enough in either queue to trigger the "won't finish on time" or the "run time exceeded" situation.
Perhaps this practice should be put into operation again, even at a finer level of granularity (S, M, L tasks, or even XS and XL tasks).

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 486
Credit: 11,541,308,617
RAC: 4,382,658
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58518 - Posted: 14 Mar 2022 | 23:14:08 UTC

I am getting "Disk usage limit exceeded" error.

https://www.gpugrid.net/result.php?resultid=32808038

I do have 400 GB reserved for BOINC.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58519 - Posted: 15 Mar 2022 | 16:40:36 UTC - in response to Message 58518.

I believe the "Disk usage limit exceeded" error is not related to the machine's resources; it is defined by an adjustable parameter of the app. The conda environment plus all the other files might be over this limit. I will review the current value, as we might have to increase it. Thanks for pointing it out!
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58524 - Posted: 17 Mar 2022 | 9:59:07 UTC

After a day out running a long acemd3 task, there's good news and bad news.

The good news: runtime estimates have reached sanity. The magic numbers are now

<flops>336636264786015.625000</flops>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

That ends up with an estimated runtime of about 9 hours - but at the cost of a speed estimate of 336,636 GFlops. That's way beyond a marketing department's dream.

Either somebody has done open-heart surgery on the project's database (unlikely and unwise), or BOINC now has enough completed tasks for v1.05 to start taking notice of the reported values.

The bad news: I'm getting errors again.

ModuleNotFoundError: No module named 'gym'

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58527 - Posted: 18 Mar 2022 | 13:05:44 UTC

v1.06 is released and working (very short test tasks only).

Watch out for:
Another 2.46 GB download
Estimates are back up to multiple years

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58528 - Posted: 18 Mar 2022 | 13:43:25 UTC - in response to Message 58524.

The latest version should fix this error.

ModuleNotFoundError: No module named 'gym'


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58529 - Posted: 18 Mar 2022 | 15:24:55 UTC - in response to Message 58528.
Last modified: 18 Mar 2022 | 16:19:46 UTC

I have task 32836015 running - showing 50% after 30 minutes. That looks like it's giving the maths a good work-out.

Edit - actually, it's not doing much at all.


You should be on NVidia device 1 - but cool, low power, 0% usage. No checkpoint, nothing written to stderr.txt in an hour and a half.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58534 - Posted: 18 Mar 2022 | 16:53:01 UTC - in response to Message 58529.
Last modified: 18 Mar 2022 | 16:54:58 UTC

For now I am just trying to see the jobs finish.. I am not even trying to make them run for a long time. Jobs should not even need checkpoints, should last less than 15 mins.

So weird, some other jobs on Windows machines from the same batch managed to finish. For example those with result ids 32835825, 32836020 or 32835934.

I don't understand why it works on some Windows machines and fails on others, sometimes without complaining about anything. And it works fine locally on my Windows laptop.

Does Windows have trouble with multiprocessing? I need to add many more checkpoints in the scripts, I guess. Pretty much after every line of code..
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58536 - Posted: 18 Mar 2022 | 17:41:35 UTC - in response to Message 58534.

Err, this particular task is running on Linux - specifically, Mint v20.3

It ran the first short task OK at lunchtime - see Python apps for GPU hosts beta on host 508381. I think I'd better abort it while we think.

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 2,757,047,630
RAC: 1,071,191
Level
Phe
Scientific publications
wat
Message 58537 - Posted: 20 Mar 2022 | 12:08:10 UTC

This task https://www.gpugrid.net/result.php?resultid=32841161 has been running for nearly 26 hours now. It is the first Python beta task I have received that appears to be working. Green-With-Envy shows intermittent low activity on my 1080 GPU and BoincTasks shows 100% CPU usage. It checkpointed only once several minutes after it started and has shown 50% complete ever since.

Should I let this task continue or abort it?

(Linux Mint, 1080 driver is 510.47.03)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58538 - Posted: 20 Mar 2022 | 12:35:55 UTC - in response to Message 58537.

Sounds just like mine, including the 100% CPU usage - that'll be the wrapper app, rather than the main Python app.

One thing I didn't try, but only thought about afterwards, is to suspend the task for a moment and then allow it to run again. That has re-vitalised some apps at other projects, but is not guaranteed to improve things: it might even cause it to fail. But if it goes back to 0% or 50%, and doesn't move further, it's probably not going anywhere. I'd abort it at that point.

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 2,757,047,630
RAC: 1,071,191
Level
Phe
Scientific publications
wat
Message 58539 - Posted: 20 Mar 2022 | 13:12:02 UTC - in response to Message 58538.

Well, after a suspend and allowing it to run, it went back to its checkpoint and has shown no progress since. I will abort it. Keep on learning....

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58540 - Posted: 21 Mar 2022 | 8:21:21 UTC - in response to Message 58538.

ok so it gets stuck at 50%. I will be reviewing it today. Thanks for the feedback.

It also seems to fail in most Windows cases without reporting any error.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58541 - Posted: 21 Mar 2022 | 12:39:20 UTC
Last modified: 21 Mar 2022 | 13:03:38 UTC

Got a new one - the other Linux machine, but very similar. Looks like you've put some debug text into stderr.txt:

12:28:16 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
12:31:39 (482274): /bin/tar exited; CPU time 192.149659
12:31:39 (482274): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Sanity check, make sure that logging matches execution
Check if this is a restarted job
Define Train Vector of Envs
Define RL training algorithm
Look for available model checkpoint in log_dir - node failure case
Define RL Policy
Define rollouts storage
Define scheme

but nothing new has been added in the last five minutes. Showing 50% progress, no GPU activity. I'll give it another ten minutes or so, then try stop-start and abort if nothing new.

Edit - no, no progress. Same result on two further tasks. All the quoted lines are written within about 5 seconds, then nothing. I'll let the machine do something else while I go shopping...

Tasks for host 132158

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58549 - Posted: 21 Mar 2022 | 14:51:37 UTC - in response to Message 58541.
Last modified: 21 Mar 2022 | 15:45:30 UTC

Ok so I have seen 3 main errors in the last batches:


1. The one reported by Bedrich Hajek ("Disk usage limit exceeded"). We have now increased the amount of disk space allotted by BOINC to each task and I believe, based on the last batch I sent, that this error is gone now.


2. The "older" Windows machines do not have the tar.exe application and therefore can not unpack the conda environment. I know Richard did some research into that, but had to download 7-Zip. Ideally I would like the app to be self-contained. Maybe we can send the 7-Zip program with the app, I will have to research if that is possible.

3. The job getting stuck at 50%. I did add some debug messages in the last batches and I believe I know more or less where in the code the script gets stuck. I am still looking into it. Will also check recent results to see if there is any pattern when this error happens. Note that there is no checkpoint because it is a short task that gets stuck, so since the training is not progressing new checkpoints are not getting saved.
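
A minimal sketch of the kind of debug marker mentioned in point 3, assuming a hypothetical helper called log_stage (it is not taken from the actual run.py). Each message is timestamped and flushed right away, so the last line in stderr.txt should show the last stage reached before a stall:

```python
import sys
import time


def log_stage(message, stream=None):
    """Write a timestamped stage marker and flush it immediately.

    Hypothetical helper for illustration, not the project's real code.
    """
    stream = sys.stderr if stream is None else stream
    stamp = time.strftime("%H:%M:%S")
    stream.write(f"{stamp} {message}\n")
    stream.flush()  # make the marker visible even if the next call hangs
```

Calling log_stage("Define scheme") just before each setup step would narrow a hang down to the step that follows the last marker written.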
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58550 - Posted: 24 Mar 2022 | 10:05:39 UTC - in response to Message 58549.
Last modified: 24 Mar 2022 | 10:09:32 UTC

We have updated to a new app version for windows that solves the following error:

application C:\Windows\System32\tar.exe missing


Now we send the 7z.exe (576 KB) file with the app, which allows unpacking the other files without relying on the host machine having tar.exe (which is only in Windows 11 and the latest builds of Windows 10).
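
As a rough illustration of the fallback idea (not the wrapper's real job configuration), the sketch below picks a system tar when one is on PATH and otherwise a 7-Zip executable shipped next to the app. The function name unpack_command, the default path "7z.exe" and the exact switches are assumptions:

```python
import shutil


def unpack_command(archive, bundled_7z="7z.exe", which=shutil.which):
    """Return the argv list for unpacking `archive`.

    Prefers a system tar if one is on PATH, otherwise falls back to a
    bundled 7-Zip. Paths and switches are assumptions for illustration.
    """
    tar = which("tar")
    if tar is not None:
        return [tar, "xf", archive]
    # 7-Zip: "x" extracts with full paths, "-y" answers yes to all prompts.
    return [bundled_7z, "x", archive, "-y"]
```

Note that a compressed tarball such as a .tar.bz2 needs two 7-Zip passes (one to get the .tar, one to unpack it), so a real job file would likely run the fallback command twice.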

I just sent a small batch of short tasks this morning to test and so far it seems to work.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58551 - Posted: 24 Mar 2022 | 10:14:38 UTC

Task 32868822 (Linux Mint GPU beta)

Still seems to be stalling at 50%, after "Define scheme". bin/python run.py is using 100% CPU, plus over 30 threads from multiprocessing.spawn with too little CPU usage to monitor (shows as 0.0%). No GPU app listed by nvidia-smi.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58552 - Posted: 24 Mar 2022 | 10:24:01 UTC - in response to Message 58551.
Last modified: 24 Mar 2022 | 10:26:18 UTC

Do you know by chance if this same machine works fine with PythonGPU tasks even if it fails in the PythonGPUBeta ones?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58553 - Posted: 24 Mar 2022 | 11:01:02 UTC - in response to Message 58552.
Last modified: 24 Mar 2022 | 11:26:25 UTC

Yes, it does. Most recent was:

e1a5-ABOU_rnd_ppod_avoid_cnn13-0-1-RND6436_3

Three failed before me, but mine was OK.

Edit: In relation to that successful task, BOINC only returns the last 64 KB of stderr.txt - so that result starts in the middle of the file (that's the bit that's most likely to contain debug information after a crash). I'll try to capture the initial part of the file next time I run one of those tasks, for reference.
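
To make the 64 KB point concrete, here is a small sketch that splits a local copy of stderr.txt into the part that would be dropped and the tail that would be kept. The function name split_stderr is hypothetical, and the cut is assumed to fall exactly 64 KB from the end of the file:

```python
def split_stderr(path, limit=64 * 1024):
    """Split a file into (dropped_prefix, kept_tail) at `limit` bytes from the end."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    with open(path, "rb") as handle:
        data = handle.read()
    if len(data) <= limit:
        return b"", data  # small file: nothing is dropped
    return data[:-limit], data[-limit:]
```

Saving the first element while the task is still in its slot directory is one way to capture the initial part of the file for reference.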

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58561 - Posted: 25 Mar 2022 | 8:33:38 UTC
Last modified: 25 Mar 2022 | 8:34:20 UTC

I have also changed the approach a bit.

I have just sent a batch of short tasks much more similar to those in PythonGPU. If these work fine, I will slowly introduce changes to see what was the problem.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58562 - Posted: 25 Mar 2022 | 9:03:09 UTC - in response to Message 58561.

I've grabbed one. Will run within the hour.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58563 - Posted: 25 Mar 2022 | 9:20:46 UTC - in response to Message 58561.
Last modified: 25 Mar 2022 | 9:27:47 UTC

I sent 2 batches,

ABOU_rnd_ppod_avoid_cnn_testing

and

ABOU_rnd_ppod_avoid_cnn_testing2

Unfortunately the first batch will crash. I detected one bug already which I have fixed in the second one. Seems like you got at least one in the second batch ( e1a18-ABOU_rnd_ppod_avoid_cnn_testing2). Running it will give us the info we need.

On the bright side, the fix with 7z.exe seems to work in all machines so far.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58564 - Posted: 25 Mar 2022 | 9:52:52 UTC - in response to Message 58563.

Yes, I got the testing2. It's been running for about 23 minutes now, but I'm seeing the same as yesterday - nothing written to stderr.txt since:

09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
09:32:39 (51456): /bin/tar exited; CPU time 192.380796
09:32:39 (51456): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Define rollouts storage
Define scheme

and machine usage shows



(full-screen version of that at https://i.imgur.com/Ly9Aabd.png)

I've preserved the control information for that task, and I'll try to re-run it interactively in terminal later today - you can sometimes catch additional error messages that way.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58565 - Posted: 25 Mar 2022 | 10:06:50 UTC - in response to Message 58564.

Ok thanks a lot. Maybe then it is not the python script but some of the dependencies.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58566 - Posted: 25 Mar 2022 | 10:27:08 UTC - in response to Message 58565.

OK, I've aborted that task to get my GPU back. I'll see what I can pick out of the preserved entrails, and let you know.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58568 - Posted: 25 Mar 2022 | 18:13:58 UTC

Sorry, ebcak. I copied all the files, but when I came to work on them, several turned out to be BOINC softlinks back to the project directory, where the original file had been deleted. So the fine detail had gone.

Memo to self - don't try to operate dangerous machinery too early in the morning.
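
For anyone else trying to preserve a slot directory: a sketch of how a link stub could be spotted before the project file disappears. It assumes the stubs are small text files of the form <soft_link>path</soft_link>; that format, and the helper name soft_link_target, are assumptions here rather than something shown in this thread:

```python
import re

# Assumed stub format: <soft_link>relative/path/to/real/file</soft_link>
_SOFT_LINK = re.compile(r"<soft_link>(.*?)</soft_link>", re.DOTALL)


def soft_link_target(text):
    """Return the path inside a soft-link stub, or None for ordinary content."""
    match = _SOFT_LINK.search(text)
    return match.group(1).strip() if match else None
```

If the function returns a path, the real file at that location is the one that needs copying, not the stub.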

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,773,367,558
RAC: 159,487
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58569 - Posted: 27 Mar 2022 | 15:49:31 UTC

The past several tasks have gotten stuck at 50% for me as well. Today one has made it past that, to 57.7% after 8 hours. 1-2% GPU util on 3070Ti. 2.5 CPU threads per BOINCTasks. 3063 MB memory per nvidia-smi and 4.4 GB per BOINCTasks.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58571 - Posted: 28 Mar 2022 | 16:09:03 UTC
Last modified: 28 Mar 2022 | 17:15:17 UTC

I updated the app. Tested it locally and works fine on Linux.

I sent a batch of test jobs (ABOU_rnd_ppod_avoid_cnn_testing3), which I have seen executed successfully in at least 1 Linux machine so far.

One way to check if the job is actually progressing is to look for a directory called "monitor_logs/train" in the BOINC directory where the job is being executed. If logs are being written to the files inside this folder, it means the task is progressing.
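
The check described above could be scripted roughly like this. It is only a sketch: the function names and the 10 minute threshold are assumptions, and the slot directory has to be supplied by the user:

```python
import os
import time


def newest_mtime(directory):
    """Return the latest modification time of any file under `directory`, or None."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def is_progressing(slot_dir, max_age_s=600, now=None):
    """True if a file under <slot_dir>/monitor_logs/train changed recently."""
    train_dir = os.path.join(slot_dir, "monitor_logs", "train")
    newest = newest_mtime(train_dir)
    if newest is None:
        return False  # directory missing or empty: no sign of progress
    now = time.time() if now is None else now
    return (now - newest) <= max_age_s
```

For example, is_progressing("/path/to/BOINC/slots/5") would return False for a task whose log files have not been touched in the last ten minutes.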
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58572 - Posted: 28 Mar 2022 | 17:20:54 UTC - in response to Message 58571.

Got a couple on one of my Windows 7 machines. The first - task 32875836 - completed successfully, the second is running now.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58573 - Posted: 28 Mar 2022 | 18:01:06 UTC - in response to Message 58572.

nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others...
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58574 - Posted: 28 Mar 2022 | 18:55:31 UTC - in response to Message 58573.

nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others...

Worse is to follow, I'm afraid. task 32875988 started immediately after the first one (same machine, but a different slot directory), but seems to have got stuck.

I now seem to have two separate slot directories:

Slot 0, where the original task ran. It has 31 items (3 folders, 28 files) at the top level, but the folder properties says the total (presumably expanding the site-packages) is 49 folders, 257 files, 3.62 GB

Slot 5, allocated to the new task. It has 93 items at the top level (12 folders, including monitor_logs, and the rest files). This one looks the same as the first one did, while it was actively running the first task. This one has 14 files in the train directory - I think the first only had 4. This slot also has a stderr file, which ends with multiple repetitions of

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\__init__.py", line 1, in <module>
from pytorchrl.agent.env.vec_env import VecEnv
File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\vec_env.py", line 1, in <module>
import torch
File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.
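
A quick triage sketch for saved stderr text, using only failure strings that have been quoted in this thread. The labels and the function name triage are made up for illustration:

```python
# Substrings quoted in this thread, mapped to short (hypothetical) labels.
SIGNATURES = [
    ("WinError 1455", "paging file too small"),
    ("Disk usage limit exceeded", "disk limit"),
    ("tar.exe missing", "no tar on host"),
]


def triage(stderr_text):
    """Return the labels of all known failure signatures found in `stderr_text`."""
    return [label for needle, label in SIGNATURES if needle in stderr_text]
```

An empty result for a task that has stopped moving would point towards the silent stall rather than one of the explicit errors.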

I'm going to try variations on a theme of
- clear the old slot manually
- pause and restart the task
- stop and restart BOINC
- stop and restart Windows

I'll report back what works and what doesn't.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58575 - Posted: 28 Mar 2022 | 19:36:32 UTC

Well, that was interesting. The files in slot 0 couldn't be deleted - they were locked by a running app 'python' - which is presumably why BOINC hadn't cleaned the folder when the first task finished.

So I stopped the second task, and used Windows Task Manager to see what was running. Sure enough, there was still a Python image, and I still couldn't delete the old files. So I force-stopped that python image, and then I could - and did - delete them.

I restarted the second task, but nothing much happened. The wrapper app posted in stderr that it was restarting python, but nothing else.

So then I restarted BOINC, and all hell broke loose. In quick succession, I got



Then windows crashed a browser tab and two Einstein@Home tasks on the other GPU.

When I'd closed the Python app from the Windows error box, the BOINC task closed cleanly, uploaded some files, and reported a successful finish. It even validated!

Things all seem to be running quietly now, so I think I'll leave this machine alone for a while and think. At the moment, the take-home theory is that the whole sequence was triggered by the failure of the python app to close at the end of the first task's run. That might be the next thing to look at.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58576 - Posted: 28 Mar 2022 | 20:38:27 UTC

Well this beta WU was a weird one:

https://www.gpugrid.net/workunit.php?wuid=27211744

It ran to 50% completion and hung there for 3.5 days so I aborted it. Boinc properties showed it running slot 10 except slot 10 was empty. Top (Fedora35) showed no activity with any GPUGrid WU. Some wrapper or something must have been kept alive and running in the background when the WU quit because the ET counter was incrementing time normally.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58577 - Posted: 29 Mar 2022 | 7:52:03 UTC
Last modified: 29 Mar 2022 | 8:01:53 UTC

Interesting that sometimes jobs work and sometimes get stuck in the same machine.

It also seems to me, based on your info, that something remains running at the end of the job and causes the next job to get stuck. Presumably some python thread.

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.
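
One possible shape for that end-of-task cleanup, as a sketch only. It assumes the workers were started through Python's multiprocessing module (processes launched any other way would not be seen), and the name terminate_children and the 5 second grace period are assumptions:

```python
import multiprocessing


def terminate_children(children=None, grace_s=5.0):
    """Terminate leftover child processes and return how many were stopped.

    `children` defaults to multiprocessing.active_children(), which only
    lists processes started via multiprocessing in this interpreter.
    """
    if children is None:
        children = multiprocessing.active_children()
    stopped = 0
    for child in children:
        if not child.is_alive():
            continue
        child.terminate()       # polite request first
        child.join(grace_s)
        if child.is_alive():
            child.kill()        # force it if the request was ignored
            child.join(grace_s)
        stopped += 1
    return stopped
```

Calling this as the last step of the main script (or registering it with atexit) would be one way to make sure no worker outlives the task.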

Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn in a Linux machine and to Richard in Windows.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58578 - Posted: 29 Mar 2022 | 8:45:27 UTC

I've just had task 32876361 fail on a different, but identical, Windows machine. This time, it seems to be explicitly, and simply, a "not enough memory" error - these machines only have 8 GB, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 486
Credit: 11,541,308,617
RAC: 4,382,658
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58581 - Posted: 29 Mar 2022 | 20:42:53 UTC

Another "Disk usage limit exceeded" error:

https://www.gpugrid.net/result.php?resultid=32876568

And a successful one yesterday:

https://www.gpugrid.net/result.php?resultid=32876288


roundup
Send message
Joined: 11 May 10
Posts: 65
Credit: 10,319,928,875
RAC: 4,085,480
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58582 - Posted: 30 Mar 2022 | 14:07:18 UTC
Last modified: 30 Mar 2022 | 14:11:23 UTC

After having some errors with recent python app betas, task 32876819 ran without error on a RTX3070 Mobile under Win 11.
A few observations:
- GPU load only between 4% and 8% with a peak between 50% and 70% every 12 seconds.
- The indicated time remaining in the BOINC Client was way off. It started with >7000 (seven thousand) days.
- 15.000 BOINC credits for 102,296 sec runtime. I assume that will be corrected once the python app goes into production. EDIT: This runtime indicated on the GPUGrid site is not correct, it was actually less.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 4,282,730,025
RAC: 5,947,065
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58588 - Posted: 31 Mar 2022 | 17:33:27 UTC

These tasks seem to run much better on my machines if I allocate 6 CPU's (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPU's (threads). After the initiation phase, the CPU usage was in an oscillating pattern that was between ~2 and ~5 threads. Task ran very quickly and has been validated. Please let me know if you have questions.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58590 - Posted: 1 Apr 2022 | 8:59:15 UTC - in response to Message 58582.

Thanks a lot for the feedback:

- cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct.

- Incorrect time remaining prediction is an issue... it will only be fixed with time, once the tasks become stable in duration. It may even be necessary to create a new app and use this one only for debugging.

- Also credits will be corrected yes, for now we will have something similar to what we have in the PythonGPU app.
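
To picture the cyclical load from the first point: an on-policy training loop alternates between collecting experience and updating the network. The schematic below is not the project's code; the name training_schedule and the "cpu"/"gpu" tags are only labels for where most of the work is assumed to happen in each phase:

```python
def training_schedule(num_updates, steps_per_rollout):
    """Yield (phase, device, steps) tuples for a schematic on-policy loop.

    "cpu" stands for stepping the simulated environments,
    "gpu" for the gradient update on the collected batch.
    """
    for _ in range(num_updates):
        # Rollout phase: environments step on CPU, GPU is mostly idle.
        yield ("collect", "cpu", steps_per_rollout)
        # Update phase: one burst of GPU work on the whole batch.
        yield ("update", "gpu", steps_per_rollout)
```

Each pass through the loop gives one stretch of CPU-heavy collection followed by one short GPU burst, which is the kind of periodic peak reported above.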

Starting today I will start sending longer jobs, instead of the super short test jobs I was using just to test the code was working in all OS's and machines.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58591 - Posted: 1 Apr 2022 | 9:05:50 UTC - in response to Message 58588.

Last batches seem to be working successfully both in Linux and Windows, and also for GPUs with cuda 10 and cuda 11.

My main worry now is whether or not the problem of some jobs getting "stuck" and never being completed persists. It was reported that the reason was that the Python process was not finishing correctly between jobs, so I added a few changes in the code to try to solve this issue.

Please let me know if you detect this problem in one of your tasks, that would be very helpful!

Incidentally, once the PythonGPUBeta app is stable enough, it will replace the current PythonGPU app, which only works for Linux.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58592 - Posted: 1 Apr 2022 | 9:18:50 UTC - in response to Message 58591.

It was reported that the reason was that the Python was not finishing correctly between jobs so I added a few changes in the code to try to solve this issue.

Well, that was one report of one task on one machine with limited memory. It seemed to be a case that, if it happened, caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing - great. But I'd be cautious about assuming that it was the problem in all cases.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58593 - Posted: 1 Apr 2022 | 23:39:38 UTC

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.

Another observation in that this problem does not seem to be OS-dependant, since it happened to STARBASEn in a Linux machine and to Richard in Windows.


I haven't gotten a new beta yet so I will shut off all GPU work with other projects to hopefully get some and help resolve this issue.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58594 - Posted: 1 Apr 2022 | 23:50:44 UTC

One other afterthought re that WU. I had checked my status page here prior to aborting the task. It indicated the task was still in progress so no disposition of the files that I am presuming were sent back sometime in the past (since the slot was empty) was assigned to it. Wonder where they went?

Short Final
Send message
Joined: 26 May 20
Posts: 4
Credit: 185,192,781
RAC: 109,816
Level
Ile
Scientific publications
wat
Message 58597 - Posted: 4 Apr 2022 | 10:56:50 UTC

Can anybody explain credits policy please.
My CPU's running Python app relentlessly for up to 7 days for only 50,000 credits. Yet have received 360,000 credits for the ACEMD 3 after only 42,000 secs (11.6 hrs). Bit skewiff.. see below:

https://www.gpugrid.net/results.php?userid=562496


Task  Work unit  Computer  Sent  Time reported or deadline  Status  Run time (sec)  CPU time (sec)  Credit  Application
32877811  27214361  590351  1 Apr 2022 | 9:34:34 UTC  3 Apr 2022 | 9:57:48 UTC  Completed and validated  309,332.50  309,332.50  50,000.00  Python apps for GPU hosts beta v1.10 (cuda1131)
32877804  27214354  581235  1 Apr 2022 | 9:38:33 UTC  3 Apr 2022 | 19:38:13 UTC  Completed and validated  628,304.20  628,304.20  50,000.00  Python apps for GPU hosts beta v1.10 (cuda1131)
32876508  27207895  581235  29 Mar 2022 | 9:50:08 UTC  1 Apr 2022 | 4:52:45 UTC  Completed and validated  101,951.50  100,984.90  360,000.00  ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32876455  27213533  581235  29 Mar 2022 | 9:17:17 UTC  29 Mar 2022 | 9:49:31 UTC  Completed and validated  12,109.13  12,109.13  3,000.00  Python apps for GPU hosts beta v1.09 (cuda1131)
32876341  27213457  590351  29 Mar 2022 | 4:33:52 UTC  31 Mar 2022 | 6:41:54 UTC  Completed and validated  42,830.17  41,435.17  360,000.00  ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32875459  27212897  581235  27 Mar 2022 | 2:32:46 UTC  29 Mar 2022 | 9:06:58 UTC  Completed and validated  96,228.49  95,544.64  360,000.00  ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)

PS: How do I paste a neat image of the above??

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58598 - Posted: 4 Apr 2022 | 13:12:34 UTC - in response to Message 58597.

Please note that other users can't see your entire task list by userid - that's a privacy policy common to all BOINC projects.

The ones you're worried about seem to be Results for host 581235

The one you're specifically asking about - the Python GPU beta v1.10 - was issued on Friday morning and returned on Sunday evening: it was only on your machine for about 58 hours. The run time of 628,304 seconds is misleading (a duplicate of the CPU time) and an error on this website.
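
The arithmetic behind that can be checked from the two timestamps shown for the task. A small sketch (the function name seconds_on_host is made up, and the format string simply mirrors how this site prints dates):

```python
from datetime import datetime

STAMP = "%d %b %Y | %H:%M:%S UTC"  # timestamp layout used on the site


def seconds_on_host(sent, reported):
    """Wall-clock seconds between the 'Sent' and 'Time reported' columns."""
    delta = datetime.strptime(reported, STAMP) - datetime.strptime(sent, STAMP)
    return int(delta.total_seconds())


elapsed = seconds_on_host("1 Apr 2022 | 9:38:33 UTC", "3 Apr 2022 | 19:38:13 UTC")
print(elapsed, round(elapsed / 3600, 1))
```

That comes to 208,780 seconds, about 58 hours, so a reported run time of 628,304 seconds cannot be wall-clock time on the host.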

Runtime and credit are still being adjusted, and errors are a common feature of beta testing. Sometimes you win, others (like this one) you lose. I'm sure your comments will be noted before testing is complete.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58599 - Posted: 4 Apr 2022 | 18:32:01 UTC
Last modified: 4 Apr 2022 | 18:33:58 UTC

For some reason I haven't been able to snag any of the Python beta tasks lately.

Just the old stock Python tasks.

A couple of them failed at 30 minutes, with no progress downloading the Python environment after 1800 seconds.

One of the reasons I would like to get the new beta tasks that overcome that issue.

Also found a task at 5 hours and counting at 100% completion and not reporting. Suspended the task and resumed in the hope that would nudge it to report but it just restarted at 10% progress.

[Edit] Looks like the suspend/resume was the trick after all. Uploading now.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58600 - Posted: 5 Apr 2022 | 7:54:43 UTC - in response to Message 58597.

The credits system is proportional to the amount of compute required to complete each task, like in acemd3.

In acemd3, it is proportional to the complexity of the simulation. In python tasks, which train artificial intelligence reinforcement learning agents, it is proportional to the amount of interactions between the agent and its simulated environment required for the agent to learn how to behave in it.

At the moment, we give 2000 credits per 1M interactions, and most tasks require 25M training interactions (except test tasks, which are shorter, normally just 1M). Therefore, completing a task gives 50000 credits, and 75000 if completed especially fast.
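
As a worked example of those numbers (a sketch, not the server's actual credit code): the rate of 2000 credits per 1M interactions is taken from above, while the 1.5x factor for fast completion is an assumption inferred from 75000 / 50000, and the name task_credits is hypothetical.

```python
CREDITS_PER_MILLION = 2000  # stated rate per 1M interactions
FAST_BONUS = 1.5            # assumption: inferred from 75000 / 50000


def task_credits(interactions, fast=False):
    """Credits for a task needing `interactions` agent-environment steps."""
    base = CREDITS_PER_MILLION * interactions / 1_000_000
    return base * FAST_BONUS if fast else base
```

Under these assumptions a 25M task gives 50000 (75000 if fast), and a 1M test task gives 2000, or 3000 with the bonus, which would be consistent with the 3,000 credits shown for the short v1.09 task in the list quoted earlier.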

Note that we are in beta phase, and while the credit difference between acemd and pythonGPU jobs should not be huge, we might need to adjust the credits given per 1M interactions to make them equivalent.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58601 - Posted: 5 Apr 2022 | 8:04:42 UTC - in response to Message 58599.
Last modified: 5 Apr 2022 | 8:04:59 UTC

Batches of both pythonGPU and pythonGPUBeta are being sent out this week. Hopefully pythonGPUBeta task will run without issues.

We want to wait a bit more in case more bugs are detected, but we will soon update the pythonGPU app with the code from PythonGPUBeta, which seems to work well now. As mentioned, it does not have the problem of installing conda every time (instead it downloads the packed environment only the first time). It also works for Linux and Windows.

At that point we will keep PythonGPUBeta only for testing.
____________

bcavnaugh
Send message
Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 58602 - Posted: 5 Apr 2022 | 16:52:21 UTC
Last modified: 5 Apr 2022 | 16:53:43 UTC

So far some run well while others ran for 2 and 3 days.
I did abort the ones that are still running after 3 days.
I will pick back up in the Fall and I hope to see good running tasks on my GPU's.
For now I am waiting for new 3 & 4 on two of my hosts, it is a real bummer that our hosts have to sit for days on end without getting any tasks.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58603 - Posted: 5 Apr 2022 | 17:38:10 UTC

Looks like the standard BOINC mechanism of complain in a post on the forums on some topic and the BOINC genies grant your wish.

Been getting nothing but solid Python beta tasks now for the past couple of days.

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,510,442,777
RAC: 762,654
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58604 - Posted: 5 Apr 2022 | 18:04:48 UTC

I have serious problems with my other machine running 1080Ti.
So far, of 20 tasks over the past 2 weeks, the best one has run around 38 secs before erroring.
I tried to underpower + underclock core and mem, still same result around same time.
This one is result of last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong in newer drivers on nvidia?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900x and 5900x), is the gfx driver version.
Machine that runs tasks has driver 496.49.
Machine that fails tasks has driver 511.79.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58605 - Posted: 5 Apr 2022 | 19:06:38 UTC - in response to Message 58604.

I have serious problems with my other machine, which runs a 1080Ti.
So far, out of 20 tasks over the past 2 weeks, the best one ran for around 38 seconds before erroring.
I tried to underpower and underclock the core and memory, but got the same result at around the same time.
This is the output of the last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong with the newer Nvidia drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X and 5900X), is the graphics driver version.
Machine that runs tasks has driver 496.49.
Machine that fails tasks has driver 511.79.


you can try changing the driver back and see? easy troubleshooting step. It's definitely possible to be the driver.

but you seem to be having an issue with the ACEMD3 tasks, this thread is about the Python tasks.

____________

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,510,442,777
RAC: 762,654
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58606 - Posted: 5 Apr 2022 | 21:38:04 UTC - in response to Message 58605.

Sorry for posting wrong thread.
Changed drivers to 496.49 on the other machine too... now I just have to wait for some work to see whether it works.

Personally, when new things were coming, I was really hoping that this project would ditch CUDA at last and move to OpenCL.

No project that I have crunched on OpenCL has had extended issues like this. And most of those projects run on AMD cards too.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58607 - Posted: 6 Apr 2022 | 0:41:00 UTC - in response to Message 58606.

I've had no problems with their CUDA ACEMD3 app. it's been very stable across many data sets. all of the issues raised in this thread are in regards to the Python app that's still in testing/beta. problems are to be expected.

CUDA outperforms OpenCL. even with identical code (as much as it can be), there is always the added overhead of needing to compile the opencl code at runtime, whereas CUDA runs natively on Nvidia. most projects run opencl because it lets them more easily port the code to different devices, expanding their user base at the expense of some performance overhead.

there have been many problems with the 500+ series drivers though. if you still have issues with the older drivers then it's something else wrong with your setup. if you didn't totally purge the old drivers with DDU from Safe Mode and re-install from a fresh nvidia package, that's a good first step. sometimes driver corruption can linger across many driver removals and upgrades and it needs to be more forcefully removed.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58608 - Posted: 6 Apr 2022 | 5:23:39 UTC - in response to Message 58602.

bcavnaugh wrote:

... For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks.

you say it, indeed :-(
Obviously, ACEMD has very low priority at GPUGRID these days :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58609 - Posted: 7 Apr 2022 | 19:23:48 UTC

Beta is still having issues with establishing the correct Python environment.

Threw away around 27 tasks today with errors because of:

TypeError: object of type 'int' has no len()

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58613 - Posted: 8 Apr 2022 | 9:51:42 UTC - in response to Message 58609.

thanks, this is solved now. A new batch is running without this issue.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58614 - Posted: 8 Apr 2022 | 14:43:17 UTC

There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58615 - Posted: 8 Apr 2022 | 17:04:36 UTC

Yes, I am still getting the bad work unit resends.

Too bad they couldn't be purged before hitting the _9 timeout.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58616 - Posted: 11 Apr 2022 | 10:26:59 UTC

New tasks today.

But: "ModuleNotFoundError: No module named 'yaml'"

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58617 - Posted: 11 Apr 2022 | 16:01:42 UTC

Same here today.

Azmodes
Send message
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Level
Met
Scientific publications
watwatwat
Message 58618 - Posted: 11 Apr 2022 | 19:04:45 UTC

Same.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58619 - Posted: 12 Apr 2022 | 8:59:28 UTC
Last modified: 12 Apr 2022 | 9:01:22 UTC

Thanks for the feedback. I will look into it today.

In which OS?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58621 - Posted: 12 Apr 2022 | 9:08:46 UTC - in response to Message 58619.

In which OS?

These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58622 - Posted: 12 Apr 2022 | 9:35:11 UTC - in response to Message 58621.
Last modified: 12 Apr 2022 | 9:36:30 UTC

Right, I just saw it while browsing through the failed jobs. It seems the problem is in the PythonGPU app, not in PythonGPUBeta.

This is what I think happened: since in PythonGPU the conda environment is created every time, it could be that some of the dependencies of one or more required packages have changed recently. Therefore, the yaml package was not installed in the environments and was missing during execution.

This is one more reason to switch to the new approach (currently beta). The conda environment is created, packed and sent to the volunteer machine when executing the first job. There, the environment is simply unpacked and there is no need to send a new one unless some fix is required.

We will move the PythonGPUBeta app to PythonGPU. PythonGPUBeta is now quite stable, and its approach avoids this kind of problem. I expect we can do it today, but I will post to confirm it.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58624 - Posted: 12 Apr 2022 | 14:41:23 UTC
Last modified: 12 Apr 2022 | 15:49:53 UTC

The current version of PythonGPUBeta has been copied to PythonGPU

Seems like the task DISK_LIMIT needs to be increased, I have seen some EXIT_DISK_LIMIT_EXCEEDED errors. We will adjust it.
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58625 - Posted: 12 Apr 2022 | 16:48:14 UTC

Well this is interesting to read.
Over at RAH they are using Python (cpu) and they are memory and disk space hogs.
I suggest once you get your GPU tasks working you make a FAQ on minimum memory and disk space needed to run these tasks.

One CPU task uses between 7.8 GB (compressed) and 8.4 GB (actual) of space on the drive.
Memory wise it uses 2861MB of physical ram and 55 to 58 MB of virtual.
If your tasks for GPU are anything like these...well we will need a bit of free space.

Looking forward to reading about your success getting python running on GPU.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58634 - Posted: 13 Apr 2022 | 7:23:26 UTC - in response to Message 58625.
Last modified: 13 Apr 2022 | 7:42:40 UTC

The sizes of all the app files (including the compressed environment) are:

2.0G for windows with cuda102
2.7G for windows with cuda1131
1.8G for linux with cuda102
2.6G for linux with cuda1131

The additional task specific data goes from a few KB to a few MB. I did not expect 7.8G compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?

Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent environment happen in CPU, learning happens in GPU.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58635 - Posted: 13 Apr 2022 | 9:10:08 UTC

Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02).

If that is not the case, there is probably some problem. It should be automatically used, but if that is not the case resetting the app should help.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58636 - Posted: 13 Apr 2022 | 9:46:58 UTC - in response to Message 58635.
Last modified: 13 Apr 2022 | 9:47:41 UTC

I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app.

Small point: can you apply a "weight" to the sub-tasks in job.xml, please? At the moment, the 'decompress' stage is estimated to take 50% of the runtime under Linux, and 66% under Windows. That throws out the estimate for the rest of the run.

Under Linux, my slot directory is occupying 9.8 GB, against an allowed limit of 10,000,000,000 bytes: that's tight, especially when you consider the divergence of binary and decimal representations for bigger files.
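
To make the binary/decimal point concrete, here is a minimal sketch. It assumes the "9.8 GB" figure could have been reported in either decimal (10^9 bytes) or binary (2^30 bytes) units; the function names are illustrative only and are not part of BOINC.

```python
DISK_LIMIT_BYTES = 10_000_000_000  # the allowed limit quoted above

def to_decimal_gb(n_bytes: int) -> float:
    """Size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1000**3

def to_binary_gib(n_bytes: int) -> float:
    """Size in binary gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 1024**3

def headroom_bytes(reported_size: float, binary_units: bool) -> float:
    """Bytes left under the limit for a size reported in GB or GiB."""
    unit = 1024**3 if binary_units else 1000**3
    return DISK_LIMIT_BYTES - reported_size * unit

if __name__ == "__main__":
    print(f"limit = {to_decimal_gb(DISK_LIMIT_BYTES):.2f} GB (decimal)")
    print(f"limit = {to_binary_gib(DISK_LIMIT_BYTES):.2f} GiB (binary)")
    # If 9.8 was measured in decimal units there is a little room left;
    # if it was measured in binary units the directory is already over.
    print(f"headroom if decimal: {headroom_bytes(9.8, False):,.0f} bytes")
    print(f"headroom if binary:  {headroom_bytes(9.8, True):,.0f} bytes")
```

Under the decimal reading there are about 200 MB to spare; under the binary reading the directory is already past the limit.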

All my predecessors for this workunit were running Windows. Three failed on disk limits, and one on memory limits. If every Windows version is using the 7-zip decompressor, there's the extra 'de-archived, but still compressed' step to allow for in the disk limit.

Still awaiting the final hurdle - the upload file size limit. In about 4 hours' time, I reckon - currently at 85% after 10 hours.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58637 - Posted: 13 Apr 2022 | 10:09:20 UTC - in response to Message 58636.
Last modified: 13 Apr 2022 | 15:03:42 UTC

Thanks a lot for the info Richard!

You are right, I should adjust the weights of the subtasks in job.xml to 10% for 'decompress' and 90% for executing the python script. That may also explain why jobs were getting stuck at 50% when python was not closed properly between jobs: the new job could decompress the environment (50%), but the python script could not be executed.
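
A rough sketch of how the weights affect the reported progress, assuming the wrapper reports fraction done in proportion to each subtask's weight. The function name and the three-subtask layout for Windows are assumptions for illustration (the latter is only a guess at what produces 66%), not taken from the actual job.xml.

```python
def fraction_done(weights, completed, current_progress=0.0):
    """Overall progress when `completed` subtasks have finished and the
    next one is `current_progress` (0..1) of the way through."""
    total = sum(weights)
    done = sum(weights[:completed])
    if completed < len(weights):
        done += weights[completed] * current_progress
    return done / total

if __name__ == "__main__":
    # Equal weights: decompress + python script (Linux-like layout).
    print(fraction_done([1, 1], completed=1))     # 0.5
    # Equal weights with two unpacking steps (assumed Windows layout).
    print(fraction_done([1, 1, 1], completed=2))  # about 0.667
    # Proposed 10% / 90% split.
    print(fraction_done([10, 90], completed=1))   # 0.1
    # Suggested 1% / 99% split.
    print(fraction_done([1, 99], completed=1))    # 0.01
```

With equal weights, a task that has only finished unpacking already shows 50% (or 66%), which matches the stuck-at-50% behaviour described above.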

I have increased the allowed limit to 30,000,000,000 bytes. This should affect all new jobs (to be confirmed) and should solve the DISK LIMIT problems.

Finally, I was also thinking about sending the compressed environment as a tar.bz2 file instead of a tar.gz to make it smaller. But I have to test that 7-zip handles it correctly.
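
A small sketch of how the two formats could be compared before deciding, using only the Python standard library. The payload here is made-up sample data, so the sizes it prints say nothing about the real conda environment; the helper name is hypothetical.

```python
import io
import tarfile

def tar_size(payload: bytes, mode: str) -> int:
    """Size in bytes of an in-memory tar archive holding `payload`,
    written with the given tarfile mode ("w:gz" or "w:bz2")."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        info = tarfile.TarInfo(name="payload.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return len(buf.getvalue())

if __name__ == "__main__":
    # Sample data only; a real test would archive the packed environment.
    sample = b"import numpy as np\nprint(np.zeros(3))\n" * 5000
    print("raw     :", len(sample))
    print("tar.gz  :", tar_size(sample, "w:gz"))
    print("tar.bz2 :", tar_size(sample, "w:bz2"))
```

Whether 7-zip on the Windows hosts unpacks the result correctly would still need to be tested separately.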

Probably will deploy these changes first in PythonGPUBeta, that is what it is for
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58638 - Posted: 13 Apr 2022 | 11:25:32 UTC - in response to Message 58637.

I'd say 1%::99%, but thanks.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58639 - Posted: 13 Apr 2022 | 13:59:28 UTC

Uploaded and reported with no problem at all.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58640 - Posted: 13 Apr 2022 | 15:16:35 UTC - in response to Message 58639.

has the allowed limit changed to 30,000,000,000 bytes?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58641 - Posted: 13 Apr 2022 | 16:19:28 UTC

Appears so.

<rsc_disk_bound>30000000000.000000</rsc_disk_bound>

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58642 - Posted: 13 Apr 2022 | 19:28:33 UTC - in response to Message 58634.
Last modified: 13 Apr 2022 | 19:30:53 UTC

The sizes of all the app files (including the compressed environment) are:

2.0G for windows with cuda102
2.7G for windows with cuda1131
1.8G for linux with cuda102
2.6G for linux with cuda1131

The additional task specific data goes from a few KB to a few MB. I did not expect 7.8G compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?

Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent environment happen in CPU, learning happens in GPU.



Note: I was commenting on Rosetta at home CPU pythons.
What yours do, I don't know. I guess I had better add your project and see what happens.

I readded your project to my system, so if I am home when a task is sent out, I'll have a look.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58643 - Posted: 14 Apr 2022 | 7:36:34 UTC - in response to Message 58642.

Thank you!

I have added the subtask weights to the PythonGPUbeta app. Currently testing it with a small batch of tasks.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58644 - Posted: 14 Apr 2022 | 8:42:41 UTC
Last modified: 14 Apr 2022 | 9:20:16 UTC

Testing was successful, so we can add the weights to the PythonGPU app job.xml file
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58655 - Posted: 15 Apr 2022 | 21:20:06 UTC

abouh,

can you have a look at my comments in a thread I created?
The 4.0 task was not increasing in percentage done after I watched it for 10 minutes. Time to completion kept jumping around, 1 second up, 1 second down.
40 minutes of difference between run time and CPU time? That's a hell of a lot of setup time!

Here are the local host task details
Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898
State Running
Received 4/15/2022 12:06:46 PM
Report deadline 4/20/2022 12:06:46 PM
Estimated app speed 53.74 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (GTX 1050)
CPU time at last checkpoint 06:44:35
CPU time 06:47:39
Elapsed time 06:05:04
Estimated time remaining 198d,09:49:25
Fraction done 7.880%
Virtual memory size 7,230.02 MB
Working set size 2,057.87 MB
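
For what it is worth, the huge remaining-time figure can be roughly reproduced from the numbers in the list above with a very simple model: total time = estimated task size / estimated app speed, scaled by the fraction still to do. This is only a sketch under that assumption, not the BOINC client's actual estimation code.

```python
def estimated_remaining_seconds(task_size_gflops: float,
                                app_speed_gflops_per_sec: float,
                                fraction_done: float) -> float:
    """Naive remaining-time estimate from task size and app speed."""
    total_seconds = task_size_gflops / app_speed_gflops_per_sec
    return total_seconds * (1.0 - fraction_done)

def seconds_to_days(seconds: float) -> float:
    return seconds / 86_400

if __name__ == "__main__":
    remaining = estimated_remaining_seconds(
        task_size_gflops=1_000_000_000,   # "Estimated task size"
        app_speed_gflops_per_sec=53.74,   # "Estimated app speed"
        fraction_done=0.0788,             # "Fraction done 7.880%"
    )
    print(f"{seconds_to_days(remaining):.1f} days")
```

This lands at roughly 198 days, close to the estimate shown, which suggests the estimate is driven by the nominal task size and app speed rather than by the real progress of the run.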

mikey
Send message
Joined: 2 Jan 09
Posts: 298
Credit: 6,775,345,736
RAC: 2,662,736
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58666 - Posted: 17 Apr 2022 | 20:16:19 UTC - in response to Message 58652.

You can delete the previous post about ACEMD3. I posted that incorrectly here.


Some forums let you put a double space or a double period to delete your own post, but you must still do it within the editing time

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58669 - Posted: 18 Apr 2022 | 12:27:00 UTC - in response to Message 58666.

Mikey, I know. But the time limit expired on that post to edit it. I came back days later not within the 30-60 minutes allowed.

Werinbert
Send message
Joined: 12 May 13
Posts: 5
Credit: 100,032,540
RAC: 0
Level
Cys
Scientific publications
wat
Message 58672 - Posted: 18 Apr 2022 | 19:31:43 UTC

I am now running a Python task. It has a very low usage of my GPU most often around 5 to 10%, occasionally getting up to 20%. Is this normal? Should I wait until I move my GPU from an old 3770K to a 12500 computer for better CPU capabilities to do these tasks?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58673 - Posted: 18 Apr 2022 | 23:12:34 UTC - in response to Message 58672.

This is normal for Python on GPU tasks. The tasks run on both the cpu and gpu during parts of the computation for the inferencing and machine learning segments.

Read the posts by the admin developer explaining what the process involves.

- cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58674 - Posted: 19 Apr 2022 | 8:21:52 UTC - in response to Message 58655.
Last modified: 19 Apr 2022 | 8:24:36 UTC

Sorry for the late reply Greg _BE, I hid the ACEMD3 posts.

I checked your job e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898. Did the progress get stuck or was it just increasing slowly?

The job was finally completed by another Windows 10 host, but the CPU time is wrong because it says 668566.9 seconds.

I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58675 - Posted: 19 Apr 2022 | 8:25:47 UTC

New tasks being issued this morning, allocated to the old Linux v4.01 'Python app for GPU hosts' issued in October 2021.

All are failing with "ModuleNotFoundError: No module named 'yaml'".

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58676 - Posted: 19 Apr 2022 | 8:38:26 UTC - in response to Message 58674.

I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.

Asking for 1.00 CPUs (or above) would make a significant difference, because that would prompt the BOINC client to reduce the number of tasks being run for other projects.

It would be problematic to increase the CPU demand above 1.00, because the CPU loading is dynamic - BOINC has no provision for allowing another project to utilise the cycles available during periods when the GPUGrid app is quiescent. Normally, a GPU app is given a higher process priority for CPU usage than a pure CPU app, so the operating system should allocate resources to your advantage, but that can be problematic when the wrapper app is in use. That was changed recently: I'll look into the situation with your server version and our current client versions.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58677 - Posted: 19 Apr 2022 | 9:23:32 UTC - in response to Message 58675.
Last modified: 19 Apr 2022 | 9:24:44 UTC

Definitely only the latest version 403 should be sent. Thanks for letting us know.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58678 - Posted: 19 Apr 2022 | 12:01:04 UTC

BOINC GPU apps, wrapper apps, and process priority

The basic rule for BOINC applications (originally CPU only) has been to run applications at idle priority, to avoid interfering with foreground use of the computer.

Since the introduction of GPU apps into BOINC around 2008, the CPU portion of a GPU app has been automatically run at a slightly higher process priority (below normal) - an attempt to avoid highly-productive GPU work being throttled by competition for CPU resources.

Normally, the BOINC client manages these two different process priorities directly. But when a wrapper app is interpolated between the client and a worker app, it's the wrapper which sets the priority for the worker app. It was a user on this project who first noticed (Issue 3764 - May 2020) that the process priority of a GPU app wasn't being set correctly when it was executing under the control of a wrapper app.

Many false starts later (PRs 3826, 3948, 3988, 3999), a fully consistent set of process priority tools was developed, effective from about 25 September 2020.

But in order for these tools to be useful, compatible versions of both the BOINC client and the wrapper application have to be used. So far as I can tell, BOINC client for Windows v7.16.20 (current) is compliant; Wrapper version 26203 is compliant; but no full public release versions of the BOINC client for Linux are yet compliant (Gianfranco Costamagna's prototyping PPA client should be).

This project appears to be using wrapper code 26016 for Windows, and wrapper code 26198 for Linux. Unless these have been patched locally, neither wrapper will yet allow full process control management.

It's not urgent, but with the new Python apps running in a mixed CPU/GPU environment, it might be helpful to update the project's wrapper codebase. Fortunately, the basic server platform is unaffected by all this.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58696 - Posted: 21 Apr 2022 | 15:04:23 UTC - in response to Message 58675.

We have deprecated v4.01

Hopefully, if everything went fine, the error

All are failing with "ModuleNotFoundError: No module named 'yaml'".


should not happen any more. And all jobs should use v4.03
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58752 - Posted: 27 Apr 2022 | 18:52:49 UTC
Last modified: 27 Apr 2022 | 19:41:00 UTC

abouh,

I got another python finally.
But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?
Another observation, elapsed time vs CPU time. The two are off by about 5 hours.
4:01 vs 8:54 currently
Progress is not moving very fast. In the time it has taken me to write this it is stuck at 7.88%
Now 4:16 to 9:24 and still 7.88%!!, 15 mins and no progress? If this hasn't changed in the next hour, I am also aborting this task.
BTW, 46 checkpoints in the 4hrs of run time.

https://www.gpugrid.net/workunit.php?wuid=27219917

Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15

Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID 590211
Run time 241,306.00
CPU time 1,471.50
GeForce RTX 3080 Ti (4095MB) driver: 497.

The point of this information is:

1)I have GTX 1050 and 1080. Previous python failed with the same exit error as the first person in this python task. What is EXIT_CHILD_FAILED? Something on your end or on our end?

2) Person 2 probably aborted because of the way BOINC reads the data to determine the time. I killed my first python because it shows 160+ days to completion.




***I give up. No progress in 30 minutes since I started this post***

Computer: DESKTOP-LFM92VN
Project GPUGRID

Name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256_2

Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256
State Running
Received 4/27/2022 4:35:18 PM
Report deadline 5/2/2022 4:35:18 PM
Estimated app speed 3,171.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint 09:58:18
CPU time 10:08:59
Elapsed time 04:37:57
Estimated time remaining 161d,06:23:41
Fraction done 7.880%
Virtual memory size 6,429.20 MB
Working set size 1,072.13 MB
Directory slots/12
Process ID 16828

Debug State: 2 - Scheduler: 2

That's 4:01 to 4:38 and still at 7.88%
Checkpoints count up. CPU is 219%
This is all messed up.
I join the abort team.

------------

Something about the other task that failed with exit child.
A few extracts:

wandb: Network error (ReadTimeout), entering retry loop.

Exception in thread StatsThr:
Traceback (most recent call last):
File "D:\data\slots\13\lib\site-packages\psutil\_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'

During handling of the above exception, another exception occurred:
(followed by line this and line that, etc)

And then this:
OSError: [WinError 1455] The paging file is too small for this operation to complete

But the next person who got has this kind of setup:

CPU type AuthenticAMD
AMD Ryzen 5 5600X 6-Core Processor [Family 25 Model 33 Stepping 0]
Number of processors 12
Coprocessors NVIDIA NVIDIA GeForce RTX 3080 (4095MB) driver: 512.15
Operating System Microsoft Windows 11
x64 Edition, (10.00.22000.00

I run GTX and Win10 with a Ryzen 7 2800 and 7.16.20 BOINC

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58753 - Posted: 27 Apr 2022 | 19:35:21 UTC

But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?

Because the task was actually using a little more than two cores to process the work.

That is why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling.
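
As a quick check, the percentage is just CPU time divided by elapsed time. A minimal sketch using the CPU time (10:08:59) and elapsed time (04:37:57) from the task details posted above; the helper names are illustrative.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def cpu_percent(cpu_time: str, elapsed_time: str) -> float:
    """CPU usage as a percentage of one core over the elapsed time."""
    return 100.0 * hms_to_seconds(cpu_time) / hms_to_seconds(elapsed_time)

if __name__ == "__main__":
    print(round(cpu_percent("10:08:59", "04:37:57")))  # 219
```

Anything above 100% simply means more than one core was busy on average, here a little over two.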

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58754 - Posted: 27 Apr 2022 | 19:45:18 UTC - in response to Message 58753.
Last modified: 27 Apr 2022 | 19:46:26 UTC

But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?

Because the task was actually using a little more than two cores to process the work.

That is why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling.


Ok...interesting, but what accounts for the lack of progress in 30 mins on this task that I just killed and the exit child error and blow up on the previous Python?

I mean really...0% with 2 decimal points, 7.88 for more than 30 minutes?
I don't know of any project that can't even advance 1/100th of a percent in 30 minutes.
I've seen my share of slow tasks in other projects, but this one...wow....

And how do you go about setting just python for 3 cpu cores? That's beyond my knowledge level.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58755 - Posted: 27 Apr 2022 | 22:31:48 UTC - in response to Message 58754.

You use an app_config.xml file in the project like this:

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd4</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPUbeta</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
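
If typing the XML by hand feels error-prone, the same file can be generated with a short script. This is only a sketch using the Python standard library; the function names are made up for illustration, and the app names and usage values are simply the ones from the file above.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def build_app_config(cpu_usage_by_app: dict, gpu_usage: float = 1.0) -> ET.Element:
    """Build an <app_config> tree with one <app> entry per application."""
    root = ET.Element("app_config")
    for name, cpu_usage in cpu_usage_by_app.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        versions = ET.SubElement(app, "gpu_versions")
        ET.SubElement(versions, "gpu_usage").text = f"{gpu_usage:.1f}"
        ET.SubElement(versions, "cpu_usage").text = f"{cpu_usage:.1f}"
    return root

def to_pretty_xml(root: ET.Element) -> str:
    """Return the tree as indented XML text without an XML declaration."""
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).documentElement.toprettyxml(indent="  ")

if __name__ == "__main__":
    config = build_app_config({
        "acemd3": 1.0,
        "acemd4": 1.0,
        "PythonGPU": 3.0,
        "PythonGPUbeta": 3.0,
    })
    # Print it here; save the output as app_config.xml in the project folder.
    print(to_pretty_xml(config))
```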

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58762 - Posted: 28 Apr 2022 | 19:40:48 UTC - in response to Message 58755.

You use an app_config.xml file in the project like this:

<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPUbeta</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>


Ok thanks. I will make that file tomorrow or this weekend. Too tired to try that tonight.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58767 - Posted: 30 Apr 2022 | 21:23:31 UTC - in response to Message 58696.

We have deprecated v4.01
Hopefully, if everything went fine, the error
All are failing with "ModuleNotFoundError: No module named 'yaml'".
should not happen any more. And all jobs should use v4.03

I've recently reset the Gpugrid project on every one of my hosts, but I've still received v4.01 tasks on several of them, and they failed with the mentioned error.
Some subsequent v4.03 resends for the same tasks have eventually succeeded at other hosts.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58768 - Posted: 1 May 2022 | 1:18:29 UTC - in response to Message 58767.
Last modified: 1 May 2022 | 1:19:13 UTC

Unfortunately the admins never yanked the malformed tasks from distribution.

They will only disappear when they hit the 7th (_6) resend and it fails. Then the task will be pulled from distribution. (Too many errors (may have bug))
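To make the "7th (_6)" numbering concrete, here is a small Python sketch. It assumes result names end in an underscore plus a zero-based resend index, and that the limit is 7 attempts as described above; the function names and the example task names are made up for illustration.

```python
MAX_ATTEMPTS = 7  # limit described in the post above, taken as an assumption


def attempt_number(result_name: str) -> int:
    """Return the 1-based attempt number from a name ending in _N (N counts from 0)."""
    suffix = result_name.rsplit("_", 1)[1]
    return int(suffix) + 1


def is_final_attempt(result_name: str) -> bool:
    """True when a failure of this result would exhaust the assumed limit."""
    return attempt_number(result_name) >= MAX_ATTEMPTS


# Hypothetical task names, for illustration only.
for name in ("example-task_0", "example-task_6"):
    print(name, attempt_number(name), is_final_attempt(name))
```

So a result whose name ends in _6 is the seventh and last copy that would be sent out.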

I've had a lot of the bad Python 4.01 tasks also but thankfully a lot of them were at the tail end of distribution.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58770 - Posted: 3 May 2022 | 8:56:12 UTC - in response to Message 58752.
Last modified: 3 May 2022 | 9:23:28 UTC

Sorry for the late reply Greg _BE, I was away for the last 5 days. Thank you very much for the detailed report.

----------

1. Regarding this error:

Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15


It seems the process failed after raising the exception "The wandb backend process has shutdown". wandb is the Python package we use to send out logs about the agent training process. It provides useful information to better understand the task results. It seems the wandb process failed and then the whole task got stuck, which is why no progress was being made. Since it reached 7.88% progress, I assume it worked well until then. I need to review other jobs to see why this could be happening and whether it happened on other machines. We had not detected this issue before. Thanks for bringing it up.

----------

2. Time estimation is not right for now due to the way BOINC calculates it. Richard provided a very complete explanation in a previous post. We hope it will improve over time... for now, be aware that it is completely wrong.

----------

3. Regarding this error:

OSError: [WinError 1455] The paging file is too small for this operation to complete

It is related to using PyTorch on Windows. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback
We are applying this solution to mitigate the error, but for now it cannot be eliminated completely.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58771 - Posted: 3 May 2022 | 8:59:20 UTC - in response to Message 58768.
Last modified: 3 May 2022 | 8:59:31 UTC

It seems deprecating version v4.01 did not work then... I will check if there is anything else we can do to enforce usage of v4.03 over the old one.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58772 - Posted: 3 May 2022 | 15:03:59 UTC - in response to Message 58771.
Last modified: 3 May 2022 | 15:05:20 UTC

You need to send a message to all hosts when they connect to the scheduler, telling them to physically delete the 4.01 application from the host and to delete its entry in the client_state.xml file.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58773 - Posted: 3 May 2022 | 15:36:10 UTC
Last modified: 3 May 2022 | 15:37:13 UTC

I sent a batch which will fail with

yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'


It is just an error with the experiment configuration. I immediately cancelled the experiment and fixed the configuration, but the tasks were already sent.

I am very sorry for the inconvenience. Fortunately the jobs will fail right after starting, so there is no need to kill them. The other batch contains jobs with the fixed configuration.
____________

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58774 - Posted: 3 May 2022 | 16:28:10 UTC

I was not getting too many of the python work units, but I recently received/completed one. I know they take... a while to complete.

Specifically, I am looking at task 32892659, work unit 27222901.

I am glad it completed, but it was a long haul.

It was mentioned that "completing a task gives 50000 credits and 75000 if completed specially fast"

How fast do these need to complete for 75000? I am not saying I have the fastest processors but they are definitely not slow (they are running at ~3GHz with the boost) and the GPUs are definitely not slow.

Thanks!

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58775 - Posted: 3 May 2022 | 19:15:29 UTC - in response to Message 58774.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and 25% boost in credit if returned within 2 days.
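As a rough illustration of that rule, here is a Python sketch. The 50,000 base is the figure quoted earlier in the thread, and treating the 1-day and 2-day limits as inclusive is my assumption; the function name is made up, and the server's real credit code may differ.

```python
from datetime import timedelta

BASE_CREDIT = 50_000  # base figure quoted earlier in the thread (assumption)


def credit_for(turnaround: timedelta, base: int = BASE_CREDIT) -> int:
    """Credit after the early-return bonus described above.

    turnaround is the time between the task being sent and being reported.
    """
    if turnaround <= timedelta(days=1):
        return int(base * 1.50)  # 50% boost when returned within 1 day
    if turnaround <= timedelta(days=2):
        return int(base * 1.25)  # 25% boost when returned within 2 days
    return base  # no bonus after 2 days


for hours in (8, 30, 60):
    print(hours, "hours ->", credit_for(timedelta(hours=hours)))
```

With these assumptions an 8 hour turnaround gives 75,000, a 30 hour turnaround 62,500, and anything over 48 hours the plain 50,000.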

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58776 - Posted: 3 May 2022 | 19:22:50 UTC - in response to Message 58775.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and 25% boost in credit if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58777 - Posted: 3 May 2022 | 20:02:01 UTC - in response to Message 58776.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and 25% boost in credit if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?


these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task.
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58778 - Posted: 3 May 2022 | 21:36:29 UTC - in response to Message 58777.

these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task.

The critical point being that they aren't declared to BOINC as needing multiple cores, so BOINC doesn't automatically clear extra CPU space for them to run in.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58779 - Posted: 4 May 2022 | 8:20:54 UTC - in response to Message 58778.

Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58780 - Posted: 4 May 2022 | 8:22:06 UTC - in response to Message 58777.
Last modified: 4 May 2022 | 8:24:20 UTC

Yes, the tasks run 32 agent environments in parallel Python processes. The CPU could definitely be the bottleneck, because BOINC is not aware of this.
____________

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58781 - Posted: 4 May 2022 | 11:57:25 UTC

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 58782 - Posted: 4 May 2022 | 12:17:13 UTC - in response to Message 58781.

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?


Only if you have more than 64 threads per GPU available and you stop processing of any existing CPU work.
____________

captainjack
Joined: 9 May 13
Posts: 171
Credit: 4,282,730,025
RAC: 5,947,065
Message 58783 - Posted: 4 May 2022 | 14:31:38 UTC

abouh asked

Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side.


I tried that, but BOINC Manager on my PC will overallocate CPUs. I am currently running multicore ATLAS CPU tasks from LHC alongside the Python tasks from GPUGrid. The ATLAS tasks are set to use 8 CPUs and the Python tasks are set to use 10 CPUs. The example for this response is on an AMD CPU with 8 cores/16 threads. BOINC is set to use 15 threads. It will run one GPUGrid Python 10-thread task and one LHC 8-thread task at the same time. That is 18 threads budgeted against the 15 threads BOINC is allowed to use.

Here is my app_config for gpugrid:

<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>PythonGPUbeta</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>Python</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
</app_config>


And here is my app_config for lhc:

<app_config>
<app>
<name>ATLAS</name>
<cpu_usage>8</cpu_usage>
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>8</avg_ncpus>
<cmdline>--nthreads 8</cmdline>
</app_version>
</app_config>


If anyone has any suggestions for changes to the app_config files, please let me know.
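For reference, the over-allocation described above is just the sum of the per-task CPU budgets minus the thread limit. A tiny Python sketch using the numbers from this post (the function name is made up, and it only does the arithmetic; it says nothing about why the client schedules this way):

```python
from typing import Iterable


def overallocated_threads(task_budgets: Iterable[float], allowed_threads: float) -> float:
    """How many budgeted threads exceed the limit BOINC is allowed to use."""
    return max(0, sum(task_budgets) - allowed_threads)


# One 10-thread Python task plus one 8-thread ATLAS task, 15 threads allowed.
print(overallocated_threads([10, 8], 15))  # 3
```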

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58785 - Posted: 4 May 2022 | 17:39:52 UTC - in response to Message 58781.
Last modified: 4 May 2022 | 17:41:36 UTC

I can run 2 jobs manually on my machine with 12 CPUs, in parallel. They are slower than a single job, but much faster than running them sequentially.

Especially since the jobs alternate between using the CPU and using the GPU. Two jobs won't be completely synchronized, so this works as long as the GPU has enough memory.

However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM.
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58786 - Posted: 4 May 2022 | 19:09:06 UTC - in response to Message 58785.

However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM.

Normally, the user's BOINC client will assign the GPU device number, and this will be conveyed to the job by the wrapper.

You can easily run two jobs per GPU (both with the same device number), and give them both two full CPU cores each, by using an app_config.xml file including

...
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>2.0</cpu_usage>
</gpu_versions>
...

(full details in the user manual)
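To see what those two values imply, here is a short Python sketch that reads the fragment above and works out how many tasks share one GPU and how many CPU cores BOINC budgets for them. The function name is made up for illustration, and the small epsilon is only there to guard against floating point rounding.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

FRAGMENT = """
<gpu_versions>
  <gpu_usage>0.5</gpu_usage>
  <cpu_usage>2.0</cpu_usage>
</gpu_versions>
"""


def tasks_and_cores_per_gpu(xml_text: str) -> Tuple[int, float]:
    """Return (tasks sharing one GPU, CPU cores budgeted for those tasks)."""
    node = ET.fromstring(xml_text.strip())
    gpu_usage = float(node.findtext("gpu_usage"))
    cpu_usage = float(node.findtext("cpu_usage"))
    # Each task claims gpu_usage of a GPU, so 1 / gpu_usage tasks fit on one.
    tasks = int(1.0 / gpu_usage + 1e-9)
    return tasks, tasks * cpu_usage


print(tasks_and_cores_per_gpu(FRAGMENT))  # (2, 4.0)
```

Note that cpu_usage only affects how many cores BOINC sets aside; it does not limit how many threads the application itself starts.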

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58788 - Posted: 5 May 2022 | 7:34:11 UTC - in response to Message 58786.
Last modified: 5 May 2022 | 7:34:33 UTC

I see, thanks for the clarification
____________

Greg _BE
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Message 58789 - Posted: 5 May 2022 | 22:33:28 UTC
Last modified: 5 May 2022 | 22:34:23 UTC

I guess I am going to have to give up on this project.
All I get is exit child errors. Every single task.
For example: https://www.gpugrid.net/result.php?resultid=32894080

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58790 - Posted: 6 May 2022 | 7:14:02 UTC - in response to Message 58789.

This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent, and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.
____________

Greg _BE
Joined: 30 Jun 14
Posts: 137
Credit: 122,677,395
RAC: 21,115
Message 58791 - Posted: 6 May 2022 | 15:52:36 UTC - in response to Message 58790.

This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent, and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.



ok...waiting in line for the next batch.

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58830 - Posted: 20 May 2022 | 16:42:10 UTC - in response to Message 58778.

I am still attempting to diagnose why these tasks are taking the system so long to complete. I changed the config to "reserve" 32 cores for these tasks. I did also make a change so I have two of these tasks running simultaneously- I am not clear on these tasks and multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used for one of these tasks physical or logical? Also, I am relatively confident the GPUs can handle this (RTX A6000) but let me know if I am missing something.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58831 - Posted: 20 May 2022 | 19:47:55 UTC - in response to Message 58830.

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish?

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 486
Credit: 11,541,308,617
RAC: 4,382,658
Message 58832 - Posted: 20 May 2022 | 21:46:06 UTC - in response to Message 58831.

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish.

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.



They should be put back into the beta category. They still have too many bugs and need more work. It looks like someone was in a hurry to leave for summer vacation. I decided to stop crunching them, for now. Of course, there isn't much to crunch here anyway, right now.

There is always next fall to fix this.....................




Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58833 - Posted: 20 May 2022 | 21:53:38 UTC - in response to Message 58831.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more.

This website confuses matters by mis-reporting the elapsed time as the total (summed over all cores) CPU time.

The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores: the penultimate column ('et ...') is the elapsed - wall clock - time for the task as a whole.

Locally, ct will be above et for the task as a whole, but on this website, they will be reported as the same.
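For anyone who would rather not count columns, here is a minimal Python sketch that splits a job log line into named fields. The layout (a leading timestamp followed by alternating key/value tokens) is inferred from a single job log line posted in this thread, so treat it as an assumption rather than a documented format; the function name is made up.

```python
# Sample line as posted by a user in this thread.
SAMPLE = (
    "1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 "
    "nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 "
    "et 117973.295733 es 0"
)


def parse_job_log_line(line: str) -> dict:
    """Split one job log line into a dict of named fields."""
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    # The remaining tokens alternate between a short key and its value.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        # 'nm' is the task name; every other field in the sample is numeric.
        record[key] = value if key == "nm" else float(value)
    return record


rec = parse_job_log_line(SAMPLE)
print("task:", rec["nm"])
print("CPU time (ct), summed over cores:", rec["ct"])
print("elapsed time (et), wall clock:", rec["et"])
```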

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58834 - Posted: 20 May 2022 | 23:09:30 UTC - in response to Message 58832.

I'm not having any issues with them on Linux. I don't know how that compares to Windows hosts.

I get at least a couple a day per host for the past several weeks.

Nothing like a month ago when there were a thousand or so available.

I doubt we ever return to the production of years ago unfortunately.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58844 - Posted: 23 May 2022 | 7:49:06 UTC - in response to Message 58830.
Last modified: 24 May 2022 | 7:30:27 UTC

The 32 cores are logical: they are Python processes running in parallel. I can run them locally on a 12-CPU machine. The GPU should be fine as well, so you are correct about that.

We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime should be much less than reported.

It would be very helpful if you could let us know if that is the case. In particular, getting 75000 credits per job means the jobs are getting 25% extra credits for returning fast.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58845 - Posted: 23 May 2022 | 8:42:25 UTC - in response to Message 58832.

We decided to remove the beta flag from the current version of the Python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local Linux and Windows machines, there is a vast variety of configurations, versions and resource capabilities among the hosts, and it will not work on all of them.

However, please note that in research at some point we need to start doing experiments (I want to talk more about that in my next post). Further testing and fixing is required and we are committed to doing it. This takes a long time, so we need to work on both things in parallel. We will still use the beta app to test new versions.

If you are talking about a specific recurring problem on your machines, please let me know and I will look into it.

____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58846 - Posted: 23 May 2022 | 8:44:31 UTC - in response to Message 58844.

I'm away from my machines at the moment, but can confirm that's the case.

Look at task 32897902. Reported time 108,075.00 seconds (well over a day), but got 75,000 credits. It was away from the server for about 11 hours. GTX 1660, Linux Mint.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58847 - Posted: 23 May 2022 | 9:21:48 UTC - in response to Message 58834.
Last modified: 23 May 2022 | 9:23:30 UTC

I am not sure about the acemd tasks, but for Python tasks, I will increase the number of tasks progressively.

To recap a bit about what we are doing: we are experimenting with populations of machine learning agents, trying to figure out how important social interactions and information sharing are for intelligent agents. More specifically, we train multiple agents for periods of time on different GPUGrid machines, which later return to the server to report their results. We are researching what kind of information they can share and how to build a common knowledge base, similar to what we humans do. Afterwards, new generations of the population repeat the process, but already equipped with the knowledge distilled by previous generations.

At the moment we have several experiments running with population sizes of 48 agents, which means a batch of 48 agents every 24-48h. We also have one experiment of 64 agents and one of 128. To my knowledge no recent paper has tried more than 80, and we plan to keep increasing the population sizes to figure out how relevant that is for intelligent agent behavior. Ideally I would like to reach population sizes of 256, 512 and 1024.
____________

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58848 - Posted: 23 May 2022 | 14:00:55 UTC - in response to Message 58833.

Thanks for this info. Here is the log file for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58853 - Posted: 23 May 2022 | 21:18:38 UTC - in response to Message 58848.

Thanks for this info. Here is the log file for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?



No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.

Look at the sent time and the returned time to calculate how long the task actually took to process. Returned time minus the sent time = length of time to process.
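As a quick sketch of that subtraction in Python: the timestamp format below is assumed to match the "day month year | time UTC" style shown on this site, and the two example timestamps are made up for illustration. Keep in mind the result also includes any time the task spent waiting in the host's queue before it started.

```python
from datetime import datetime

# Assumed to match timestamps such as "3 May 2022 | 19:15:29 UTC".
FMT = "%d %b %Y | %H:%M:%S"


def turnaround_hours(sent: str, returned: str) -> float:
    """Hours between the sent time and the returned time of a task."""
    start = datetime.strptime(sent.replace(" UTC", ""), FMT)
    end = datetime.strptime(returned.replace(" UTC", ""), FMT)
    return (end - start).total_seconds() / 3600.0


# Hypothetical timestamps, for illustration only.
print(turnaround_hours("22 May 2022 | 10:00:00 UTC", "23 May 2022 | 11:30:00 UTC"))  # 25.5
```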

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58855 - Posted: 23 May 2022 | 23:41:45 UTC

BOINC just does not know how to account for these Python tasks which act "sorta" like an MT task.

But BOINC does not handle MT tasks correctly either for that matter.

Blame it on the BOINC code which is old. Like it knows how to handle a task on a single cpu core and that is about all it gets right.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58856 - Posted: 24 May 2022 | 6:26:28 UTC - in response to Message 58853.

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.

Actually, that line (from the client job log) is a useful source of information. It contains both

ct 3544023.000000

which is the CPU or core time - as you say, it dates back to the days when CPUs only had one core. But now, it comprises the sum over all of however many cores are used.

and et 117973.295733

That's the elapsed time (wallclock measure), which was added when GPU computing was first introduced and CPU time was no longer a reliable indicator of work done.
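Putting the two numbers from the job log line quoted above side by side, a short Python sketch (the function name is made up; the values are the ct and et fields from that line):

```python
from typing import Tuple


def summarize(ct_seconds: float, et_seconds: float) -> Tuple[float, float]:
    """Return (elapsed hours, average number of cores kept busy)."""
    elapsed_hours = et_seconds / 3600.0
    # CPU time is summed over all cores, so dividing by the elapsed time
    # gives the average number of cores in use over the run.
    average_cores = ct_seconds / et_seconds
    return elapsed_hours, average_cores


hours, cores = summarize(ct_seconds=3544023.000000, et_seconds=117973.295733)
print(f"elapsed: {hours:.2f} h, average cores busy: {cores:.2f}")
```

That works out to roughly 32.8 hours of wall clock time with about 30 cores busy on average, which fits the 32 parallel processes mentioned earlier in the thread.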

I agree that many outdated legacy assumptions remain active in BOINC, but I think it's got beyond the point when mere tinkering could fix it - we really need a full Mark 2 rewrite. But that seems unlikely under the current management.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58858 - Posted: 24 May 2022 | 17:24:20 UTC

OK, so here is a back of the napkin calculation on how long the task actually took to crunch

Take the et time from the job_log entry for the task and divide by 32 since the tasks spawn 32 processes on the cpu to account for the way that BOINC calculates cpu_time accumulated across all cores crunching the task.

So 117973.295733 / 32 = 3686.665491656 seconds

or in reality a little over an hour to crunch.

That agrees with the wall clock time (reported - sent) times I have been observing for the shorty demo tasks that are currently being propagated to hosts.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58859 - Posted: 24 May 2022 | 18:02:39 UTC - in response to Message 58858.

Well, since there's also a 'nm' (name) field in the client job log, we can find the rest:

Task 32897743, run on host 588658.

Because it's a Windows task, there's a lot to digest in the std_err log, but it includes

04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
04:44:21 (34948): wrapper: running python.exe (run.py)

13:32:28 (7456): wrapper (7.9.26016): starting
13:32:28 (7456): wrapper: running python.exe (run.py)
(that looks like a restart)
Then some more of the same, and finally

14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 58860 - Posted: 24 May 2022 | 18:32:40 UTC


14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)

So 2816214 / 32 = 88006 seconds

88006 / 3600 seconds = 24.44 hours

That is close to matching the received time minus the sent time of a little over a day.

The task didn't get the full 50% credit bonus for returning within 24 hours but did get the 25% bonus.

I'm very surprised that that card is so slow or that the card is that slow when working with a cpu clocked to 2.7Ghz in Windows.

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58861 - Posted: 25 May 2022 | 16:55:51 UTC - in response to Message 58860.




I'm very surprised that that card is so slow or that the card is that slow when working with a cpu clocked to 2.7Ghz in Windows.


That is what I am confused about. I can tell you that these calculations of time seem accurate- it was somewhere around 24 hours that it was actually running. Also, the CPU was running closer to 3.1Ghz (boost). It barely pushed the GPU when running. Nothing changed with time when I reserved 32 cores for these tasks. I really can't nail down the issue.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 58862 - Posted: 25 May 2022 | 17:09:27 UTC

As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase." (message 58590). Any instantaneous observation won't reveal the full situation: either CPU will be high, and GPU low, or vice-versa.
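A toy Python model of that alternation, to show why a single glance at a monitor can mislead. The two phase lengths are invented numbers, not measurements from the real application, and the model simply assumes the CPU is busy during one phase and the GPU during the other.

```python
def average_utilization(cpu_phase_s: float, gpu_phase_s: float) -> dict:
    """Average share of time each resource is busy over one full cycle.

    Toy assumption: the CPU works during the environment phase and the GPU
    during the network update phase, with no overlap between the two.
    """
    cycle = cpu_phase_s + gpu_phase_s
    return {"cpu": cpu_phase_s / cycle, "gpu": gpu_phase_s / cycle}


# Hypothetical cycle: 45 s of CPU-heavy work, then 15 s of GPU-heavy work.
print(average_utilization(45.0, 15.0))  # {'cpu': 0.75, 'gpu': 0.25}
```

In this made-up cycle the GPU averages 25% load even though it reads either near idle or near full at any given instant.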

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 36
Credit: 7,254,068,306
RAC: 475,233
Message 58863 - Posted: 25 May 2022 | 19:35:37 UTC - in response to Message 58862.

Yep- I observe the alternation. When I suspend all other work units, I can see that just one of these tasks will use a little more than half of the logical processors. I know it has been talked about that although it says it uses 1 processor (or, 0.996, to be exact) that it uses more. I am running E@H work units and I think that running both is choking the CPU. Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.

I am not sure there is a way to post images, but here are some links to show CPU and GPU usage when only running one python task. Is it supposed to use that much of the CPU?

https://i.postimg.cc/Kv8zcMGQ/CPU-Usage1.jpg
https://i.postimg.cc/LX4dkj0b/GPU-Usage-1.jpg
https://i.postimg.cc/tRM0PZdB/GPU-Usage-2.jpg

I am sorry for all of the questions.... just trying my best to understand.


Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58864 - Posted: 25 May 2022 | 19:37:27 UTC - in response to Message 58862.

As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase."

This can be seen clearly in the following two images.

Higher CPU - Lower GPU usage cycle:


Higher GPU - Lower CPU usage cycle:


The CPU and GPU usage graphs follow an anticyclical pattern.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58865 - Posted: 26 May 2022 | 1:17:49 UTC - in response to Message 58863.

Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.

I am sorry for all of the questions.... just trying my best to understand.


No, there isn't, not as a user. These are not real MT tasks, nor any other form that BOINC recognizes and provides configuration options for.

Your only solution is to run only one at a time via a max_concurrent statement in an app_config.xml file, and then also restrict the number of cores your other projects are allowed to use.
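
As a rough illustration of that workaround, here is a minimal Python sketch that generates such an app_config.xml. The app name PythonGPU is the <app_name> reported for these tasks elsewhere in this thread; the function name is made up for the example, and the resulting text would still have to be saved as app_config.xml in the GPUGRID project folder of your BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of an app_config.xml that caps concurrent tasks of one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # ET.indent only exists on Python 3.9+; the XML is valid without it.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumption: "PythonGPU" is the app name to limit; 1 means one task at a time.
    print(build_app_config("PythonGPU", 1))
```

After saving the file, the client has to re-read its config files (or be restarted) before the limit takes effect.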

That said, I don't know why you are having such difficulties. Maybe chalk it up to Windows, I don't know.

I run 3 other cpu projects at the same time as I run the GPUGrid Python on GPU tasks with 28-46 cpu cores being occupied by Universe, TN-Grid or yoyo depending on the host. Every host primarily runs Universe as the major cpu project.

No impact on the python tasks while running the other cpu apps.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 584
Credit: 10,640,776,387
RAC: 11,984,410
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58866 - Posted: 26 May 2022 | 12:51:52 UTC - in response to Message 58865.

No impact on the python tasks while running the other cpu apps.

Conversely, I notice a performance loss on other CPU tasks when Python tasks are running.
Yesterday I processed Python task e7a30-ABOU_rnd_ppod_demo_sharing_large-0-1-RND2847_2 on my host #186626.
It was received at 11:33 UTC, and the result was returned at 22:50 UTC.
During the same period, PrimeGrid PPS-MEGA CPU tasks were also being processed.
The average processing time for eighteen (18) PPS-MEGA CPU tasks was 3098.81 seconds.
The average processing time for 18 other PPS-MEGA CPU tasks processed outside that period was 2699.11 seconds.
This represents an extra processing time of about 400 seconds per task, or about a 12.9% performance loss.
There is no such noticeable difference when running GPUGrid ACEMD tasks.
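
The arithmetic behind those figures can be checked with a short Python sketch. The function name is only illustrative; the two input times are the ones quoted above. Note that the 12.9% is the loss measured against the slower (loaded) time, while the same 400 seconds measured against the undisturbed time is a somewhat larger percentage.

```python
def slowdown(baseline_s: float, loaded_s: float) -> dict:
    """Compare an undisturbed task time with the time under extra load."""
    extra = loaded_s - baseline_s
    return {
        "extra_seconds": extra,
        # Extra time relative to the undisturbed run.
        "runtime_increase_pct": 100.0 * extra / baseline_s,
        # Lost throughput: extra time relative to the loaded run.
        "throughput_loss_pct": 100.0 * extra / loaded_s,
    }


result = slowdown(baseline_s=2699.11, loaded_s=3098.81)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```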

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58867 - Posted: 26 May 2022 | 17:59:57 UTC

I also notice an impact on my running Universe tasks. Generally adds 300 seconds to the normal computation times when running in conjunction with a python task.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 4,282,730,025
RAC: 5,947,065
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58871 - Posted: 28 May 2022 | 2:03:28 UTC

Windows 10 machine running task 32899765. Had a power outage. When the power came back on, task was restarted but just sat there doing nothing. The stderr.txt file showed the following error:

file pythongpu_windows_x86_64__cuda102.tar
already exists. Overwrite with
pythongpu_windows_x86_64__cuda102.tar?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


Task was stalled waiting on a response.

BOINC was stopped and the pythongpu_windows_x86_64__cuda102.tar file was removed from the slots folder.

Computer was restarted then the task was restarted. Then the following error message appeared several times in the stderr.txt file.

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


Page file size was increased to 64000MB and rebooted.

Started task again and still got the error message about page file size too small. Then task abended.

If you need more info about this task, please let me know.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58876 - Posted: 28 May 2022 | 16:41:56 UTC - in response to Message 58871.

Thank you captainjack for the info.


1.

Interesting that the job gets stuck with:

(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


The job command line is the following:

7za.exe pythongpu_windows_x86_64__cuda102.tar -y


and I got from the application documentation (https://info.nrao.edu/computing/guide/file-access-and-archiving/7zip/7z-7za-command-line-guide):

7-Zip will prompt the user before overwriting existing files unless the user specifies the -y


So essentially -y assumes "Yes" on all Queries. Honestly I am confused by this behaviour, thanks for pointing it out. Maybe I am missing the x, as in

7za.exe x pythongpu_windows_x86_64__cuda102.tar -y


I will test it on the beta app.
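
For reference, a call with the explicit x command could be wrapped roughly like this. It is only a sketch under the assumption that the missing x (extract) command is what causes the prompt; the function names are made up for illustration and this is not the actual job script.

```python
import subprocess


def build_extract_command(archive: str, exe: str = "7za.exe") -> list:
    """Build a 7-Zip call: 'x' extracts with full paths, '-y' answers Yes to all queries."""
    return [exe, "x", archive, "-y"]


def extract(archive: str) -> int:
    """Run the extraction and return the exit code of the 7-Zip process."""
    return subprocess.run(build_extract_command(archive), check=False).returncode


if __name__ == "__main__":
    # Only prints the command line; call extract(...) to actually run it.
    print(" ".join(build_extract_command("pythongpu_windows_x86_64__cuda102.tar")))
```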




2.

Regarding the other error

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


is related to pytorch and nvidia and it only affects some windows machines. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback

TL;DR: Windows and Linux treat multiprocessing in python differently, and in windows each process commits much more memory, especially when using pytorch.

We use the script suggested in the link to mitigate the problem, but it could be that for some machines memory is still insufficient. Does that make sense in your case?
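
To see the difference on your own machine, this small Python sketch lists the process start methods that the multiprocessing module offers. On Windows only "spawn" is available, so every worker process starts a fresh interpreter and has to import PyTorch (and load its DLLs) again, whereas Linux can typically also fork an existing process. The function name is just for illustration.

```python
import multiprocessing as mp
import sys


def start_method_summary() -> dict:
    """Report which process start methods this platform offers and which one is the default."""
    return {
        "platform": sys.platform,
        "available": mp.get_all_start_methods(),
        # Note: asking for the default also fixes it for this interpreter session.
        "default": mp.get_start_method(),
    }


if __name__ == "__main__":
    for key, value in start_method_summary().items():
        print(f"{key}: {value}")
```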


____________

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 4,282,730,025
RAC: 5,947,065
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58878 - Posted: 29 May 2022 | 14:19:12 UTC

Thank you abouh for responding,

I looked through my saved messages from the task to see if there was anything else I could find that might be of value and couldn't find anything.

In regard to the "out of memory" error, I tried to read through the stackoverflow link about the memory error. It is way above my level of technical expertise at this point, but it seemed like the amount of nvidia memory might have something to do with it. I am using an antique GTX970 card. It's old but still works.

Good luck coming up with a solution. If you want me to do any more testing, please let me know.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58879 - Posted: 31 May 2022 | 8:20:29 UTC
Last modified: 31 May 2022 | 8:21:09 UTC

It seems there are some possible workarounds here:

https://github.com/Spandan-Madan/Pytorch_fine_tuning_Tutorial/issues/10

basically, two users mentioned


I think I managed to solve it (so far). Steps were:

1)- Windows + pause key
2)- Advanced system settings
3)- Advanced tab
4)- Performance - Settings button
5)- Advanced tab - Change button
6)- Uncheck the "Automatically... BLA BLA" checkbox
7)- Select the System managed size option box.


and

If it's of any value, I ended up setting the values into manual and some ridiculous amount of 360GB as the minimum and 512GB for the maximum. I also added an extra SSD and allocated all of it to Virtual memory. This solved the problem and now I can run up to 128 processes using pytorch and CUDA.
I did find out that every launch of Python and pytorch, loads some ridiculous amount of memory to the RAM and then when not used often goes into the virtual memory.


Maybe it can be helpful for someone
____________

bibi
Send message
Joined: 4 May 17
Posts: 15
Credit: 16,302,257,990
RAC: 10,855,198
Level
Trp
Scientific publications
watwatwatwatwat
Message 58880 - Posted: 1 Jun 2022 | 19:13:37 UTC - in response to Message 58876.

Hi abouh,

is there a commandline like
7za.exe pythongpu_windows_x86_64__cuda102.tar.gz
without -y to get pythongpu_windows_x86_64__cuda102.tar ?

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,510,442,777
RAC: 762,654
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58881 - Posted: 1 Jun 2022 | 20:23:47 UTC

So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125

All kinds of errors on other tasks, from card too old (1080 Ti) to out of RAM.
At the moment, commit charge is 70 GB and RAM usage is 22 GB of 64 GB.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58883 - Posted: 6 Jun 2022 | 9:49:06 UTC - in response to Message 58880.

The command line

7za.exe pythongpu_windows_x86_64__cuda102.tar.gz


works fine if the job is executed without interruptions.

However, in case the job is interrupted and restarted later, the command is executed again. Then, 7za needs to know whether or not to replace the already existing files with the new ones.

The flag -y is just to make sure the script does not get stuck in that command prompt waiting for an answer.

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58884 - Posted: 6 Jun 2022 | 10:29:20 UTC - in response to Message 58881.

Unfortunately, recent versions of PyTorch do not support all GPUs; older ones might not be compatible...

Regarding this error

RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125


Does it happen repeatedly on the same machine, or does it depend on the job?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58886 - Posted: 6 Jun 2022 | 20:18:53 UTC - in response to Message 58881.

So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431

All kinds of errors on other tasks, from card too old (1080 Ti) to out of RAM.
At the moment, commit charge is 70 GB and RAM usage is 22 GB of 64 GB.


The problem is not with the card but with the Windows environment.

I have no issues running the Python on GPU tasks in Linux on my 1080 Ti card.

https://www.gpugrid.net/results.php?hostid=456812

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58906 - Posted: 11 Jun 2022 | 19:18:07 UTC

Well so far, these new python WU's have been consistently completing and even surviving multiple reboots, OS kernel upgrades, and OS upgrades:

Kernels --> 5.17.13
OS Fedora35 --> Fedora36

3 machines w/GTX-1060 510.73.05

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58907 - Posted: 11 Jun 2022 | 20:31:43 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.

Very nice compared to the acemd3/4 tasks, which will error out under similar circumstances.

The Python tasks create and reread checkpoints very well. Upon restart the task will show 1% completion but after a while jump forward to the point that the task was stopped, exited or suspended and continue on till the finish.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58915 - Posted: 12 Jun 2022 | 15:36:57 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to patch their estimated time to completion. XXXdays remaining makes it impossible to have any in a cache.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58919 - Posted: 12 Jun 2022 | 18:37:36 UTC

I haven't had any reason to carry a cache. I have my cache level set at only one task for each host as I don't want GPUGrid to monopolize my hosts and compete with my other projects.

That said, I haven't gone 12 hours without a Python task on every host at all times.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58920 - Posted: 12 Jun 2022 | 18:40:52 UTC - in response to Message 58915.

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to patch their estimated time to completion. XXXdays remaining makes it impossible to have any in a cache.


BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible for BOINC to decipher.

The closest mechanism is the MT or multi-threaded category, but that only knows about tasks which run solely on the cpu.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58936 - Posted: 17 Jun 2022 | 16:55:57 UTC

BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible for BOINC to decipher.

The closest mechanism is the MT or multi-threaded category, but that only knows about tasks which run solely on the cpu.


I think BOINC uses the CPU exclusively in its Estimated Time to Completion algorithm for all WUs, including those using a GPU, which makes sense since the job cannot complete until the work on both processors is complete. Observing GPU work with E@H, it appears that the GPU finishes first and the CPU continues for a period of time to do what is necessary to wrap the job up for return, and those BOINC ETCs are fairly accurate.

It is the multi-threaded WUs mentioned, like these Python jobs, that appear to be throwing a monkey wrench at the ETC. From my observations, the Python WUs use 32 processes regardless of the actual system configuration. I have two 16-core Ryzens and my old 8-core FX-8350, and each runs 32 processes per WU. It seems to me that the existing algorithm could be used in a modular fashion: assume a single-threaded CPU job for the MT WU and calculate the estimated time; then, knowing the number of processes the WU is requesting compared with those available on the system, perform a simple division to produce a more accurate result for MT WUs as well. I don't know for sure, just speculating, but I do have the BOINC source code and might take a look and see if I can find the ETC stuff. Might be interesting.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58937 - Posted: 17 Jun 2022 | 17:57:47 UTC - in response to Message 58936.

The server code for determining the ETC for MT tasks also has to account for task scheduling.

If it were adjusted as you suggest, then any time a Python task ran on the host, the server would declare the host severely overcommitted and prevent any other work from running. Or worse, it would prevent the Python task itself from running, just as it prevents work from other projects from running, in accordance with the resource share and the round-robin scheduling algorithm in the server and client.

It is a mess already with MT work; I believe it would be even worse accounting for these mixed-platform cpu-gpu Python tasks.

But go ahead and look at the code. Also you should raise an issue on BOINC's Github repository so that the problem is logged and can be tracked for progress.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58943 - Posted: 18 Jun 2022 | 18:52:40 UTC

You make a good point regarding the server-side issues. Perhaps the projects themselves, if they do not already, could submit their desired resources so that the server can compare them with those available on clients, similar to submitting in-house cluster jobs. I also agree that it is probably best to go through BOINC git and file a request for a potential fix, but I also want to see their ETC algorithms just out of curiosity, both server and client. Nice, interesting discussion.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58944 - Posted: 18 Jun 2022 | 20:04:43 UTC

You need to review the code in the /client/work_fetch.cpp module and any of the old closed issues pertaining to use of max_concurrent statements in app_config.xml.

I have posted many conversations on this issue and collaborated with David Anderson and Richard Haselgrove to understand the issue and have seen at least six attempts to fix the issue once and for all.

A very complicated part of the code. You might also want to review many of the client emulator bug-fix runs done on this topic.

https://boinc.berkeley.edu/sim_web.php

The meat of the issue was in PR's #2918 #3001 #3065 #3076 #4117 and #4592

https://github.com/BOINC/boinc/pull/2918

Focus on the round-robin scheduling part of the code.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58949 - Posted: 20 Jun 2022 | 1:14:28 UTC

Thank you Keith, much appreciated background and starting points.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58961 - Posted: 26 Jun 2022 | 6:44:33 UTC

I need advice with regard to running Python on one of my Windows machines:

On one of the Windows systems, with a GTX980Ti, CPU Intel i7-4930K and 32GB RAM, Python runs well.
GPU memory usage is almost constant at 2,679MB; system memory usage varies between ~1,300MB and ~5,000MB. Task runtime is between ~510,000 and ~530,000 secs.

The other Windows system has two RTX3070, CPU Intel i9-10900KF and 64GB RAM, of which 32GB are used for a Ramdisk, leaving 32GB system RAM.
When trying to download Python tasks, the BOINC event log says that some 22GB more RAM are needed.
How come?
From what I see on the other machine, Python uses between 1.3GB and 5GB RAM.

What can I do in order to get the machine with the two RTX3070 to download and crunch Python tasks?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58962 - Posted: 26 Jun 2022 | 7:09:04 UTC - in response to Message 58961.

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58963 - Posted: 26 Jun 2022 | 7:30:14 UTC - in response to Message 58962.

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.

here is the text of the log message:

26.06.2022 09:20:35 | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
26.06.2022 09:20:37 | GPUGRID | Scheduler request completed: got 0 new tasks
26.06.2022 09:20:37 | GPUGRID | No tasks sent
26.06.2022 09:20:37 | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
26.06.2022 09:20:37 | GPUGRID | Project requested delay of 31 seconds


The reason why it says at this point that I have 10,982MB available is that I currently have some LHC projects running, which use some RAM.
However, it also says that I need 33,378MB RAM, so my 32GB RAM are not enough anyway (whereas on the other machine, on which I also have 32GB RAM, there is no problem with downloading and crunching Python).

What surprises me is that the project requests so much free RAM although, while in operation, it uses only between 1.3 and 5GB.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58964 - Posted: 26 Jun 2022 | 8:06:41 UTC - in response to Message 58963.

26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.
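
If it helps to see the three numbers side by side, a small Python sketch like this pulls them out of the server message and confirms that they are consistent (required minus available equals the shortfall). The regular expression is written only for the wording quoted above; it is an illustration, not something BOINC provides.

```python
import re

MESSAGE = (
    "Python apps for GPU hosts needs 22396.05MB more disk space. "
    "You currently have 10982.55 MB available and it needs 33378.60 MB."
)

# Matches only the message format quoted in this thread.
PATTERN = re.compile(
    r"needs (?P<missing>[\d.]+)\s*MB more disk space\. "
    r"You currently have (?P<available>[\d.]+)\s*MB available "
    r"and it needs (?P<required>[\d.]+)\s*MB"
)


def parse_disk_message(text: str):
    """Return the missing, available and required disk space in MB, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


info = parse_disk_message(MESSAGE)
print(info)
print("consistent:", abs(info["required"] - info["available"] - info["missing"]) < 0.01)
```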

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58965 - Posted: 26 Jun 2022 | 8:21:42 UTC - in response to Message 58964.

26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.

Oh sorry, you are perfectly right. My mistake, how dumb :-(

So it does not work with my 32GB Ramdisk, since it says that it needs 33378MB.

What I could do, theoretically, is to shift BOINC from the Ramdisk to the 1 GB SSD. However, the reason why I installed BOINC on the Ramdisk was that the LHC Atlas tasks which I am crunching permanently have an enormous disk usage, and I don't want ATLAS to kill the SSD too early.

I guess that there might be ways to install a second instance of BOINC on the SSD - I tried this on another PC years ago, but somehow I did not get it done properly :-(

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58966 - Posted: 26 Jun 2022 | 9:32:13 UTC - in response to Message 58965.

You'll need to decide which copy of BOINC is going to be your 'primary' installation (default settings, autorun stuff in the registry, etc.), and which is going to be the 'secondary'.

The primary one can be exactly what is set up by the installer, with one change. The easiest way is to add the line

<allow_multiple_clients>1</allow_multiple_clients>

to the options section of cc_config.xml (or set the value to 1 if the line is already present). That needs a client restart if BOINC's already running.

Then, these two batch files work for me. Adapt program and data locations as needed.

To run the client:
D:\BOINC\rh_boinc_test --allow_multiple_clients --allow_remote_gui_rpc --redirectio --detach_console --gui_rpc_port 31418 --dir D:\BOINCdata2\

To run a Manager to control the second client:
start D:\BOINC\boincmgr.exe /m /n 127.0.0.1 /g 31418 /p password

Note that I've set this up to run test clients alongside my main working installation - you can probably ignore that bit.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58968 - Posted: 30 Jun 2022 | 15:56:37 UTC - in response to Message 58844.

We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime calculation should be much less than reported.

It would be very helpful if you could let us know if that is the case. In particular, if you are getting 75000 credits per jobs means the jobs are getting 25% extra credits for returning fast.

Are you still in need of that? My first Python ran for 12 hours 55 minutes according to BoincTasks, but the website reported 156,269.60 seconds (over 43 hours). It got 75,000 credits.
http://www.gpugrid.net/results.php?hostid=593715

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58969 - Posted: 1 Jul 2022 | 13:10:32 UTC - in response to Message 58968.
Last modified: 1 Jul 2022 | 13:11:38 UTC

Thanks for the feedback Jim1348! It is useful for us to confirm that jobs run in a reasonable time despite the wrong estimation issue. Maybe that can be solved somehow in the future. It seems that at least it did not estimate dozens of days, as I have seen on other occasions.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58970 - Posted: 1 Jul 2022 | 13:33:55 UTC - in response to Message 58969.

it's because the app is using the CPU time instead of runtime. since it uses so many threads, it adds up the time spent on all the threads. 2 threads working for 1hr total would be 2hrs reported CPU time. you need to track wall clock time. the app seems to have this capability since it reports timestamps of start and stop in the stderr.txt file.
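
a quick sketch of that relationship in Python, using the numbers Jim1348 posted above (12 hours 55 minutes elapsed according to BoincTasks, 156,269.60 seconds shown on the website). the function name is made up, and the "average busy threads" figure is only what the ratio implies, not something measured.

```python
def total_cpu_time(per_thread_busy_s):
    """CPU time is the sum of the time each thread spent working."""
    return sum(per_thread_busy_s)


def average_busy_threads(cpu_time_s: float, wall_time_s: float) -> float:
    """How many threads were busy on average over the wall-clock run."""
    return cpu_time_s / wall_time_s


# Two threads working for one hour each: 2 hours of CPU time in 1 hour of wall clock.
print(total_cpu_time([3600, 3600]) / 3600, "CPU hours")

wall_s = 12 * 3600 + 55 * 60   # 12 h 55 min elapsed
cpu_s = 156_269.60             # value reported on the website
print(f"{average_busy_threads(cpu_s, wall_s):.2f} threads busy on average")
```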

also credit reward is static, and should be a more dynamic scheme like the acemd3 tasks. look at Jim's tasks: you have tasks with 2,000 - 150,000 seconds (reported), all with the same 75,000 credit reward. good reward for the 2,000s runs, but painfully low for the longer ones (the majority).
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58971 - Posted: 1 Jul 2022 | 13:55:40 UTC - in response to Message 58970.

There are two separate problems with timing.

There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.

And there's the estimation of anticipated runtime when a task is first issued, before it's even started to run. I would have thought that would have started to correct itself by now: with the steady supply of work recently, we will have got well past all the trigger points for the server averaging algorithms.

Next time I see a task waiting to run, I'll trap the numbers and try to make sense of them.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 58972 - Posted: 1 Jul 2022 | 14:56:15 UTC - in response to Message 58971.



There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.



that may be true, NOW. however, if they move to a dynamic credit scheme (as they should) that awards credit based on flops and runtime (like ACEMD3 does), then the runtime will not be just cosmetic.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58973 - Posted: 1 Jul 2022 | 17:27:17 UTC - in response to Message 58971.

OK, I got one on host 508381. Initial estimate is 752d 05:26:18, task is 32940037

Size:
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

Speed:
<flops>707646935000.048218</flops>

DCF:
<duration_correction_factor>45.991658</duration_correction_factor>

App_ver:
<app_name>PythonGPU</app_name>
<version_num>403</version_num>

Host details:
Number of tasks completed 80
Average processing rate 13025.358204684

Calculated time estimate (size / speed):
1413134.079355548 [seconds]
16.355718511 [days - raw]
752.226612105 [days - adjusted by DCF]

So my client is doing the calculations right.

The glaring difference is between flops and APR.

Re-doing the {size / speed} calculation with APR gives
76773.320494203 [seconds]
21.32592236 [hours]

which is a little high for this machine, but not bad. The last 'normal length' tasks ran in about 14 hours.
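
For anyone who wants to repeat the sums, here is the same calculation as a short Python sketch. The constants are the values quoted above; APR is treated as GFLOPS, which is the assumption that reproduces the figures, and the function name is just for illustration.

```python
RSC_FPOPS_EST = 1_000_000_000_000_000_000.0   # <rsc_fpops_est>
FLOPS = 707_646_935_000.048218                # <flops> from the app_version
DCF = 45.991658                               # <duration_correction_factor>
APR_GFLOPS = 13025.358204684                  # Average processing rate (assumed GFLOPS)


def estimate_seconds(size_fpops: float, speed_flops: float) -> float:
    """Raw runtime estimate: task size divided by speed."""
    return size_fpops / speed_flops


raw_s = estimate_seconds(RSC_FPOPS_EST, FLOPS)
raw_days = raw_s / 86400
adjusted_days = raw_days * DCF
apr_s = estimate_seconds(RSC_FPOPS_EST, APR_GFLOPS * 1e9)

print(f"raw:      {raw_s:.3f} s = {raw_days:.3f} days")
print(f"with DCF: {adjusted_days:.3f} days")
print(f"with APR: {apr_s:.3f} s = {apr_s / 3600:.2f} hours")
```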

So, the question is: why is the server tracking APR, but not using it in the <app_version> blocks sent to our machines?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58974 - Posted: 2 Jul 2022 | 9:04:06 UTC

Yesterday's task is just in the final stages - it'll finish after about 13 hours - and the next is ready to start. So here are the figures for the next in the cycle.

Initial estimate: 737d 06:19:25
<flops>707646935000.048218</flops>
<duration_correction_factor>45.076802</duration_correction_factor>
Average processing rate 13072.709605774

So, APR and DCF have both made a tiny movement in the right direction, but flops has remained stubbornly unchanged. And that's the one that controls the initial estimates.

(actually, a little short one crept in between the two I'm watching, so it's two cycles - but that doesn't change the principle)

roundup
Send message
Joined: 11 May 10
Posts: 65
Credit: 10,319,928,875
RAC: 4,085,480
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58975 - Posted: 3 Jul 2022 | 10:17:13 UTC

The credits per runtime for cuda1131 really looks strange sometimes:

Task 27246643 2 Jul 2022 | 8:13:32 UTC 3 Jul 2022 | 8:20:56 UTC
Runtimes 445,161.60 445,161.60 Credits 62,500.00

Compare to this one:
Task 27246622 2 Jul 2022 | 7:55:03 UTC 2 Jul 2022 | 8:05:39 UTC
Runtimes 2,770.92 2,770.92 Credits 75,000.00

abouh
Message 58977 - Posted: 4 Jul 2022 | 13:54:08 UTC - in response to Message 58970.

Yes, you are right about that. There are 2 types of experiments I run now:

a) Normal experiments have tasks with a fixed target number of agent-environment interactions to process. A task finishes once this number of interactions is reached. All tasks require the same amount of compute, so it makes sense (at least to me) to reward them with the same amount of credits, even if some tasks are completed in less time due to faster hardware.

b) I have recently introduced an "early stopping" mechanism to some experiments. The upper bound is the same as in the other type of experiment: a fixed number of agent-environment interactions. However, if the agent discovers interesting results before that, it returns early so this information can be shared with the other agents in the population. Which agents will finish earlier, and how much earlier, is random, so it would be interesting to adjust the credits dynamically, yes. I will ask the acemd3 people how to do it.
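
The difference between the two task types can be illustrated with a minimal sketch. Everything in it is hypothetical: the "discovery" is a random draw standing in for whatever the real agent finds, and the 25M cap is the figure mentioned later in this thread.

```python
import random


def run_task(max_interactions, chunk_size, discovery_prob, rng):
    """Simulate one task: train in chunks, stop early on a 'discovery'.

    All parameters are hypothetical; a real task would train an agent
    instead of drawing random numbers.
    Returns (interactions_processed, stopped_early).
    """
    done = 0
    while done < max_interactions:
        # Process the next chunk of agent-environment interactions.
        done += min(chunk_size, max_interactions - done)
        # Stand-in for "the agent discovered something interesting".
        if done < max_interactions and rng.random() < discovery_prob:
            return done, True
    return done, False


rng = random.Random(0)
for task_id in range(5):
    steps, early = run_task(25_000_000, 1_000_000, 0.05, rng)
    print(task_id, steps, "early stop" if early else "ran to the cap")
```

Setting discovery_prob to 0 reproduces type (a): every task runs to the cap and does the same amount of work.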
____________

abouh
Message 58978 - Posted: 4 Jul 2022 | 16:51:10 UTC - in response to Message 58975.
Last modified: 5 Jul 2022 | 10:47:37 UTC

The credit system gives 50,000 credits per task. However, completion before a certain amount of time multiplies this value by 1.5, then by 1.25 for a while, and finally by 1.0 indefinitely. That explains why you sometimes see 75,000 and sometimes 62,500 credits.
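
As a rough sketch of that rule: the base credit and the multipliers below come from the post, but the 24-hour and 48-hour limits are assumptions, since the post only says "a certain amount of time". They do fit the two tasks quoted above, one returned after about ten minutes (75,000) and one after just over 24 hours (62,500).

```python
BASE_CREDIT = 50_000  # credits per task, from the post above


def task_credit(turnaround_hours, fast_limit_h=24.0, slow_limit_h=48.0):
    """Return the credit for a task given its sent-to-returned time in hours.

    The 24 h / 48 h limits are assumptions, not confirmed project settings.
    """
    if turnaround_hours < fast_limit_h:
        multiplier = 1.5
    elif turnaround_hours < slow_limit_h:
        multiplier = 1.25
    else:
        multiplier = 1.0
    return BASE_CREDIT * multiplier


for hours in (0.2, 24.1, 72.0):
    print(hours, "h ->", task_credit(hours))
```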
____________

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 440,521,294
RAC: 12,337
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58979 - Posted: 6 Jul 2022 | 22:59:17 UTC

I had an idea after reading some of the posts about utilisation of resources.

The power users here tend to have high-end hardware on the project, so would it be possible to support our hardware fully? E.g. I imagine that if you have 10-24 GB of VRAM, the whole simulation could be loaded into VRAM, giving additional performance to the project.

Additionally, the more modern cards have more ML-focused hardware-accelerated features. Are they well utilised?

abouh
Message 58980 - Posted: 7 Jul 2022 | 11:10:44 UTC - in response to Message 58979.
Last modified: 7 Jul 2022 | 11:11:36 UTC

The reason Reinforcement Learning agents do not currently use the whole potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent's "learning" process is the part that uses the GPU, and only intermittently.

There are, however, environments that only use GPU. They are becoming more and more common, so I see it as a real possibility that in the future most popular benchmarks of the field use only GPU. Then the jobs will be much more efficient since pretty much only GPU will be used. Unfortunately we are not there yet...

I am not sure if I am answering your question, please let me know if I am not.
____________

Toby Broom
Message 58981 - Posted: 7 Jul 2022 | 19:40:48 UTC - in response to Message 58980.

Thanks for the comments. What about using a large quantity of VRAM if available? The latest BOINC finally allows correct reporting of VRAM on NVidia cards, so you could tailor the WUs based on VRAM to protect the contributions from users with lower-specification computers.

FritzB
Send message
Joined: 7 Apr 15
Posts: 12
Credit: 2,784,207,771
RAC: 52,658
Level
Phe
Scientific publications
wat
Message 58995 - Posted: 10 Jul 2022 | 8:22:33 UTC

Sorry for the OT, but some people need admin help and I've seen one being active here :)

Password reset doesn't work, and there seems to have been an alternative method some years ago. Maybe this can be done again?

Please have a look in this thread: http://www.gpugrid.net/forum_thread.php?id=2587&nowrap=true#58958

Thanks!
Fritz

abouh
Message 59002 - Posted: 12 Jul 2022 | 7:38:42 UTC - in response to Message 58995.

Hi Fritz! Apparently the problem is that sending emails from the server no longer works. I will mention the problem to the server admin.


____________

abouh
Message 59003 - Posted: 15 Jul 2022 | 9:26:40 UTC - in response to Message 58995.

I talked to the server admin and he explained to me the problem in more detail.

The issue comes from the fact that the GPUGrid server uses a public IP from the Universitat Pompeu Fabra, so we have to comply with the data protection and security policies of the university. Among other things, this implies that we cannot send emails from our web server.

Unfortunately, that prevents us from fixing the password recovery problem.




____________

abouh
Message 59004 - Posted: 15 Jul 2022 | 9:46:45 UTC - in response to Message 58981.

Hello Toby,


For the python app, do you mean executing a script that automatically detects how much memory the GPU assigned to the task has, and then flexibly defines an agent that uses all (or most) of it? In other words, flexibly adapting to the host machine's capacity.

The experiments we are running at the moment require training AI agents in a sequence of jobs (i.e. starting to train an agent in a GPUGrid job, then sending it back to the server to evaluate its capabilities, then sending another job that loads the same agent and continues its training, evaluating again, etc.).

Consequently, current jobs are designed to work with a fixed amount of GPU memory, and we cannot set it too high since we want a high percentage of hosts to be able to run them.

However, it is true that by doing that we are sacrificing resources in GPUs with larger amounts of memory. You gave me something to think about: there could be situations in which it makes sense to use this approach, and it would indeed be a more efficient use of resources.
____________

Toby Broom
Message 59006 - Posted: 15 Jul 2022 | 16:46:27 UTC - in response to Message 59004.

BOINC can detect the quantity of GPU memory. It was bugged in older BOINC versions for NVidia cards, but in 7.20 it's fixed, so there would be no need to detect it in Python as it's already in the project database.

A variable job size, yes.

It's more work for you, but I can imagine there could be a performance boost? To keep it simple you could have S, M and L with, say, <4, 4-8 and >8 GB. The jobs for GPUs with more than 8 GB could be larger in general, as only the top-tier GPUs have this much VRAM.

It seems BOINC knows how to allocate to suitable computers. Worst case, you could make it opt-in.
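
The S/M/L idea could look something like the sketch below. The thresholds are the ones suggested in this post; the function name is made up for illustration, and how the project would actually obtain the VRAM figure from its database is left open.

```python
def job_size_for_vram(vram_gb):
    """Map reported GPU memory (in GB) to a job size class: 'S', 'M' or 'L'."""
    if vram_gb < 4:
        return "S"
    if vram_gb <= 8:
        return "M"
    return "L"


for vram in (2, 4, 6, 8, 11, 24):
    print(vram, "GB ->", job_size_for_vram(vram))
```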

Profile JohnMD
Avatar
Send message
Joined: 4 Dec 10
Posts: 5
Credit: 26,860,106
RAC: 0
Level
Val
Scientific publications
watwatwat
Message 59007 - Posted: 15 Jul 2022 | 20:20:42 UTC - in response to Message 58981.

Even video cards with 6GiB crash with insufficient VRAM.
The app is apparently not aware of available resources.
This ought to be the first priority before sending tasks to the world.

jjch
Send message
Joined: 10 Nov 13
Posts: 101
Credit: 15,740,982,209
RAC: 1,059,532
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59008 - Posted: 15 Jul 2022 | 20:47:00 UTC - in response to Message 59007.

From what we are finding right now, 6GB GPUs should have sufficient VRAM to run the current Python tasks. Refer to this thread, which notes between 2.5 and 3.2 GB being used: https://www.gpugrid.net/forum_thread.php?id=5327

If jobs running on GPUs with 4GB or more are crashing, then there is a different problem. Have to look at the logs to see what's going on.

It's more likely they are running out of system memory or swap space but there are a few that are failing from an unknown cause.

I took a quick look at the jobs you have which errored and I found the mx150 and mx350 GPUs only have 2GB VRAM. These are not sufficient to run the Python app.

Unfortunately, I would suggest you use these GPUs for another project that they are better suited to.

Richard Haselgrove
Message 59039 - Posted: 28 Jul 2022 | 9:18:17 UTC
Last modified: 28 Jul 2022 | 9:29:40 UTC

New generic error on multiple tasks this morning:

TypeError: create_factory() got an unexpected keyword argument 'recurrent_nets'

Seems to affect the entire batch currently being generated.

abouh
Message 59040 - Posted: 28 Jul 2022 | 9:41:25 UTC - in response to Message 59039.
Last modified: 28 Jul 2022 | 9:42:38 UTC

Thanks for letting us know Richard. It is a minor error, sorry for the inconvenience, I am fixing it right now. Unfortunately the remaining jobs of the batch will crash but then will be replaced with correct ones.
____________

Richard Haselgrove
Message 59042 - Posted: 28 Jul 2022 | 10:45:43 UTC

No worries - these things happen. The machine which alerted me to the problem now has a task 'created 28 Jul 2022 | 10:33:04 UTC' which seems to be running normally.

The earlier tasks will hang around until each of them has gone through 8 separate hosts, before your server will accept that there may have been a bug. But at least they don't waste much time.

abouh
Message 59043 - Posted: 28 Jul 2022 | 13:38:06 UTC - in response to Message 59042.

Yes exactly, it has to fail 8 times... the only good part is that the bugged tasks fail at the beginning of the script so almost no computation is wasted. I have checked and some of the tasks in the newest batch have already finished successfully.
____________

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 768,747,044
RAC: 139,794
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59071 - Posted: 6 Aug 2022 | 19:47:50 UTC

A peculiarity of Python apps for GPU hosts 4.03 (cuda1131):

If BOINC is shut down while such a task is in progress, then restarted, the task will show 2% progress at first, even if it was well past this before the shutdown.

However, the progress may then jump past 98% at the next time a checkpoint is written, which looks like the hidden progress is recovered.

Not a definite problem, but you should be aware of it.

Richard Haselgrove
Message 59076 - Posted: 7 Aug 2022 | 14:08:51 UTC

I've been monitoring and playing with the initial runtime estimates for these tasks.



The Y-axis has been scaled by various factors of 10 to make the changes legible.

The initial estimates (750 days to 230 days) are clearly dominated by the DCF (real numbers, unscaled).

The <flops> value is the speed of processing assumed by the server: 707 or 704 GigaFlops. There's a tiny jump midway through the month, which correlates with a machine software update, including a new version of BOINC, and reboot. That will have triggered a CPU benchmark run.

The DCF (client controlled) has been falling very, very, slowly. It's so far distant from reality that BOINC moves it at an ultra-cautious 1% of the difference at the conclusion of each successful run. The changes in slope come about because of the varying mixture of short-running (early exit) tasks and full-length tasks.
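
To get a feel for how slow a "1% of the difference" correction is, here is a simplified model. It is not BOINC client code: it only applies the rule as described in this post, and it assumes every task is a full-length run of about 14 hours against the raw estimate of roughly 1,413,134 seconds quoted earlier.

```python
def dcf_after_run(dcf, target, rate=0.01):
    """Move DCF a fraction `rate` of the way toward `target`.

    Simplified model of the behaviour described above, not BOINC client code.
    """
    return dcf + rate * (target - dcf)


def runs_until_below(dcf, target, threshold, rate=0.01):
    """Count successful runs needed before DCF drops below `threshold`."""
    if threshold <= target:
        raise ValueError("threshold must be above the target DCF")
    runs = 0
    while dcf >= threshold:
        dcf = dcf_after_run(dcf, target, rate)
        runs += 1
    return runs


RAW_ESTIMATE_S = 1413134.08   # size / <flops>, from earlier in the thread
ACTUAL_RUNTIME_S = 14 * 3600  # assumed full-length task of about 14 hours
target_dcf = ACTUAL_RUNTIME_S / RAW_ESTIMATE_S

print("target DCF:", round(target_dcf, 4))
print("runs to get below 2.0:", runs_until_below(45.076802, target_dcf, 2.0))
```

Under these assumptions it takes a few hundred completed tasks before the DCF even drops below 2, which matches the "very, very slowly" observation.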

The APR has been wobbling about, again because of the varying mixture of tasks, but seems to be tracking the real world reasonably well. The values range from 13,000 to nearly 17,000 GigaFlops.

Conclusion:

The server seems to be estimating the speed of the client using some derivative of the reported benchmark for the machine. That's absurd for a GPU-based project: the variation in GPU speeds is far greater than the variation of CPU speeds. It would be far better to use the APR, but with some safeguards and greater regard to the actual numbers involved.

The chart was derived from host 508381, which has a measured CPU speed of 7.256 GigaFlops (roughly one-tenth of the speed assumed by the server), and all tasks were run on the same GTX 1660 Ti GPU, with a theoretical ('peak') speed of 5,530 GigaFlops. Congratulations to the GPUGrid programmers - you've exceeded three times the speed of light (according to APR)!

More seriously, that suggests that the 'size' setting for these tasks (fpops_est) - the only value that project actually has to supply manually - is set too low. This may have been the point at which the estimates started to go wrong.

One further wrinkle: BOINC servers can't fully allow for varying runtimes and early task exits. Old hands will remember the problems we had with 'dash-9' (overflow) tasks at SETI@home. We overcame that one by adding an 'outlier' pathway to the server code: if the project validator marks the task as an outlier, its runtime is disregarded when tracking APR - that keeps things a lot more stable. Details at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
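
The effect of the outlier pathway can be sketched as follows. This is only an illustration: it uses a plain mean rather than whatever averaging the server really applies, the task history is invented, and whether early-exit tasks would be flagged as outliers is the project's decision.

```python
FPOPS_EST = 1e18  # task size reported in this thread


def average_rate_gflops(results, skip_outliers=True):
    """Average processing rate in GigaFlops over (elapsed_seconds, is_outlier) pairs.

    A plain mean is used for clarity; the real server uses its own averaging.
    """
    rates = [
        FPOPS_EST / elapsed / 1e9
        for elapsed, is_outlier in results
        if not (skip_outliers and is_outlier)
    ]
    if not rates:
        raise ValueError("no usable results")
    return sum(rates) / len(rates)


# Hypothetical history: three ~14 h tasks and one 20-minute early exit.
history = [(50400, False), (50400, False), (1200, True), (50400, False)]
print("with outliers   :", round(average_rate_gflops(history, skip_outliers=False)))
print("without outliers:", round(average_rate_gflops(history)))
```

A single short task inflates the unfiltered average by an order of magnitude, which is the instability the outlier flag is meant to avoid.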

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59077 - Posted: 7 Aug 2022 | 16:05:07 UTC - in response to Message 59076.

or just use the flops reported by BOINC for the GPU. since it is recorded and communicated to the project. and from my experience (with ACEMD tasks) does get used in the credit reward for the non-static award scheme. so the project is certainly getting it and able to use that value.
____________

Richard Haselgrove
Message 59078 - Posted: 7 Aug 2022 | 17:08:51 UTC - in response to Message 59077.

Except:

1) A machine with two distinct GPUs only reports the peak flops of one of them. (The 'better' card, which is usually - but not always - the faster card).
2) Just as a GPU doesn't run at 10x the speed of the host CPU, it doesn't run realistic work at peak speed, either. That would involve yet another semi-realistic fiddle factor. And Ian will no doubt tell me that fancy modern cards, like Turing and Ampere, run closer to peak speed than earlier generations.

We need to avoid having too many moving parts - too many things to get wrong when the next rotation of researchers takes over.

Ian&Steve C.
Message 59099 - Posted: 11 Aug 2022 | 22:40:09 UTC - in response to Message 59078.

personally I'm a big fan of just standardizing the task computational size and assigning static credit. no matter the device used or how long it takes. just take flops out of the equation completely. that way faster devices get more credit/RAC based on the rate in which valid tasks are returned.

the only caveat is the need to make all the tasks roughly the same "size" computationally. but that seems easier than all the hoops to jump through to accommodate all the idiosyncrasies of BOINC, various systems, and task differences.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59101 - Posted: 12 Aug 2022 | 1:45:16 UTC
Last modified: 12 Aug 2022 | 1:49:13 UTC

The latest Python tasks I've done today have awarded 105,000 credits as compared to all the previous tasks at 75,000 credits.

Looking back from the job_log, the estimated computation size has been at 1B GFLOPS for quite a while now.

Nothing has changed in the current task parameters as far as I can tell.

Estimated computation size
1,000,000,000 GFLOPs

So I assume that Abouh has decided to award more credits for the work done.

Anyone notice this new award level?

They are generally taking longer to crunch than the previous ones, so maybe it is just scaling.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59102 - Posted: 12 Aug 2022 | 1:50:33 UTC - in response to Message 59101.

Anyone notice this new award level?

I just got my first one.
http://www.gpugrid.net/workunit.php?wuid=27270757

But not all the new ones receive that. A subsequent one received the usual 75,000 credit.

Keith Myers
Message 59104 - Posted: 12 Aug 2022 | 3:31:18 UTC - in response to Message 59102.

Thanks for your report. It doesn't really track with scaling now that I examine my tasks.

Some are getting the new higher reward for 2 hours of computation but some are still getting the lower reward for 8 hours of computation.

I was getting what was the standard reward for tasks taking as little as 20 minutes of computation time. So the 75K was a little excessive in my opinion.

These new ones are trending at 2-3 hours of computation time. But I also had one take 11 hours and was still rewarded with only the 105K.

Maybe we are finally getting into the meat of the AI/ML investigation after all the initial training we have been doing.

Still sitting on 3 new acemd3 tasks that haven't been looked at for two days and will only get the standard reward since the client scheduler feels no need to push them to the front since their APR and estimated completion times are correct and reasonable. Really would like to get the Python tasks to get realistic APR's and estimated completion times. But since they are predominately a cpu task with a little bit of gpu computation, BOINC has no clue how to handle them.

Maybe Abouh can post some insight as to what the current investigation is doing.

Richard Haselgrove
Message 59105 - Posted: 12 Aug 2022 | 6:30:00 UTC - in response to Message 59104.

My first 'high rate' task (105K credits) was a workunit created at 10 Aug 2022 | 2:03:51 UTC.

Since then, I've only received one 75K task: my copy was issued to me at 10 Aug 2022 | 21:15:47 UTC, but the underlying workunit was created at 9 Aug 2022 | 13:44:09 UTC - I got a resend after two previous failures by other crunchers.

My take is that the 'tariff' for GPUGrid tasks is set when the underlying workunit is created, and all subsequent tasks issued from that workunit inherit the same value.

Keith Myers
Message 59107 - Posted: 12 Aug 2022 | 15:28:04 UTC - in response to Message 59105.

That implies the current release candidates are being assigned 105K credit based I assume on harder to crunch datasets.

Don't think it depends on a recent release date either. I just had a 12 August _0 created task and it only awarded 75K after passing through one other before I got it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,928,481,630
RAC: 4,906,648
Level
Trp
Scientific publications
watwatwat
Message 59109 - Posted: 13 Aug 2022 | 17:36:27 UTC
Last modified: 13 Aug 2022 | 17:40:02 UTC

Which apps are running these days? The apps page is missing the column that shows how much is running: https://www.gpugrid.net/apps.php
How many CPU threads do I need to run to finish Python WUs in a reasonable time for say an i9-9980XE?
Trying to update my app_config to give it a go. The last one I found was pretty old. Here's what I've cobbled together. Suggestions welcome.

<app_config>
  <!-- i9-10980XE 18c36t 32 GB L3 Cache 24.75 MB -->
  <app>
    <name>acemd3</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact/>
  </app>
  <app>
    <name>acemd4</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact/>
  </app>
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>4.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>PythonGPUbeta</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>4.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Python</name>
    <plan_class>cuda1121</plan_class>
    <cpu_usage>4</cpu_usage>
    <gpu_versions>
      <cpu_usage>4</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

Keith Myers
Message 59110 - Posted: 13 Aug 2022 | 19:32:17 UTC

I get away with only reserving 3 cpu threads. That does not impact or affect what the actual task does when it runs. Just BOINC cpu scheduling for other projects.

It will always spawn 32 independent python processes when running.

And you really should update or remove the plan class statements for Python on GPU since your plan_class is incorrect.

Current plan_class is cuda1131 NOT cuda1121

You also can clean up your app_config as there only is PythonGPU application. No Python or PythonGPUBeta application.
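
One way to act on these suggestions is to generate a trimmed file rather than edit the old one by hand. The sketch below builds an app_config with only the PythonGPU application, reserves 3 CPU threads as described above, and leaves the plan class statements out entirely (the "remove" option). The helper name and the choice to print rather than write the file are illustrative; check the result against the BOINC client configuration documentation before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(cpu_threads=3):
    """Build a minimal app_config tree for the PythonGPU application only."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "PythonGPU"
    ET.SubElement(app, "max_concurrent").text = "1"
    ET.SubElement(app, "fraction_done_exact")

    # CPU/GPU reservation used by the BOINC scheduler for GPU tasks.
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = "1.0"
    ET.SubElement(gpu, "cpu_usage").text = f"{cpu_threads:.1f}"
    return root


root = build_app_config(cpu_threads=3)
if hasattr(ET, "indent"):  # pretty-printing helper, Python 3.9+
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

As noted in the post, the cpu_usage value only affects how BOINC schedules other work; it does not change how many processes the task itself spawns.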

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59111 - Posted: 14 Aug 2022 | 22:20:08 UTC
Last modified: 14 Aug 2022 | 22:23:14 UTC

Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once, and all tasks I received crashed almost immediately after start.
I was surprised that at WU's starting, limit on Virtual memory(Commit Charge) in the system was reached.
Today I tried to understand the problem in more detail and was surprised again to find that application addresses ~ 42 GiB Virtual Memory in total!
At the same time, the total consumption of Physical Memory is about 4 times less (~ 10 GiB).
For example

So the question is - is that intended?..

I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_-

Keith Myers
Message 59112 - Posted: 15 Aug 2022 | 2:29:50 UTC - in response to Message 59111.
Last modified: 15 Aug 2022 | 2:37:41 UTC

Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once, and all tasks I received crashed almost immediately after start.
I was surprised that at WU's starting, limit on Virtual memory(Commit Charge) in the system was reached.
Today I tried to understand the problem in more detail and was surprised again to find that application addresses ~ 42 GiB Virtual Memory in total!
At the same time, the total consumption of Physical Memory is about 4 times less (~ 10 GiB).
For example

So the question is - is that intended?..

I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_-


Yes, because of flaws in Windows memory management, that effect cannot be gotten around. You need to increase the size of your pagefile to the 50GB range to be safe.

Linux does not have the problem and no changes are necessary to run the tasks.
The project primarily develops Linux applications first as the development process is simpler. Then they tackle the difficulties of developing a Windows application with all the necessary workarounds.

Just the way it is. For the reason why, read this post:
https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

[CSF] Aleksey Belkov
Message 59113 - Posted: 15 Aug 2022 | 3:04:54 UTC - in response to Message 59112.
Last modified: 15 Aug 2022 | 3:21:22 UTC

Thank you for clarification.
I was not familiar with subtleties of the memory allocation mechanism in Windows.
That was useful.
And I have already increased swap to match RAM (64GB) to be sure ;)


Upd.
And the reward system for this app clearly begs for revision... : /

Keith Myers
Message 59114 - Posted: 15 Aug 2022 | 3:27:39 UTC
Last modified: 15 Aug 2022 | 3:33:03 UTC

Task credits are fixed. Pay no attention to the running times. BOINC completely mishandles that since it has no recognition of the dual nature of these cpu-gpu application tasks.

They should be thought of as primarily a cpu application with a little gpu use thrown in occasionally.

[Edit] Look at the delta between sent time and returned time to determine the actual runtime that the task took.

In your example, the first listed task took only 20 minutes to finish, the second took 4 1/2 hours and the last took 4 hours. It all depends on the different parameter sets for each task, which are the criteria for the reinforcement learning on the gpu.
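
The sent/returned delta is easy to compute from the timestamps shown on the task pages. A minimal sketch, using the two tasks quoted earlier in this thread (27246643 and 27246622) as input; the format string is an assumption based on how the site prints dates, and the result includes any time the task spent waiting on the host, so treat it as an upper bound on the real runtime.

```python
from datetime import datetime

# Timestamp layout as printed on the task pages, e.g. "2 Jul 2022 | 8:13:32 UTC".
FORUM_FORMAT = "%d %b %Y | %H:%M:%S UTC"


def turnaround(sent, returned):
    """Return the time between 'sent' and 'returned' as a timedelta."""
    start = datetime.strptime(sent, FORUM_FORMAT)
    end = datetime.strptime(returned, FORUM_FORMAT)
    return end - start


long_task = turnaround("2 Jul 2022 | 8:13:32 UTC", "3 Jul 2022 | 8:20:56 UTC")
short_task = turnaround("2 Jul 2022 | 7:55:03 UTC", "2 Jul 2022 | 8:05:39 UTC")
print("task 27246643:", long_task)
print("task 27246622:", short_task)
```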

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59115 - Posted: 15 Aug 2022 | 11:22:21 UTC

Can anyone tell me what happened to this task:
https://www.gpugrid.net/result.php?resultid=32997605

which failed after 301.281 seconds :-(((

Richard Haselgrove
Message 59116 - Posted: 15 Aug 2022 | 11:36:03 UTC - in response to Message 59115.

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's possibly the Windows swap file settings, again.

Erich56
Message 59117 - Posted: 15 Aug 2022 | 14:51:53 UTC - in response to Message 59116.

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's possibly the Windows swap file settings, again.

thanks Richard for the quick reply.
I now changed the page file size to max. 65MB.
I did it on both drives: system drive C:/ and drive F:/ (on separate SSD) on which BOINC is running.
Changing it for only one drive would probably have been okay, right? If so, which one?

Keith Myers
Message 59118 - Posted: 15 Aug 2022 | 15:45:00 UTC - in response to Message 59117.

The Windows one.

Jim1348
Message 59119 - Posted: 15 Aug 2022 | 16:10:21 UTC

I am a bit surprised that I am able to run the pythons without problem under Ubuntu 20.04.4 on a GTX 1060. It has 3GB of video memory, and uses 2.8GB thus far. And the CPU is currently running two cores (down from the previous four cores), using about 3.7GB of memory, though reserving 19 GB.

Even on Win10, my GTX 1650 Super has had no problems, though it has 4GB of memory and uses 3.6GB. But I have 32GB system memory, and for once I let Windows manage the virtual memory itself. It is reserving 42GB. I usually set it to a lower value.

Erich56
Message 59120 - Posted: 15 Aug 2022 | 16:53:08 UTC - in response to Message 59118.

The Windows one.

thx :-)

Toby Broom
Message 59141 - Posted: 20 Aug 2022 | 15:08:53 UTC

Can the CPU usage be adjusted correctly? It's fine to use a number of cores, but currently it says less than one and uses more than one.

abouh
Message 59143 - Posted: 22 Aug 2022 | 8:48:21 UTC - in response to Message 59107.
Last modified: 22 Aug 2022 | 10:44:01 UTC

Hello! sorry for the late reply

I adjusted the maximum length of some of the tasks and consequently also adjusted the credits for completing them. What I mean by that is that each one of my tasks contains an agent interacting with its environment and learning from a fixed number of total interaction steps. Previously I set that number to 25M steps. Now I increased it to 35M for some tasks and consequently also increased the reward.



This increase in the number of steps does not necessarily increase the completion time of the task, because if an agent discovers something relevant before reaching the maximum number of steps, the task ends and the “new information” is sent back to be shared with the other agents in the population. Whether that happens or not is random, but on average the task completion time will increase a bit due to the ones that reach 35M steps, so the reward has to increase as well. This change does not affect hardware requirements.

This randomness also explains why some tasks are shorter but still receive the same reward (credits per task are fixed). However, the average credit reward should be similar for all hosts as they solve more and more tasks. Also the average task completion time should remain stable.

As I have mentioned, I work with populations of AI agents that try to cooperatively solve a single complex problem. Note that as more things are discovered by agents in a population the harder it becomes to keep discovering new ones. In general, early tasks in an experiment return quite fast, while as the experiment progresses the 35M steps mark gets hit more and more often (and tasks take longer to complete).
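The step cap with early return described above can be pictured with a toy model. This is only a sketch under stated assumptions: `run_task`, the per-step discovery probability and the scaled-down caps (25 and 35 standing in for 25M and 35M) are all hypothetical and are not taken from the project code.

```python
import random

def run_task(max_steps, discovery_prob, rng):
    """Toy task: return (steps_used, found_something)."""
    for step in range(1, max_steps + 1):
        # Hypothetical: every interaction step has a small chance of a discovery.
        if rng.random() < discovery_prob:
            return step, True      # early return: the finding is reported
    return max_steps, False        # the cap was reached without a discovery

def mean_steps(max_steps, discovery_prob, n_tasks=2000):
    total = 0
    for task_id in range(n_tasks):
        rng = random.Random(task_id)   # same random stream per task for both caps
        steps, _ = run_task(max_steps, discovery_prob, rng)
        total += steps
    return total / n_tasks

mean_25 = mean_steps(25, 0.02)
mean_35 = mean_steps(35, 0.02)
print(f"mean steps with cap 25: {mean_25:.1f}, with cap 35: {mean_35:.1f}")
```

In this toy model only the tasks that hit the cap take longer when the cap is raised, so the mean rises moderately. Lowering `discovery_prob` (discoveries becoming harder as an experiment progresses) makes more tasks run to the cap.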
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59144 - Posted: 22 Aug 2022 | 10:16:58 UTC - in response to Message 59076.
Last modified: 22 Aug 2022 | 10:24:44 UTC

The current value of rsc_fpops_est is 1e18, with 10e18 as the limit. I remember we had to increase it because otherwise it produced false “task aborted by host” errors on some users' side. Do you think we should change it again?
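For readers wondering how these two numbers turn into time: as far as I understand, the client divides the flop counts by the speed it assumes for the app version, so the estimate and the abort threshold scale together. The sketch below assumes the "limit" is the workunit's rsc_fpops_bound and uses a made-up speed of 5 TFLOPS. The real client applies further corrections, so treat the figures as illustrative only.

```python
def seconds_for(fpops, flops_per_sec):
    """Time needed to perform `fpops` operations at a given speed."""
    return fpops / flops_per_sec

RSC_FPOPS_EST = 1e18     # value quoted in the post
RSC_FPOPS_BOUND = 10e18  # the quoted limit (assumed to be rsc_fpops_bound)
ASSUMED_FLOPS = 5e12     # hypothetical speed the client assumes for this host

estimate_h = seconds_for(RSC_FPOPS_EST, ASSUMED_FLOPS) / 3600
limit_h = seconds_for(RSC_FPOPS_BOUND, ASSUMED_FLOPS) / 3600
print(f"estimated runtime: {estimate_h:.1f} h, abort threshold: {limit_h:.1f} h")
```

With a lower assumed speed both numbers grow, and with a higher one both shrink, which is why a bound that is too tight can abort tasks on hosts whose speed is overestimated.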

Regarding cpu_usage, I remember having this discussion with Toni, and I think the reason we set the number of cores to that value is that the jobs can actually be executed with a single core, even if they create 32 threads. They definitely do not require 32 cores. Is there an advantage to setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59145 - Posted: 22 Aug 2022 | 10:56:47 UTC - in response to Message 59144.
Last modified: 22 Aug 2022 | 10:57:35 UTC

Regarding cpu_usage, I remember having this discussion with Toni, and I think the reason we set the number of cores to that value is that the jobs can actually be executed with a single core, even if they create 32 threads. They definitely do not require 32 cores. Is there an advantage to setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...

This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.

I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.

The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.

I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.

I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59152 - Posted: 23 Aug 2022 | 18:13:17 UTC - in response to Message 59145.
Last modified: 23 Aug 2022 | 18:20:18 UTC

Regarding cpu_usage, I remember having this discussion with Toni, and I think the reason we set the number of cores to that value is that the jobs can actually be executed with a single core, even if they create 32 threads. They definitely do not require 32 cores. Is there an advantage to setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...

This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.

I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.

The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.

I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.

I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.


Could you tell us a bit more about this manual override? Just now it is sprawled over five cores, ten threads. If it sees the sixth core free, it grabs that one also.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59153 - Posted: 23 Aug 2022 | 19:39:42 UTC - in response to Message 59152.

If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.

Just create an app_config.xml file, place it in the GPUGrid project directory, and either re-read config files from the Manager or just restart BOINC.

The file minimally just needs this:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

This will tell the BOINC client not to overcommit the cpu to other projects, as the Python app gets 3 cores reserved for its use.

I have found that to be plenty, even when running 95% of all cpu cores on 3 other cpu projects along with 2 other gpu projects which also use some or all of a cpu core to process the gpu task.
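If you would rather generate the file than type it, here is a small sketch using only the Python standard library. The helper names `build_app_config` and `cores_left_for_other_work` are made up for illustration, and the core budget is a simplified picture of what the scheduler reserves, not BOINC's actual algorithm.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml content for a single GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")

def cores_left_for_other_work(total_cores, cpu_usage, running_tasks=1):
    """Simplified budget: cores not reserved by the running Python tasks."""
    return max(0.0, total_cores - cpu_usage * running_tasks)

xml_text = build_app_config("PythonGPU", 1.0, 3.0)
print(xml_text)
print(cores_left_for_other_work(total_cores=16, cpu_usage=3.0))
```

Write `xml_text` to app_config.xml in the GPUGrid project directory and re-read the config files from the Manager as described above.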

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59154 - Posted: 24 Aug 2022 | 7:15:27 UTC - in response to Message 59153.

If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.

Just create an app_config.xml file, place it in the GPUGrid project directory, and either re-read config files from the Manager or just restart BOINC.

The file minimally just needs this:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

This will tell the BOINC client not to overcommit the cpu to other projects, as the Python app gets 3 cores reserved for its use.

I have found that to be plenty, even when running 95% of all cpu cores on 3 other cpu projects along with 2 other gpu projects which also use some or all of a cpu core to process the gpu task.


Thank you, Keith. Why is it using so many cores? Is it something like OpenIFS on CPDN?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59155 - Posted: 24 Aug 2022 | 8:04:26 UTC - in response to Message 59154.

Thank you, Keith. Why is it using so many cores? Is it something like OpenIFS on CPDN?

Yes - or nbody at MilkyWay. This Python task shares characteristics of a cuda (GPU) plan class, and a MT (multithreaded) plan class, and works best if treated as such.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59163 - Posted: 25 Aug 2022 | 10:12:52 UTC

Possible bad workunit: 27278732

ValueError: Expected value argument (Tensor of shape (1024,)) to be within the support (IntegerInterval(lower_bound=0, upper_bound=17)) of the distribution Categorical(logits: torch.Size([1024, 18])), but found invalid values:
tensor([ 7, 9, 7, ..., 10, 9, 3], device='cuda:0')

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59178 - Posted: 30 Aug 2022 | 7:34:11 UTC - in response to Message 59163.

Interesting, I had never seen this error before. Thank you!
____________

Toby Broom
Joined: 11 Dec 08
Posts: 25
Credit: 440,521,294
RAC: 12,337
Message 59192 - Posted: 3 Sep 2022 | 10:27:16 UTC - in response to Message 59145.

Thanks Richard, is 3 CPU cores enough to not slow down the GPU?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59203 - Posted: 8 Sep 2022 | 16:53:31 UTC

I'm noticing an interesting difference in application behavior between different systems. abouh, can you help explain the reason?

I can see that each running task will spawn 32x processes (multiprocessing.spawn) as well as [number of cores]x processes for the main run.py application.

so on my 8-core/16-thread Intel system, a single running task spawns 8x run.py processes, and 32x multiprocessing.spawn threads.

and on my 24-core/48-thread AMD EPYC system, a single running task spawns 24x run.py processes, and 32x multiprocessing.spawn threads.


What is confusing is the utilization of each thread between these systems.

the EPYC system uses ~600-800% CPU for the run.py process (~20-40% each thread)
whereas the Intel system uses ~120% CPU for the run.py process (~2-5% each thread)

I replicated the same high CPU use on another EPYC system (in a VM) where I've constrained it to the same 8-core/16-thread, and again it's using a much larger share of the CPU than the Intel system.

is the application coded in some way that will force more work to be done on more modern processors? as far as I can tell, the increased CPU use isn't making the overall task run any faster. the Intel system is just as productive with far less CPU use.

I was trying to run some python tasks on my Plex VM to let it use the GPU since Plex doesn't use it very much, but the CPU use is making it troublesome.
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59204 - Posted: 8 Sep 2022 | 17:33:16 UTC - in response to Message 59203.

or perhaps the Broadwell based Intel CPU is able to hardware accelerate some tasks that the EPYC has to do in software, leading to higher CPU use?
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59205 - Posted: 9 Sep 2022 | 6:40:53 UTC - in response to Message 59203.

The application is not coded in any specific way to force more work to be done on more modern processors.

Maybe python handles it under the hood somehow?
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59206 - Posted: 9 Sep 2022 | 12:37:59 UTC - in response to Message 59205.

Maybe python handles it under the hood somehow?


it might be related to pytorch actually. I did some more digging and it seems like AMD has worse performance due to some kind of CPU detection issue in the MKL (or maybe deliberate by Intel). do you know what version of MKL your package uses?

and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable.


____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59207 - Posted: 9 Sep 2022 | 14:45:37 UTC - in response to Message 59206.
Last modified: 9 Sep 2022 | 14:59:12 UTC



and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable.



to add: I was able to inspect your MKL version as 2019.0.4, and I tried setting the env variable by adding

os.environ["MKL_DEBUG_CPU_TYPE"] = "5"


to the run.py main program, but it had no effect. either I didn't put the command in the right place (I inserted it below line 433 in the run.py script), or the issue is something else entirely.
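On the placement question, a general Python point that may help: `os.environ` changes only affect code that reads the variable afterwards, so the assignment normally has to come before any import that loads the native library (numpy, torch and so on). Child processes started later inherit the modified environment. The sketch below demonstrates only the inheritance part with the standard library. Whether the embedded MKL actually honours the variable is a separate question it does not answer.

```python
import os
import subprocess
import sys

# Set the variable first, before importing anything that may load MKL.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

# A child interpreter started afterwards sees the same value, because
# subprocesses inherit the parent's environment by default.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('MKL_DEBUG_CPU_TYPE'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print("value seen by child process:", child_value)
```

The same inheritance applies to workers created with multiprocessing, so a line placed deep inside run.py may be too late for libraries that were already imported at the top of the script.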

edit: you also might consider compiling your scripts into binaries to prevent inquisitive minds from messing about in your program ;)
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59208 - Posted: 10 Sep 2022 | 3:15:17 UTC
Last modified: 10 Sep 2022 | 3:29:52 UTC

Should the environment variable for fixing AMD computation in the MKL library be in the task package or just in the host environment? Or both?

I would have thought the latter, as the calls the MKL library makes eventually have to be passed through to the cpu.

export MKL_DEBUG_CPU_TYPE=5

and add to your .bashrc script.

So you need to set the OS environment variable first, then read it in the Python code with os.environ["MKL_DEBUG_CPU_TYPE"]

Of course, if the embedded MKL package is a later version where the variable is now ignored, using the variable to fix the intentional hamstringing of AMD processors is a moot point.

[Edit]

Looks like there is a workaround for the Intel MKL check of whether it is running on an Intel processor: https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

So make the fake shared library and use LD_PRELOAD= to load the fake shared library

That might be the easiest method to get the math libraries to use the advanced SIMD instructions like AVX2.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59209 - Posted: 10 Sep 2022 | 4:34:43 UTC - in response to Message 59208.
Last modified: 10 Sep 2022 | 4:40:03 UTC

I didn’t explicitly state it in my previous reply. But I tried all that already and it didn’t make any difference. I even ran run.py standalone outside of BOINC to be sure that the env variable was set. Neither the env variable being set nor the fake Intel library made any difference at all.

But the embedded MKL version is actually an old one. It’s from 2019 as I mentioned before. So it should accept the debug variable. I just think now that it’s probably not the reason.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59210 - Posted: 10 Sep 2022 | 5:00:10 UTC

Ohh . . . . OK. Didn't know you had tried all the previous existing fixes.

So must be something else going on in the code I guess.

Just thought I would throw it out there in case you hadn't seen the other fixes.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59211 - Posted: 10 Sep 2022 | 6:49:18 UTC - in response to Message 59207.

I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.
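A version-dependent switch of that kind could look roughly like the sketch below. The threshold (2020.0) and the version string (2019.0.4) are the ones mentioned earlier in the thread. The helper names are hypothetical, and how run.py would obtain the MKL version is left open here.

```python
import os

def parse_version(text):
    """Turn a dotted version string such as '2019.0.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def maybe_set_mkl_debug(mkl_version, env=os.environ):
    """Set MKL_DEBUG_CPU_TYPE=5 only for MKL versions up to and including 2020.0."""
    if parse_version(mkl_version) <= (2020, 0):
        env["MKL_DEBUG_CPU_TYPE"] = "5"
        return True
    return False   # newer MKL releases are reported to ignore the variable

demo_env = {}
applied = maybe_set_mkl_debug("2019.0.4", demo_env)
print(applied, demo_env)
```

Passing a plain dict as `env` keeps the example free of side effects; in a real script the default `os.environ` would be used, before the first import that loads MKL.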

No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)

____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59212 - Posted: 10 Sep 2022 | 7:48:33 UTC

Don't know if the math functions being used by the Python libraries are any higher than SSE2 or not.

But if they are, the MKL library functions fall back to SSE2 only whenever the library is called and detects a non-Intel cpu.

Probably the only way to know for sure is to examine the code and see whether it tries to run any SIMD instruction higher than SSE2, then implement the fix and see if the computations on the cpu are sped up.

Depending on the math function being called, the speedup with the fix in place can be orders of magnitude faster.

Based on Ian's experiment running on his Intel host, the lower cpu usage didn't make the tasks run any faster.

But less cpu usage per task (when the tasks run the same with either hi or lo cpu usage) would be beneficial when also running other cpu tasks, since the Python tasks then aren't taking resources away from those processes.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59213 - Posted: 10 Sep 2022 | 11:46:05 UTC - in response to Message 59211.

I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.

No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)


Was my location for the variable in the script right or appropriate? I inserted it below line 433. Does the script inherit the OS variables already? Just wanted to make sure I had it set properly. I figured the script runs in its own environment outside of BOINC (in Python). That’s why I tried adding it to the script.
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59214 - Posted: 10 Sep 2022 | 11:51:57 UTC - in response to Message 59212.


Based on Ian's experiment running on his Intel host, the lower cpu usage didn't make the tasks run any faster.

But less cpu usage per task (when the tasks run the same with either hi or lo cpu usage) would be beneficial when also running other cpu tasks, since the Python tasks then aren't taking resources away from those processes.


It’s hard to say whether it’s faster or not since it’s not a true apples to apples comparison. So far it feels not faster, but that’s against different CPUs and different GPUs. Maybe my EPYC system seems similarly fast because the EPYC is just brute forcing it. It has much higher IPC than the old Broadwell based Intel.

____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59215 - Posted: 10 Sep 2022 | 17:57:37 UTC

One of my machines started a Python task yesterday evening and finished it after about 24-1/2 hours.
How come a runtime (and CPU time) of 1,354,433.00 secs (=376 hrs) is shown?

https://www.gpugrid.net/result.php?resultid=33030599

As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs).

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59216 - Posted: 10 Sep 2022 | 18:11:25 UTC - in response to Message 59215.

One of my machines started a Python task yesterday evening and finished it after about 24-1/2 hours.
How come a runtime (and CPU time) of 1,354,433.00 secs (=376 hrs) is shown?

https://www.gpugrid.net/result.php?resultid=33030599

As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs).


The calculated runtime is using the cpu time. Has been mentioned many times. It’s because more than one core was being used. So the sum of each core’s cpu time is what’s shown.

You did get 48hr bonus of 25%. Base credit is 70,000. You got 87,500 (+25%). Less than 24hrs gets +50% for 105,000.
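The arithmetic, spelled out as a small sketch. The base credit and bonus percentages are the ones quoted above. Treating the thresholds as strict "less than" comparisons is my assumption, and `credit_for` is just an illustrative name.

```python
BASE_CREDIT = 70_000

def credit_for(hours_to_return):
    """Credit for a task returned after the given number of hours."""
    if hours_to_return < 24:
        bonus_pct = 50
    elif hours_to_return < 48:
        bonus_pct = 25
    else:
        bonus_pct = 0
    return BASE_CREDIT * (100 + bonus_pct) // 100

# The task in question: about 24.5 hours of wall-clock time.
print(credit_for(24.5))

# The displayed "runtime" is summed CPU time across all cores, not wall-clock time.
cpu_hours = 1_354_433 / 3600
print(f"{cpu_hours:.1f} CPU hours, i.e. about {cpu_hours / 24.5:.1f} busy cores on average")
```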
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59217 - Posted: 10 Sep 2022 | 21:14:39 UTC

GPUGRID seems to have problems with figures, at least where Python is concerned :-(
I just wanted to download a new Python task. On my Ramdisk there is about 59GB free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59218 - Posted: 10 Sep 2022 | 23:34:03 UTC - in response to Message 59217.
Last modified: 10 Sep 2022 | 23:36:01 UTC

GPUGRID seems to have problems with figures, at least where Python is concerned :-(
I just wanted to download a new Python task. On my Ramdisk there is about 59GB free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come?


probably due to your allocation of disk usage in BOINC. go into the compute preferences and allow BOINC to use more disk space. by default I think it is set to 50% of the disk drive. you might need to increase that.

Options-> Computing Preferences...
Disk and Memory tab

and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.
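To illustrate "the most restrictive of the 3 limits", here is a simplified model. It assumes the three preferences are "use no more than X GB", "leave at least X GB free" and "use no more than X% of total". The function name and the example numbers are hypothetical, and the real client's accounting may differ in detail.

```python
def boinc_disk_headroom_gb(total_gb, free_gb, boinc_used_gb,
                           max_used_gb, min_free_gb, max_used_pct):
    """Simplified model: space BOINC may still claim under the three limits."""
    allowed_total = min(
        max_used_gb,                           # "use no more than X GB"
        total_gb * max_used_pct / 100.0,       # "use no more than X% of total"
        boinc_used_gb + free_gb - min_free_gb, # "leave at least X GB free"
    )
    return max(0.0, allowed_total - boinc_used_gb)

# Hypothetical host: 128 GB disk, 55 GB free, BOINC already using 73 GB.
headroom = boinc_disk_headroom_gb(
    total_gb=128, free_gb=55, boinc_used_gb=73,
    max_used_gb=1000, min_free_gb=0.1, max_used_pct=90,
)
print(f"{headroom:.1f} GB still usable by BOINC")
```

In this made-up case the percentage limit is the binding one, so the headroom is smaller than the free space the operating system reports.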
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59221 - Posted: 11 Sep 2022 | 4:50:08 UTC - in response to Message 59218.


probably due to your allocation of disk usage in BOINC. go into the compute preferences and allow BOINC to use more disk space. by default I think it is set to 50% of the disk drive. you might need to increase that.

Options-> Computing Preferences...
Disk and Memory tab

and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.

no, it isn't that.
I am aware of these settings. Since nothing other than BOINC is running on this computer, disk and RAM usage are set to 90% for BOINC.
So, when I have some 58GB free on a 128GB RAM disk (with some 60GB free system RAM), it should normally be no problem for Python to be downloaded and processed.
On another machine, I have far fewer resources, and it works.
So no idea what the problem is in this case ... :-(

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59222 - Posted: 11 Sep 2022 | 6:12:13 UTC

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59223 - Posted: 11 Sep 2022 | 6:42:55 UTC - in response to Message 59222.

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59224 - Posted: 11 Sep 2022 | 6:56:19 UTC

another question -

I think I read something concerning this topic somewhere here, but I cannot find the posting any more (maybe though I am mistaken):

Is there a possibility to limit (by app_config.xml) the number of CPU cores Python is using?
The reason I am asking is that on the machine where Python can be downloaded, I also have another (non-GPU) project running, and when Python fills up the available cores, the CPU runs at 100%, which slows things down and also heats up the CPU much more.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59225 - Posted: 11 Sep 2022 | 8:05:35 UTC
Last modified: 11 Sep 2022 | 8:06:55 UTC

No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.

If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks.

All you can do to make the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through an app_config.xml file in the project directory.

Like this:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59226 - Posted: 11 Sep 2022 | 12:21:35 UTC - in response to Message 59225.

...
All you can do to make the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through an app_config.xml file in the project directory.
Like this: ...

thanks, Keith, for your explanation.

Well, I actually would not need to put in this app_config.xml, as in my case the other BOINC tasks don't just assign any number of CPU cores by themselves. I tell each of these projects by a separate app_config.xml how many cores to use (which I was, in fact, also hoping for with Python).
So I have no other choice than to live with the situation as is :-(

What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]).

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59228 - Posted: 11 Sep 2022 | 15:27:14 UTC - in response to Message 59223.

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB


Now I tried once more to download a Python task on my system with a 128GB Ramdisk (plus 128GB system RAM).
The BOINC event log says:

Python apps for GPU hosts needs 4590.46MB more disk space. You currently have 28788.14 MB available and it needs 33378.60 MB.
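A quick check shows that the three figures in this message are at least consistent with each other (needed minus available equals the reported shortfall). The sketch simply parses the numbers out of the quoted log line; the regular expression is my own and is not part of BOINC.

```python
import re

message = (
    "Python apps for GPU hosts needs 4590.46MB more disk space. "
    "You currently have 28788.14 MB available and it needs 33378.60 MB."
)

# Pull out every "<number> MB" figure in order of appearance.
shortfall_mb, available_mb, needed_mb = (
    float(value) for value in re.findall(r"(\d+\.\d+)\s*MB", message)
)

computed_shortfall = round(needed_mb - available_mb, 2)
print(computed_shortfall, shortfall_mb)
```

So the open question is not the subtraction but why the client counts only about 28.8 GB as available to it.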

Somehow though all this does not fit together: in reality, the Ramdisk is filled with 73GB and has 55GB available.
Further, I am questioning whether Python indeed needs 33,378 MB of free disk space for downloading.

I am really frustrated that this does not work :-(

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59229 - Posted: 11 Sep 2022 | 15:30:21 UTC - in response to Message 59226.
Last modified: 11 Sep 2022 | 15:33:39 UTC

...
All you can do to make the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through an app_config.xml file in the project directory.
Like this: ...

thanks, Keith, for your explanation.

Well, I actually would not need to put in this app_config.xml, as in my case the other BOINC tasks don't just assign any number of CPU cores by themselves. I tell each of these projects by a separate app_config.xml how many cores to use (which I was, in fact, also hoping for with Python).
So I have no other choice than to live with the situation as is :-(

What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]).


You are not understanding the nature of the Python tasks. They are not using all your cores. They are not using 32 cores. They are using 32 spawned processes.

A process is NOT a core.

The Python tasks use from 100-300% of a cpu core depending on the speed of the host and the number of cores in the host.

That is why I offered the app_config.xml file to allot 3 cpu cores to each Python task for BOINC scheduling purposes. And you can have many app_config.xml files in play among all your projects, as an app_config file is specific to each project and is placed into that project's folder. You certainly can use one for scheduling help for GPUGrid.

An app_config file does not control the number of cores a task uses. That is dependent solely on the science application. A task will use as many or as few cores as needed.

The only exception to that fact is in the special case of plan_class MT like the cpu tasks at Milkyway. Then BOINC has an actual control parameter --nthreads that can specifically set the number of cores allowed in the MT plan_class task.

That cannot be used here because the Python tasks are not a simple cpu only MT type task. They are something completely different and something that BOINC does not know how to handle. They are a dual cpu-gpu combination task where the majority of computation is done on a cpu with bursts of activity on a gpu and then computation repeats that action.

It would take a major rewrite of core BOINC code to properly handle this type of combined machine-learning / reinforcement-learning task. Unless BOINC attracts new developers who are willing to tackle this major development hurdle, the best we can do is just accommodate these tasks through other host controls.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59230 - Posted: 11 Sep 2022 | 15:40:07 UTC

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59231 - Posted: 11 Sep 2022 | 16:47:19 UTC - in response to Message 59230.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?

Ian&Steve C.
Message 59232 - Posted: 11 Sep 2022 | 17:19:36 UTC - in response to Message 58980.

The reason Reinforcement Learning agents do not currently use the whole potential of the cards is because the interactions between the AI agent and the simulated environment are performed on CPU while the agent "learning" process is the one that uses the GPU intermittently.

There are, however, environments that only use GPU. They are becoming more and more common, so I see it as a real possibility that in the future most popular benchmarks of the field use only GPU. Then the jobs will be much more efficient since pretty much only GPU will be used. Unfortunately we are not there yet...


a suggestion for whenever you're able to move to pure GPU work. PLEASE look into and enable "automatic mixed precision" in your code.

https://pytorch.org/docs/stable/notes/amp_examples.html

this should greatly benefit those devices which have Tensor cores, and speed things up.

____________

Keith Myers
Message 59233 - Posted: 11 Sep 2022 | 18:48:40 UTC - in response to Message 59231.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?

I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed.

Erich56
Message 59234 - Posted: 11 Sep 2022 | 20:06:29 UTC - in response to Message 59233.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?

I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed.

I just noticed the same problem with Rosetta Python tasks. So this may be in some kind of relation with the Python architecture.
Also in the Rosetta case, the actual disk space available was significantly higher than Rosetta said it would need.
So I don't believe that this has anything to do with the fixed ram disk size. What is the logic behind your assumption?

Keith Myers
Message 59235 - Posted: 12 Sep 2022 | 0:58:35 UTC - in response to Message 59234.

If you read through the various posts, including mine, or investigate the issues with Pytorch on Windows, you will see that it is because of how Windows handles the reservation of memory addresses compared to how Linux handles it.

The Pytorch libraries, when downloaded and expanded, ask for many gigabytes of memory. Windows has to set aside every bit of memory space that the application asks for, whether it will be needed or not. Linux does not have to do this, since it handles memory allocation dynamically.

And since every Python task is likely different, the previous Pytorch libraries are probably not reused, so every task needs to get all of its configured resources every time a new task is executed.

So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.

Read this explanation please for the actual particulars of the problem with Windows. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

Erich56
Message 59236 - Posted: 12 Sep 2022 | 6:54:49 UTC - in response to Message 59235.

So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.

thanks for the hint, I will adapt the page file size accordingly and see what happens.

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59237 - Posted: 12 Sep 2022 | 14:43:46 UTC - in response to Message 59213.

Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys

"""
if __name__ == "__main__":

    import sys
    sys.stderr.write("Starting!!\n")
    import os

    os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

    import platform
"""

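The reason for setting the variable this early is that libraries which read an environment variable usually do so once, when they are first loaded. A minimal sketch of that ordering follows. The helper name set_env_before_heavy_imports and the list of library names it checks are illustrative assumptions, not part of run.py, and whether MKL_DEBUG_CPU_TYPE has any effect depends on the MKL version in use.

```python
import os
import sys


def set_env_before_heavy_imports(settings, heavy_modules=("numpy", "torch")):
    """Apply environment settings and report which heavy modules were already imported.

    `heavy_modules` is a hypothetical list used only for this illustration.
    """
    already_loaded = [name for name in heavy_modules if name in sys.modules]
    for key, value in settings.items():
        os.environ[key] = value
    return already_loaded


loaded = set_env_before_heavy_imports({"MKL_DEBUG_CPU_TYPE": "5"})
if loaded:
    # The setting may come too late for anything listed here.
    sys.stderr.write("Imported before the variable was set: %s\n" % ", ".join(loaded))

# Only now import the numerical libraries (for example numpy or torch).
```

If the warning prints, the variable was set after the library had already been loaded, which is the situation the placement advice above tries to avoid.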

____________

Ian&Steve C.
Message 59238 - Posted: 12 Sep 2022 | 14:58:40 UTC - in response to Message 59237.
Last modified: 12 Sep 2022 | 15:33:37 UTC

Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys

"""
if __name__ == "__main__":

    import sys
    sys.stderr.write("Starting!!\n")
    import os

    os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

    import platform
"""



thanks :) I'll try anyway

edit - nope, no different.
____________

Ian&Steve C.
Message 59239 - Posted: 12 Sep 2022 | 15:35:31 UTC - in response to Message 59237.
Last modified: 12 Sep 2022 | 15:37:58 UTC

really unfortunate to use so much more resources on AMD than Intel. It's something about the multithreaded nature of the main run.py process itself. on intel it uses about 2-5% per process, and more run.py processes spin up the more cores you have. with AMD, it uses like 20-40% per process, so with high core count CPUs, that makes total CPU utilization crazy high.

here is what it looks like running 4x python tasks (2 GPUs, 2 tasks each) on an intel 8-core, 16-thread system. what you're seeing is the 4 main run.py processes and their multithreaded components. notice that the total CPU used by each main process is a little more than 100%, this equates to a full thread for each process.


now here is what it looks like running only 2x python tasks (1 GPU, 2 tasks each) on an AMD EPYC system with 24-cores, 48-threads. you can see the main run.py multithread components each using 20-40%, and each main process cumulatively using 600-800% CPU. that's 6-8 whole threads occupied for a single process, making it roughly 6-8x more resource intensive to run on AMD than Intel.


I even swapped my 8c/16t intel CPU for a 16c/32t one, and while it spun up more multithread components for the main run.py, each one was still only 2-5% used, making it only about 150% CPU used from each main process. something definitely weird going on with these tasks between AMD and Intel

the CPU used by the 32x multiprocessing.spawns is about the same between intel and AMD. it's only the threads that stem from the main run.py process that's showing this huge difference.
____________

Diplomat
Message 59240 - Posted: 12 Sep 2022 | 15:57:00 UTC - in response to Message 59225.

No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.

If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks.

All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through a app_config.xml file in the project directory.

Like this:

<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>
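
For anyone who prefers to generate the file rather than type it by hand, here is a minimal sketch that builds the same structure with Python's standard library. The function name build_app_config is a hypothetical helper; the element names and values are the ones shown above. The output is a single unindented line, which is still valid XML.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml content for one GPU application as a string."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


xml_text = build_app_config("PythonGPU", 1.0, 3.0)
print(xml_text)
```

The resulting text would be saved as app_config.xml in the project directory, as described above.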


does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%

Ian&Steve C.
Message 59241 - Posted: 12 Sep 2022 | 16:01:23 UTC - in response to Message 59240.

does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%


not directly. but if your GPU is being bottlenecked by not enough CPU resources then it could help.

the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.

____________

gemini8
Message 59248 - Posted: 13 Sep 2022 | 7:48:09 UTC - in response to Message 59241.

Hi everyone.

the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.

I'm thinking about putting every other Boinc CPU work into a VM instead of running it directly on the host.
You could have a VM using only 90 per cent of processing power through the VM settings.
This would leave the rest for the Python stuff, so on a sixteen-thread CPU it could use 160% of one thread's power or 10% of the CPU.
If this wasn't enough the VM could be adjusted to only using eighty per cent (320% of one thread's power or 20% of the CPU for the Python work) and so on.
Repeat (adjust and try) until the machine runs fine.
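
The arithmetic behind those numbers, as a small sketch. The function name leftover_cpu is made up for illustration, and it assumes the VM cap applies evenly across all threads.

```python
def leftover_cpu(threads, vm_cap_percent):
    """Return (percent of the whole CPU, percent of one thread) left outside the VM."""
    cpu_percent = 100 - vm_cap_percent
    one_thread_percent = cpu_percent * threads
    return cpu_percent, one_thread_percent


for cap in (90, 80):
    cpu, thread = leftover_cpu(16, cap)
    print(f"VM capped at {cap}%: {cpu}% of the CPU, {thread}% of one thread")
```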

Plus, you could run other GPU stuff on your GPU to have it fully utilized which should prevent high temperature variations which I see as unnecessary stress for a GPU.
MilkyWay has a small VRAM footprint and doesn't use a full GPU, and maybe I'll try WCG OPNG as well.
____________
- - - - - - - - - -
Greetings, Jens

Erich56
Message 59251 - Posted: 13 Sep 2022 | 19:24:52 UTC - in response to Message 59248.

... and maybe I'll try WCG OPNG as well.

forget about WCG OPNG for the time being. Most of the time no tasks available; and if tasks are available for a short period of time, it's extremely hard to get them downloaded. The downloads get stuck most of the time, and only manual intervention helps.

Erich56
Message 59254 - Posted: 14 Sep 2022 | 18:08:39 UTC

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

Keith Myers
Message 59255 - Posted: 14 Sep 2022 | 19:56:05 UTC - in response to Message 59254.
Last modified: 14 Sep 2022 | 20:18:46 UTC

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

They save checkpoints well which are replayed to get the task back to the point in progress it was at before interruption.

Just be advised that the replay process takes a few minutes after restart. The task will show a 2% completion percentage upon restart but will eventually jump back to the progress point it was at and continue calculating until the end.

Just be patient and let the task run.

Jim1348
Message 59259 - Posted: 15 Sep 2022 | 11:42:38 UTC - in response to Message 59255.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Keith Myers
Message 59260 - Posted: 15 Sep 2022 | 15:59:48 UTC - in response to Message 59259.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59261 - Posted: 16 Sep 2022 | 8:03:30 UTC - in response to Message 59259.
Last modified: 16 Sep 2022 | 8:09:53 UTC

The restart is supposed to work fine on Windows as well. Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?

Is there anyone for whom the Windows checkpointing works fine? I tested locally and it worked.
____________

Jim1348
Message 59262 - Posted: 16 Sep 2022 | 15:48:36 UTC - in response to Message 59261.
Last modified: 16 Sep 2022 | 16:36:09 UTC

Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?

I can pause and restart them with no problem. The error occurred only on reboot.
But I think I have found it. I was using a large write cache, PrimoCache, set with an 8 GB cache size and 1 hour latency. By disabling that, I am able to reboot without a problem. So there was probably a delay in flushing the cache on reboot that caused the error.

But I used the write cache to protect my SSD, since I was seeing writes of around 370 GB a day, too much for me. But this time I am seeing only 200 GB/day. That is still a lot, but not fatal for some time. It seems that the work units vary in how much they will write. I will monitor it.

I use SsdReady to monitor the writes to disk; the free version is OK.

PS - I can set PrimoCache to only a 1 GB write-cache size with a 5 minute latency, and it reboots without a problem. Whether that is good enough to protect the SSD will have to be determined by monitoring the actual writes to disk. PrimoCache gives a measure of that. (SsdReady gives the OS writes, but not the actual writes to disk.)

PPS: I should point out that the reason a write cache can cut down on the writes to disk is the nature of scientific algorithms. Much of the time, they read from a location, do a calculation, and then write back to the same location. The cache can then store those writes and only write to disk the changes that remain at the end of the flush period. If you have a large enough cache, and set the write-delay to infinite, you essentially have a ramdisk. But the cache can be good enough, with less memory than a ramdisk would require. (And now it seems that 2 GB and 10 minutes works OK.)
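
To illustrate the effect, here is a toy simulation of a write-back cache. It is only a sketch: the workload (1000 writes spread over 4 locations) and the flush interval counted in operations rather than minutes are invented for the example, and it says nothing about how PrimoCache is actually implemented.

```python
def count_disk_writes(write_addresses, flush_every):
    """Count disk writes when writes are cached and flushed every `flush_every` operations."""
    dirty = set()          # locations changed since the last flush
    disk_writes = 0
    for i, addr in enumerate(write_addresses, start=1):
        dirty.add(addr)    # repeated writes to one location collapse into one entry
        if i % flush_every == 0:
            disk_writes += len(dirty)
            dirty.clear()
    disk_writes += len(dirty)  # final flush of whatever is still pending
    return disk_writes


workload = [0, 1, 2, 3] * 250                           # 1000 writes to 4 locations
uncached = count_disk_writes(workload, flush_every=1)    # every write goes to disk
cached = count_disk_writes(workload, flush_every=100)    # only changed locations per flush
print(uncached, cached)
```

The longer the flush interval, the more repeated writes to the same location are merged before they reach the disk, which is also why a long delay leaves more data unwritten at reboot time.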

Erich56
Message 59265 - Posted: 18 Sep 2022 | 13:41:51 UTC

Question for the experts here:

One of my PCs has 2 RTX3070 inside, Pythons are running quite well.
The interesting thing is that VRAM usage of one GPU always is about 3.7GB, usage of the other always is about 4.3GB.
So with one of the GPUs I could (try to) process 2 Pythons simultaneously, with the other not (VRAM of the RTX3070 is 8GB).
Is it possible to arrange for such a setting via app_config.xml?

BTW, I know what the app_config.xml looks like for running 2 Pythons on both GPUs (<gpu_usage>0.5</gpu_usage>), but I have no idea how to configure the xml according to my wishes as outlined above.

Can anyone help?

Ian&Steve C.
Message 59266 - Posted: 18 Sep 2022 | 13:54:46 UTC - in response to Message 59265.
Last modified: 18 Sep 2022 | 14:51:19 UTC

Sorry. There is no way to configure an app_config to differentiate between devices.

You can only have different settings for different applications.

The only option, which you might not want to do, is to run two different BOINC clients on the same system. To the project this will look like two different computers, each having one GPU. Then you could configure one to run 2x and the other to run 1x.

But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more vram used because it’s running your display. a second task wont use 4.3GB again. most likely only another +3.6
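
a quick check of that estimate, as a sketch. the figures are the ones quoted in this exchange (4.3GB and 3.7GB in use, about 3.6GB for a second task, 8GB cards); the function name fits_in_vram is made up, and it assumes the full nominal 8GB is usable, which leaves very little margin on the display GPU.

```python
def fits_in_vram(current_use_gb, extra_task_gb, vram_gb):
    """Return True if one more task fits in the card's nominal VRAM."""
    return current_use_gb + extra_task_gb <= vram_gb


display_gpu = fits_in_vram(4.3, 3.6, 8.0)   # GPU that also drives the display
other_gpu = fits_in_vram(3.7, 3.6, 8.0)     # GPU without a display attached
print(display_gpu, other_gpu)
```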
____________

Erich56
Message 59267 - Posted: 18 Sep 2022 | 14:53:37 UTC - in response to Message 59266.

Sorry. There is no way to configure an app_config to differentiate between devices.

You can only have different settings for different applications.

The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.

But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more vram used because it’s running your display.


In fact, I have 2 BOINC clients on this PC; I had to set up the second one with the BOINC DataDir on the SSD, since the first one is on the 32GB Ramdisk, which would not let me download Python tasks ("not enough disk space").
However, next week I will double the RAM on this PC, from 64 to 128GB, and then I will increase the Ramdisk size to at least 64GB; this should make it possible to download Python tasks - at least that's what I hope.

So then I could run 1 Python on each of the 2 GPUs on the SSD client, and a third Python on the Ramdisk client.
The only two questions now are: how do I tell the Ramdisk client to run only 1 Python (although 2 GPUs available)? And how do I tell the Ramdisk client to choose the GPU with the lower amount of VRAM usage (i.e. the one that's NOT running the display)?

In fact, I would prefer to run 2 Pythons on the Ramdisk client and 1 Python on the SSD client; however, the question is whether I could download 2 Pythons on the 64GB Ramdisk - the only thing I could do is to try.

Ian&Steve C.
Message 59268 - Posted: 18 Sep 2022 | 15:22:30 UTC - in response to Message 59267.
Last modified: 18 Sep 2022 | 16:18:36 UTC

please read the BOINC documentation for client configuration. all of the options and what they do are in there.

https://boinc.berkeley.edu/wiki/Client_configuration

you will need to change several things to run multiple clients at the same time. you need to start them on different ports, as well as add several things to cc_config. you will also need to exclude the GPU you dont want to use from each client.

either use the <exclude_gpu> section (where BOINC can see the device but wont use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC wont see this device at all for any project)
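
a rough sketch of what those two options look like inside cc_config.xml, generated with Python's standard library. build_cc_config is a hypothetical helper, the project URL and device numbers are assumptions chosen for illustration, and the exact option layout should be checked against the documentation linked above before use.

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url=None, exclude_device=None, ignore_device=None):
    """Return cc_config.xml content with an optional exclude_gpu and/or ignore_nvidia_dev entry."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    if exclude_device is not None:
        # BOINC still sees the device but will not use it for this project.
        exclude = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(exclude, "url").text = project_url
        ET.SubElement(exclude, "device_num").text = str(exclude_device)
    if ignore_device is not None:
        # BOINC will not see this NVIDIA device at all, for any project.
        ET.SubElement(options, "ignore_nvidia_dev").text = str(ignore_device)
    return ET.tostring(root, encoding="unicode")


# Assumed values: device 0 is excluded for one project in the first variant,
# device 1 is hidden from the client entirely in the second.
exclude_xml = build_cc_config(project_url="https://www.gpugrid.net/", exclude_device=0)
ignore_xml = build_cc_config(ignore_device=1)
print(exclude_xml)
print(ignore_xml)
```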
____________

Ian&Steve C.
Message 59269 - Posted: 18 Sep 2022 | 16:26:35 UTC - in response to Message 59267.
Last modified: 18 Sep 2022 | 16:30:24 UTC

personally I would stop running the ram disk. it's just extra complication and eats up ram space that the Python tasks crave. your biggest benefit will be moving to linux, it's easily 2x faster, maybe more. I don't know how you have your systems set up, but i see your longest runtimes on your 3070 are like 24hrs. that's crazy long. are you not leaving enough CPU available? are you running other CPU work at the same time?

for comparison, I built a Linux machine dedicated to these tasks. 2x RTX 3060 and a 24-core EPYC CPU and 128GB system ram. I am not running any other work on it, only PythonGPU. to give these tasks the optimum conditions to run as fast as possible.

with 12GB of VRAM, i can run 3x per GPU and it completes tasks in about 13hrs at the longest, for an effective longest completion time of about 1 task every 4.3hrs, which means at minimum, this system with 2x GPUs (6x tasks running) completes about 11 tasks per day (1,155,000 cred) + the bonus of some tasks completing earlier. you can see that my 3060 in this system is 6x more productive than your 3070. that's an insane difference
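
the arithmetic behind those figures, as a sketch. the inputs are the numbers quoted above (2 GPUs, 3 tasks each, 13hrs worst case, 1,155,000 credit for 11 tasks); the credit per task is derived from them rather than taken from the project, and the function name is made up.

```python
def tasks_per_day(gpus, tasks_per_gpu, hours_per_task):
    """Lower bound on completed tasks per day if every task takes the longest runtime."""
    return gpus * tasks_per_gpu * 24.0 / hours_per_task


per_day = tasks_per_day(2, 3, 13.0)      # about 11 tasks per day
effective_hours = 13.0 / 3               # about 4.3 hours per task per GPU
credit_per_task = 1_155_000 / 11         # implied credit per task
print(round(per_day, 2), round(effective_hours, 1), credit_per_task)
```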

doing this uses about 80-90% of the CPU, and ~56GB of system ram. I have enough spare VRAM to add another GPU, but maybe not enough CPU power to support more than 1 more task. if I want another GPU i will probably need a more powerful (more cores) CPU.
____________

Erich56
Message 59270 - Posted: 18 Sep 2022 | 17:01:28 UTC - in response to Message 59268.

...
either use the <exclude_gpu> section (where BOINC can see the device but wont use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC wont see this device at all for any project)

thanks very much for your hints:-)

One other thing that I noticed just now when reading the stderr of the 3 Pythons that failed shortly after starting:

"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"

So the reason why the tasks crashed after a few seconds was not the too small VRAM (this would probably have come up a little later), but the lack of system RAM.
In fact, I remember that right after start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory).

Any idea how much system RAM, roughly, a Python task takes?

Erich56
Message 59271 - Posted: 18 Sep 2022 | 17:23:23 UTC - in response to Message 59270.


One other thing that I now noticed when reading the stderr of the 3 Pythons that failed short time after start:

"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"

So the reason why the tasks crashed after a few seconds was not the too small VRAM (this would probably have come up a little later), but the lack of system RAM.
In fact, I remember that right after start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory).

Any idea how much system RAM, roughly, a Python task takes?

From what I can see in the Windows Task Manager on this PC and on others running Python tasks, RAM usage of a Python can be from about 1GB to 6GB (!)
How come that it varies that much?

Ian&Steve C.
Message 59272 - Posted: 18 Sep 2022 | 17:34:44 UTC - in response to Message 59271.
Last modified: 18 Sep 2022 | 17:35:32 UTC

you should figure 7-8GB per python task. that's what it seems to use on my linux system. i would imagine it uses a little when the task starts up, then slowly increases once it gets to running full out. that might be the reason for the variance of 1GB in the beginning, and 6+GB by the time it gets to running the main program.

these tasks work in 3 phases from what i've seen

Phase 1: extraction phase. just extracting the compressed package. usually takes about 5 minutes, depending on CPU speed. uses only a single core.

Phase 2: pre-processing and/or pre-loading. uses a large % of CPU power, GPU gets intermittently used, and VRAM preloads to about 60% of what will be eventually used. (in my case, VRAM preloads about 2100MB). this also lasts about 5 mins.

Phase 3: main program. CPU use drops down, and VRAM use loads up to 100% of what is needed (in my case 3600MB per task).
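
a rough way to turn those numbers into a task count, as a sketch. it uses the figures from this post (8GB system RAM and 3600MB VRAM per task) together with an assumed host with 128GB RAM and two 12GB cards; max_tasks is a made-up helper and it ignores CPU load and whatever else the host is running, so treat the result as an upper bound.

```python
def max_tasks(system_ram_gb, ram_per_task_gb, vram_mb_per_gpu, vram_per_task_mb, gpus):
    """Upper bound on concurrent tasks, limited by system RAM and by VRAM per GPU."""
    by_ram = int(system_ram_gb // ram_per_task_gb)
    by_vram = gpus * int(vram_mb_per_gpu // vram_per_task_mb)
    return min(by_ram, by_vram)


limit = max_tasks(128, 8, 12288, 3600, 2)    # VRAM is the limiting factor here
preload_fraction = 2100 / 3600               # phase 2 preload vs. phase 3 total
print(limit, round(preload_fraction, 2))
```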
____________

Erich56
Message 59280 - Posted: 20 Sep 2022 | 10:08:45 UTC - in response to Message 59254.

Erich56 asked:

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

I tried it now - the two tasks running on a RTX3070 each - on Windows - did not survive a reboot :-(

Erich56
Message 59281 - Posted: 20 Sep 2022 | 11:58:10 UTC

Since I upgraded the RAM of one of my PCs from 64GB to 128GB yesterday (so now I have a 64GB Ramdisk plus 64GB system RAM; before it was half of each), every GPUGRID Python task has failed on this PC, which has 2 RTX3070 inside.

The task starts okay, RAM as well as VRAM is filling up continuously, also the CPU usage is close to 100%, and after a while (a few minutes up to half an hour) the task fails.
The BOINC manager says "aborted by the project", and the task description says "aufgegeben" = abandoned or so.

Interestingly, no times are shown, neither runtime nor CPU time, further there is no stderr.

See this example:

https://www.gpugrid.net/result.php?resultid=33044774

on another machine, I have two tasks running simultaneously on one GPU - no problem at all.

I was of course thinking of a defective RAM module; however, all night long I had 5 LHC ATLAS tasks (3 cores each) running simultaneously without any problem. So I guess this was enough of a RAM test.

Also hundreds of WCG GPU tasks were processed this morning for hours, also without any problem.

Does anyone have any ideas?

Diplomat
Message 59285 - Posted: 20 Sep 2022 | 18:03:37 UTC - in response to Message 59153.
Last modified: 20 Sep 2022 | 18:09:57 UTC


<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


I'm new to config editing :) a few more questions

Do I need to be more specific in <name> tag and put full application name like Python apps for GPU hosts 4.03 (cuda1131) from task properties?

Because I don't see 3 CPUs being given to the task after a client restart


Application: Python apps for GPU hosts 4.03 (cuda1131)
Name: e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State: Running
Received: Tue 20 Sep 2022 10:48:34 PM +05
Report deadline: Sun 25 Sep 2022 10:48:34 PM +05
Resources: 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:48:32
CPU time since checkpoint: 00:00:07
Elapsed time: 00:11:37
Estimated time remaining: 50d 21:42:09
Fraction done: 1.990%
Virtual memory size: 18.16 GB
Working set size: 5.88 GB
Directory: slots/8
Process ID: 5555
Progress rate: 6.840% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu

KAMasud
Message 59286 - Posted: 20 Sep 2022 | 18:32:36 UTC - in response to Message 59260.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Maybe the five-minute pause at 2% is what is causing the confusion.

Keith Myers
Message 59287 - Posted: 20 Sep 2022 | 19:22:29 UTC - in response to Message 59281.


Anyone and ideas ?

Get rid of the ram disk.

Keith Myers
Message 59288 - Posted: 20 Sep 2022 | 19:25:45 UTC - in response to Message 59285.
Last modified: 20 Sep 2022 | 19:26:22 UTC


<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


I'm new to config editing :) a few more questions

Do I need to be more specific in <name> tag and put full application name like Python apps for GPU hosts 4.03 (cuda1131) from task properties?

Because I don't see 3 CPUs been given to the task after client restart


Application: Python apps for GPU hosts 4.03 (cuda1131)
Name: e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State: Running
Received: Tue 20 Sep 2022 10:48:34 PM +05
Report deadline: Sun 25 Sep 2022 10:48:34 PM +05
Resources: 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:48:32
CPU time since checkpoint: 00:00:07
Elapsed time: 00:11:37
Estimated time remaining: 50d 21:42:09
Fraction done: 1.990%
Virtual memory size: 18.16 GB
Working set size: 5.88 GB
Directory: slots/8
Process ID: 5555
Progress rate: 6.840% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu



Any already downloaded task will see the original cpu-gpu resource assignment.

Any newly downloaded task will show the NEW task assignment.

The name for the tasks is PythonGPU as you show.

You should always refer to the client_state.xml file, as it is the final arbiter of the correct naming and task configuration.
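A quick way to read the application names out of client_state.xml, as a rough sketch only. It assumes the file holds `<app>` elements, each with a short `<name>` child, which is the value app_config.xml has to match. The sample document below is a trimmed, hypothetical stand-in; `example_app` is a made-up entry.

```python
import xml.etree.ElementTree as ET


def list_app_names(client_state_xml: str) -> list[str]:
    """Return the short <name> of every <app> element found in the document."""
    root = ET.fromstring(client_state_xml)
    return [app.findtext("name", default="").strip() for app in root.iter("app")]


# Trimmed, hypothetical sample; a real client_state.xml is far larger.
SAMPLE = """
<client_state>
  <app>
    <name>PythonGPU</name>
    <user_friendly_name>Python apps for GPU hosts</user_friendly_name>
  </app>
  <app>
    <name>example_app</name>
    <user_friendly_name>Some other application</user_friendly_name>
  </app>
</client_state>
"""

print(list_app_names(SAMPLE))
```

To run it against a real host, read the file from the BOINC data directory and pass its contents to `list_app_names`.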

Keith Myers
Message 59289 - Posted: 20 Sep 2022 | 19:29:24 UTC - in response to Message 59286.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Maybe, it might be the five-minute break at 2% which might be causing the confusion.


If you interrupt the task in its Stage 1 of downloading and unpacking the required support files, it may fail on Windows upon restart.

It normally shows the failure for this reason in the stderr.txt.

Best to interrupt the task once it is actually calculating and after its setup and has produced at least one checkpoint.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59290 - Posted: 20 Sep 2022 | 19:52:08 UTC - in response to Message 59287.


Anyone and ideas ?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Keith Myers
Message 59291 - Posted: 20 Sep 2022 | 20:40:28 UTC - in response to Message 59290.


Anyone and ideas ?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting. Reduce to the most basic, absolute needed configuration for the tasks to complete correctly and then add back in one extra superfluous element at a time until the tasks fail again.

Then you have identified why the tasks fail.

Diplomat
Send message
Joined: 1 Sep 10
Posts: 15
Credit: 859,488,649
RAC: 428,566
Level
Glu
Scientific publications
watwatwat
Message 59292 - Posted: 21 Sep 2022 | 16:42:56 UTC

Keith Myers thanks!

Diplomat
Message 59293 - Posted: 22 Sep 2022 | 2:53:36 UTC

In my case config didn't want to work until I added <max_concurrent>


<app_config>

<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


Now I see as expected status: Running (3 CPUs + 1 NVIDIA GPU)

Unfortunately, it doesn't help to get high GPU utilization. Completion time looks like it's going to be slightly better, though.

Keith Myers
Message 59294 - Posted: 22 Sep 2022 | 4:23:38 UTC - in response to Message 59293.

In my case config didn't want to work until I added <max_concurrent>


<app_config>

<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


Now I see as expected status: Running (3 CPUs + 1 NVIDIA GPU)

Unfortunately, it doesn't help to get high GPU utilization. Completion time looks like it's going to be slightly better, though.


If you have enough cpu for support and enough VRAM on the card, you can get better gpu utilization by moving to 2X tasks on the card. Just change the gpu_usage to 0.5
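The arithmetic behind that suggestion, as a small sketch. It assumes the client fits as many tasks on one card as the `gpu_usage` fraction allows and reserves `cpu_usage` threads for each of them; the function names are made up for illustration.

```python
import math


def tasks_per_gpu(gpu_usage: float) -> int:
    """How many tasks fit on one GPU when each one claims `gpu_usage` of it."""
    # The small epsilon guards against float rounding for values such as 0.1.
    return math.floor(1.0 / gpu_usage + 1e-9)


def cpu_threads_reserved(gpu_usage: float, cpu_usage: float, n_gpus: int = 1) -> float:
    """CPU threads budgeted for all GPU tasks that run at the same time."""
    return tasks_per_gpu(gpu_usage) * cpu_usage * n_gpus


print(tasks_per_gpu(0.5))              # two tasks share the card
print(cpu_threads_reserved(0.5, 3.0))  # CPU threads budgeted for them
```

So with gpu_usage 0.5 and cpu_usage 3.0 the host needs to be able to spare six threads per card, plus enough VRAM for two tasks.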

Erich56
Message 59297 - Posted: 22 Sep 2022 | 18:44:15 UTC - in response to Message 59291.
Last modified: 22 Sep 2022 | 18:47:30 UTC


Anyone and ideas ?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting. Reduce to the most basic, absolute needed configuration for the tasks to complete correctly and then add back in one extra superfluous element at a time until the tasks fail again.

Then you have identified why the tasks fail.


I installed a RAMdisk because quite often I am crunching tasks which write many GB of data to the disk. E.g. LHC-Atlas, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX3070, this would be 400GB/day).
So, if the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime.

Over the years, my experience with RAMdisk has been a good one. No idea what kind of problem the GPUGRID Pythons have with this particular RAMDisk - or vice versa. As said, on another machine with RAMDisk I also have 2 Pythons running concurrently, even on one GPU, and it works fine.

So what I did yesterday evening was letting only one of the two RTX3070 crunch a Python. On the other GPU, I sometimes crunched WCG or nothing at all.
This evening, after about 22-1/2 hours, the Python finished successfully :-)
BTW - beside the Python, 3 ATLAS tasks 3 cores ea. were also running all the time.

Which means that what I know so far is that I can obviously run Pythons on at least one of the two RTX3070, and other projects on the other one.
Still I will try to further investigate why GPUGRID Pythons don't run on both RTX3070.

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59307 - Posted: 24 Sep 2022 | 17:46:41 UTC - in response to Message 56977.
Last modified: 24 Sep 2022 | 17:49:29 UTC

I do not know how to properly mention the project administrators in this topic in order to draw attention to the problem of non-optimal use of disk space by this application.
Only now did I notice what the slotX directory contains while a task is running.
I was very surprised to see there, in addition to the unpacked application files, the archive itself from which these files are unpacked/unzipped. Moreover, the archive is present in two copies at once, apparently due to the suboptimal process of unpacking the tar.gz format.
Here you can see that the application's files themselves occupy only half of the working directory volume (slotX).



Apparently, when the application starts, the following happens:
1) The application's source archive (pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17) is copied from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\).
2) Then the archive is unzipped (tar.gz >> tar).
3) At the last stage, the application files are unpacked from the tar container.
At the end of the process, the now unnecessary tar and tar.gz files are (for some reason) not deleted from the working directory.
Thus, not only does each instance of this WU need ~16 GiB of space at its peak, but this volume stays occupied until the WU completes.

The whole process requires both much more time (copying and unpacking) and a larger amount of written data.
Project tar.gz >> slotX (2,66 GiB) >> tar (5,48 GiB) >> app files (5,46 GiB) = 13,6 GiB
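The sum can be checked with a few lines. This is only a back-of-the-envelope sketch using the sizes quoted in this post; the function names are made up.

```python
# Sizes in GiB, as reported in this post.
TAR_GZ = 2.66     # archive copied into the slot directory
TAR = 5.48        # intermediate tar container
APP_FILES = 5.46  # unpacked application files


def written_two_step() -> float:
    """Data written when the archive is copied, unzipped, then unpacked."""
    return round(TAR_GZ + TAR + APP_FILES, 2)


def written_direct() -> float:
    """Data written when files are unpacked straight from the project directory."""
    return APP_FILES


print(written_two_step(), written_direct())
print(round(written_two_step() - written_direct(), 2), "GiB of writes avoided per task")
```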

Both parameters can be significantly reduced by unpacking files directly into the working directory from the source archive, without all mentioned intermediate stages.
7za, which is used for unzipping/unpacking the archives, supports pipelining:


7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\"

Project tar.gz >> app files (5,46 GiB) = 5,46 GiB !
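The same single-pass idea can be illustrated outside of 7za. The sketch below uses Python's standard tarfile module in stream mode (`r|gz`), which decompresses and unpacks in one go without ever writing an intermediate .tar file. It is not what the GPUGRID wrapper runs; the helper names and the tiny sample archive are made up for the demonstration.

```python
import io
import os
import tarfile
import tempfile


def extract_streaming(archive_path: str, dest_dir: str) -> None:
    """Decompress and unpack a .tar.gz in one pass; no .tar file is written."""
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            tar.extract(member, path=dest_dir)


def make_sample_archive(path: str) -> None:
    """Write a tiny .tar.gz holding one text file (stand-in for the real archive)."""
    payload = b"hello from the packed environment\n"
    info = tarfile.TarInfo(name="readme.txt")
    info.size = len(payload)
    with tarfile.open(path, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def demo() -> list[str]:
    """Unpack the sample archive into a fresh 'slot' directory and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "sample.tar.gz")  # plays the project directory copy
        slot = os.path.join(tmp, "slot")              # plays \slots\X\
        os.mkdir(slot)
        make_sample_archive(archive)
        extract_streaming(archive, slot)
        return sorted(os.listdir(slot))


print(demo())
```

The slot directory ends up with the application files only, which is the effect the piped 7za command is aiming for.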


Moreover, if the archive used the 7z format instead of tar.gz (LZMA2 + "5 - Normal" profile, which is the default for recent 7-zip versions), you would not only seriously reduce the amount of data downloaded by each user (and as a consequence the load on the project's bandwidth), but also speed up the process of unpacking data from the archive.

Saving more than one GiB:



On my computer, unpacking by pipelining(as mentioned above) using the current(12 years old) 7za version(9.20) takes ~100 seconds.
And when using the recent version of 7za(22.01) only ~ 45-50 seconds.

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\"


I believe that the result of the described changes makes them worth implementing (even if not all and/or not at once).
Moreover, all the changes boil down to updating one executable file, repacking the archive and changing the command used to unpack it.

Keith Myers
Message 59308 - Posted: 24 Sep 2022 | 21:32:01 UTC

I believe the researcher has already been down this road with Windows not natively supporting the compression/decompression algorithms you mention.

It requires each volunteer to add support manually to their hosts.

In the quest for compatibility, a researcher tries to package applications for all attached hosts to run natively without jumping through hoops so that everyone can run the tasks.

[CSF] Aleksey Belkov
Message 59309 - Posted: 24 Sep 2022 | 21:53:22 UTC - in response to Message 59308.
Last modified: 24 Sep 2022 | 21:56:53 UTC

It requires each volunteer to add support manually to their hosts.

No
Unfortunately, you have inattentively read what I wrote above.
It has already been mentioned there that the Windows app currently already comes with 7za.exe version 9.20 (you can find it in the project folder).
So nothing changes.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59310 - Posted: 25 Sep 2022 | 2:31:01 UTC
Last modified: 25 Sep 2022 | 2:42:26 UTC

Yes, I do have GPUGrid installed on my Win10 machine after all.
And 7za.exe is in the project folder, just not in the project folder on my Ubuntu machine.

Keith Myers
Message 59311 - Posted: 25 Sep 2022 | 3:07:56 UTC - in response to Message 59309.

It requires each volunteer to add support manually to their hosts.

No
Unfortunately, you have inattentively read what I wrote above.
It has already been mentioned there that the Windows app currently already comes with 7za.exe version 9.20 (you can find it in the project folder).
So nothing changes.

OK, so you can thank Richard Haselgrove for the application to now package that utility. Originally, the tasks failed because Windows does not come with that utility and Richard helped debug the issue with the developer.

If you think the application is not using the utility correctly you should inform the developer of your analysis and code fix so that other Windows users can benefit.

[CSF] Aleksey Belkov
Message 59312 - Posted: 25 Sep 2022 | 10:39:15 UTC - in response to Message 59311.

you should inform the developer of your analysis and code fix so that other Windows users can benefit.

I have already sent abouh a PM about this thread, just in case.

abouh
Message 59335 - Posted: 26 Sep 2022 | 7:33:56 UTC - in response to Message 59307.
Last modified: 26 Sep 2022 | 7:45:45 UTC

Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:



Change A --> As you say, the original file .tar.gz is first copied to the working directory and then unpacked in a 2-step process (tar.gz to tar and tar to plain files) and the tar.gz and tar files lie around after that. You suggest that these files should be deleted to save space and I agree, makes sense. Probably the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to .tar
3) delete .tar.gz file
4) unpack .tar file to plain files
5) delete .tar file
This one is straightforward to implement.
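For illustration, steps 2 to 5 could look like this in plain Python. This is only a sketch of the sequence, not the way the wrapper does it (the wrapper runs 7za.exe tasks from job.xml); the function name and the intermediate file name `unpacked.tar` are made up, and step 1 is left to the BOINC client.

```python
import gzip
import os
import shutil
import tarfile
import tempfile


def unpack_with_cleanup(tar_gz_path: str, work_dir: str) -> None:
    """Two-step unpack that deletes each archive as soon as it is no longer needed."""
    tar_path = os.path.join(work_dir, "unpacked.tar")  # hypothetical intermediate name
    # Step 2: .tar.gz -> .tar
    with gzip.open(tar_gz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Step 3: the compressed archive is no longer needed
    os.remove(tar_gz_path)
    # Step 4: .tar -> plain files
    with tarfile.open(tar_path, mode="r:") as tar:
        tar.extractall(work_dir)
    # Step 5: the tar container is no longer needed either
    os.remove(tar_path)


def demo() -> list[str]:
    """Run the sequence on a tiny sample archive and list what is left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "hello.txt")
        with open(source, "w") as handle:
            handle.write("hi\n")
        work_dir = os.path.join(tmp, "slot")
        os.mkdir(work_dir)
        archive = os.path.join(work_dir, "env.tar.gz")
        with tarfile.open(archive, mode="w:gz") as tar:
            tar.add(source, arcname="hello.txt")
        unpack_with_cleanup(archive, work_dir)
        return sorted(os.listdir(work_dir))


print(demo())
```

After the call only the unpacked files remain in the working directory, so the peak footprint is the tar plus the app files rather than all three copies.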




Change B --> Additionally, you also suggest replacing the copying and the 2-step unpacking process with a single-step process, using the command line you propose. So the sequence would be further simplified to:
1) unpack .tar.gz to plain files
2) delete .tar.gz file
The only problem I see here is that I believe I cannot modify the step of first copying the files from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\). It is general for all projects, even for the ones that do not contain files to be unpacked later. So, in order not to mess with other GPUgrid projects, the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to plain files
3) delete .tar.gz file

In this case, would the command line simply be this one, without the -o flag?

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar





Change C --> Finally, you suggest using .7z compression instead of .tar.gz, together with a more recent version of 7za, to reduce the download size and the unpacking time.


Is all the above correct?

I believe these changes are worth implementing, thank you very much. I will try to start with Change A and Change B and roll them out to PythonGPUbeta first to test them this week.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59336 - Posted: 26 Sep 2022 | 9:26:17 UTC - in response to Message 59335.
Last modified: 26 Sep 2022 | 9:26:45 UTC

Looks good to me. Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za? I think it's unlikely to affect us, but it would be good to check, just in case.

I mention it, because the original trial runs used native Windows tar decompression (the same as the Linux implementation): but that was only introduced in later versions of Windows 10 and 11. Some of us (myself included) still use Windows 7, which supports 7z but not tar. A reasonable degree of backwards compatibility is desirable!

[CSF] Aleksey Belkov
Message 59337 - Posted: 26 Sep 2022 | 12:00:48 UTC - in response to Message 59335.
Last modified: 26 Sep 2022 | 12:05:18 UTC

Hi, abouh!

Change A:
You are correct.

Change B
You are correct.
2) If this can't be changed or is too hard / long to implement - no big deal.
In any case, pipelining still saves some time and space : )


in this case, would the command line would be simply this one? without the -o flag?

Of course, if you launch 7za from the working directory (/slots/X), then the output flag is not necessary.

Change C
You are correct.
Using the 7z format (LZMA2 compression) significantly reduces the archive size, which saves your bandwidth and some time in the unpacking/unzipping process ; )
As I wrote above, the 7za command will be simplified, since the pipelining process will no longer be required.
NB! It is important to update the supplied 7za to the current version: since version 9.20, a lot of optimizations have been made for compression/decompression of 7z archives (LZMA).


Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za?

As mentioned on the 7-Zip homepage, the app supports all Windows versions since Windows 2000:

7-Zip works in Windows 10 / 8 / 7 / Vista / XP / 2019 / 2016 / 2012 / 2008 / 2003 / 2000.

abouh
Message 59340 - Posted: 26 Sep 2022 | 13:39:36 UTC
Last modified: 26 Sep 2022 | 13:42:42 UTC

As a very first step I am trying to remove the .tar.gz file. I am encountering a first issue. The steps of the jobs are specified in the job.xml file in the following way:

<job_desc>

<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 -y</command_line>
</task>

<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar -y</command_line>
</task>

....

</job_desc>


Essentially I need to execute a task that removes the pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 file after the very first task.

When I try in the Windows command prompt:

cmd.exe /C "del pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"


it works. However when I add to the job.xml file

<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>


The wrapper seems to ignore it. Doesn't the wrapper have cmd.exe? I need to run more tests to figure out the exact command to delete files
____________

[CSF] Aleksey Belkov
Message 59341 - Posted: 26 Sep 2022 | 14:09:41 UTC - in response to Message 59340.

<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>

Try to use %COMSPEC% variable as alias to %SystemRoot%\system32\cmd.exe
If this doesn't work, then I'm sure specifying the full path(C:\Windows\system32\cmd.exe) should work.
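The lookup order suggested here (COMSPEC first, then the system directory) can be sketched as below. Whether the BOINC wrapper expands environment variables inside job.xml is not settled in this thread, so treat this purely as an illustration of the fallback logic; `resolve_cmd_exe` is a made-up helper name and `C:\Windows` is only the assumed last-resort default.

```python
import ntpath


def resolve_cmd_exe(env: dict) -> str:
    """Pick the cmd.exe path: COMSPEC if set, else <SystemRoot>\\system32\\cmd.exe."""
    comspec = env.get("COMSPEC")
    if comspec:
        return comspec
    # Fall back to the system directory; C:\Windows is an assumed default only.
    system_root = env.get("SystemRoot", r"C:\Windows")
    return ntpath.join(system_root, "system32", "cmd.exe")


print(resolve_cmd_exe({"COMSPEC": r"D:\WinNT\system32\cmd.exe"}))
print(resolve_cmd_exe({"SystemRoot": r"D:\WinNT"}))
print(resolve_cmd_exe({}))
```

On a real host one would pass `dict(os.environ)`; `ntpath` is used so the sketch joins paths the Windows way on any OS.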

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59343 - Posted: 26 Sep 2022 | 14:50:29 UTC

in other news. looks like we've finally crunched through all the tasks ready to send. all that remains are the ones in progress and the resends that will come from those.

any more coming soon?
____________

abouh
Message 59347 - Posted: 27 Sep 2022 | 13:53:43 UTC - in response to Message 59341.
Last modified: 27 Sep 2022 | 14:06:09 UTC

True! Specifying the whole path works:

<job_desc>

<task>
<application>C:\Windows\system32\cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>

</job_desc>


I have deployed Change A to the PythonGPUbeta app, just to test whether it works on all Windows machines. I just sent a few (32) jobs. If it works fine, I will move on to introduce the other changes.
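For anyone who wants to experiment with the task list, here is a rough sketch that builds a job.xml with the full Change A sequence. The element names are the ones shown in this thread; the fourth task (deleting the .tar) is extrapolated from the plan and has not been posted, and `build_job_xml` is a made-up helper, not part of BOINC.

```python
import xml.etree.ElementTree as ET

ARCHIVE = "pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"
TAR = "pythongpu_windows_x86_64__cuda1131.tar"
CMD = r"C:\Windows\system32\cmd.exe"

# (application, command_line) pairs for Change A; the last entry is extrapolated.
CHANGE_A_TASKS = [
    (r".\7za.exe", "x .\\" + ARCHIVE + " -y"),
    (CMD, '/C "del .\\' + ARCHIVE + '"'),
    (r".\7za.exe", "x .\\" + TAR + " -y"),
    (CMD, '/C "del .\\' + TAR + '"'),
]


def build_job_xml(tasks) -> str:
    """Serialise (application, command_line) pairs into a <job_desc> document."""
    root = ET.Element("job_desc")
    for application, command_line in tasks:
        task = ET.SubElement(root, "task")
        ET.SubElement(task, "application").text = application
        ET.SubElement(task, "command_line").text = command_line
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


print(build_job_xml(CHANGE_A_TASKS))
```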
____________

abouh
Message 59348 - Posted: 27 Sep 2022 | 14:02:54 UTC - in response to Message 59343.
Last modified: 27 Sep 2022 | 14:05:21 UTC

I will be running new experiments shortly. My idea is to use the whole capacity of the grid. I have already noticed that a few months ago it could absorb around 800 tasks and now it goes up to 1000! Thank you for all the support :)
____________

abouh
Message 59354 - Posted: 28 Sep 2022 | 7:10:17 UTC

The first batch I sent to PythonGPUbeta yesterday failed, but I figured out the problem this morning. I just sent another batch an hour ago to the PythonGPUbeta app. This time it seems to be working. It has Change A implemented, so disk usage is more optimised.
____________

abouh
Message 59356 - Posted: 28 Sep 2022 | 7:49:54 UTC - in response to Message 59337.
Last modified: 28 Sep 2022 | 8:50:11 UTC

Hello Aleksey!

I was looking at how to implement Change C, namely whether we can encode and decode the task conda-environment files using the 7zip format and recent versions of 7za.exe.

We use conda-pack to compress the conda environment that we later unpack in the gpugrid windows machines using 7za.exe.

However, looking at the documentation, it seems 7zip is not a format conda-pack can deal with. https://conda.github.io/conda-pack/cli.html

Apparently the possible formats include: zip, tar.gz, tgz, tar.bz2, tbz2, tar.xz, txz, tar, parcel (?), squashfs (?)

So in case of switching from the current tar.gz, we could only go to one of these. Maybe tbz2 or txz? It seems these could be unpacked in a single step as well, if recent versions of 7za.exe can handle these formats.

Any recommendation? :)

For tbz2 the file size is similar, slightly smaller. The txz file is substantially smaller but took forever (30 mins) to compress.
2.0G pythongpu_windows_x86_64__cuda102.tar.gz
1.9G pythongpu_windows_x86_64__cuda102.tbz2
1.2G pythongpu_windows_x86_64__cuda102.txz
____________

Ian&Steve C.
Message 59357 - Posted: 28 Sep 2022 | 12:25:25 UTC

more tasks? I'm running dry ;)
____________

Keith Myers
Message 59358 - Posted: 28 Sep 2022 | 15:39:18 UTC

More tasks please, also.

bibi
Send message
Joined: 4 May 17
Posts: 15
Credit: 16,302,257,990
RAC: 10,855,198
Level
Trp
Scientific publications
watwatwatwatwat
Message 59359 - Posted: 28 Sep 2022 | 16:13:06 UTC - in response to Message 59356.

Hi,

Why not produce a zip file? The BOINC client can unzip such a file directly from the project folder into the slot, as with acemd3.
If that works, 7za.exe and these extra tasks are not necessary.

pythongpu_windows_x86_64__cuda1131.zip has 2,58 GB
pythongpu_windows_x86_64__cuda1131.tar.gz has 2,66 GB

[CSF] Aleksey Belkov
Message 59360 - Posted: 28 Sep 2022 | 18:26:19 UTC - in response to Message 59356.
Last modified: 28 Sep 2022 | 18:42:56 UTC

Good day, abouh

This time seems to be working. It has Change A implemented,

It's nice to hear that!

Maybe tbz2 or txz?

As I understand it, tbz2/txz are aliases for the tar.bz2/tar.xz file extensions.
So in fact these formats are tar containers compressed with bz2 or xz.
Therefore, this will require the pipelining process, which, however, has practically no effect on the unpacking speed and only lengthens the command string.
In my test, unpacking the tar.xz was done in ~40 seconds.

seems like this ones we can unpacked in a single step as well, if recent versions 7za.exe allow to handle this format.

The xz format has been supported since version 9.04 beta, but more recent versions support multi-threaded (de)compression, which is crucial for fast unpacking.


The txz file is substantially smaller but took forever (30 mins) to compress.

This format uses the LZMA2 algorithm, the same one 7z uses by default. So the space saving should be the same with the same settings (--compress-level).
It's highly likely that you forgot to use this flag
--n-threads <n>, -j <n>

to set the number of threads to use for compression. By default conda-pack uses only 1 thread!
Also check --compress-level. Levels higher than 5 are not so effective in terms of compression time versus archive size.
Since I think PythonGPU's app files rarely change, the compression time is not a big deal.
As far as I remember, the level has (practically) no effect on unpacking speed.
In my test (32 threads / Threadripper 2950X), it took ~2,5 minutes with compress-level 5 (archive size 1,55 GiB).
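Put together, the packing call could be assembled like this. It is a sketch only: the environment name `pythongpu`, the output file name and the thread count are placeholders, and the helper just builds the argument list so nothing is executed here. `--n-threads` and `--compress-level` are the flags discussed above; `--name`, `--output` and `--format` are the usual conda-pack options for choosing the environment, the target file and the archive type.

```python
import shlex


def build_conda_pack_command(env_name: str, output: str,
                             compress_level: int = 5, n_threads: int = 8) -> list[str]:
    """Assemble a conda-pack call that writes a multi-threaded tar.xz archive."""
    return [
        "conda-pack",
        "--name", env_name,
        "--output", output,
        "--format", "tar.xz",
        "--compress-level", str(compress_level),
        "--n-threads", str(n_threads),
    ]


# Placeholder names; adjust to the real environment and file.
command = build_conda_pack_command("pythongpu", "pythongpu_windows_x86_64.tar.xz", n_threads=16)
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where conda-pack is installed.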

[CSF] Aleksey Belkov
Message 59361 - Posted: 28 Sep 2022 | 18:56:09 UTC - in response to Message 59359.
Last modified: 28 Sep 2022 | 19:47:06 UTC

why not producing a zip file, because the boinc client can unzip such file direct from the project folder to the slot like with acemd3.

You're probably right.
I somehow didn't pay attention to acemd3 archives in project directory.
Is there some info on how BOINC works with archives?
I suppose boinc-client uses its built-in library to work with archives (zlib ?), rather than some OS functions/tools.

There's still a dilemma:
1) On the one hand, using the zip format will simplify the process of launching the application and reduce the amount of disk space it requires (no need to copy the archive to the working directory). The amount of data written to disk is reduced accordingly.
2) On the other hand, the xz format reduces the archive size by a whole GiB, which helps to save the project's network bandwidth and the time needed to download the necessary files when a user first attaches to the project.

[CSF] Aleksey Belkov
Message 59362 - Posted: 28 Sep 2022 | 19:56:28 UTC - in response to Message 59360.

On my test(32 threads / Threadripper 2950X), it took ~2,5 minutes with compress-level 5(archive size 1,55 GiB).

It's about compression*

abouh
Message 59364 - Posted: 29 Sep 2022 | 14:16:34 UTC - in response to Message 59359.
Last modified: 29 Sep 2022 | 14:49:18 UTC

We tried to pack files with zip at first but encountered problems in windows. Not sure if it was some kind of strange quirk in the wrapper or in conda-pack (the tool for creating, packing and unpacking conda environments, https://conda.github.io/conda-pack/), but the process failed for compressed environment files above a certain memory size.

We then tried to use another format that could compress the files to a smaller size than .zip. We tried .tar but not all Windows versions have tar.exe (old ones do not).

We finally found this solution of sending 7za.exe along with the conda packed environment to be able to unpack it as part of the job.

I am not 100% sure, but I suspect acemd3 does not use the PyTorch machine learning Python framework, which substantially increases the size of the packed environment. And I believe acemd4 does use PyTorch, and faces the same issue as the PythonGPU tasks.
____________

abouh
Message 59365 - Posted: 29 Sep 2022 | 14:54:43 UTC - in response to Message 59362.

You were absolutely right, I forgot the number of threads! I could now reproduce a much faster compression as well.

I will proceed to test if I can use the BOINC wrapper and a newer version of 7za.exe to unpack it locally in a reasonable amount of time and then will deploy it to PythonGPUbeta for testing.

Thank you very much!
____________

bibi
Message 59366 - Posted: 29 Sep 2022 | 18:21:05 UTC - in response to Message 59365.

Hi abouh,

the provided 7za.exe has version 9.20 from 2010. The last version on 7-zip.org is 22.01 (now 7z.exe).
If you want to unpack in a pipe or delete the tar file, you need cmd. But the starter used, wrapper_6.1_windows_x86_64.exe (see project folder), doesn't know about the environment, and the Windows folder isn't necessarily c:\windows, so you also should provide cmd.exe.
Unpacking in a pipe:
<task>
<application>.\cmd.exe</application>
<command_line>/c .\7za.exe -so x pythongpu_windows_x86_64__cuda1131.tar.xz | .\7za.exe -y -sifile.txt.tar x & exit</command_line>
<weight>1</weight>
</task>

Why conda-pack with format zip is not working I don't know.

bibi
Message 59367 - Posted: 29 Sep 2022 | 18:56:26 UTC - in response to Message 59366.

7z.exe calls the dll, 7za.exe stands alone. You find it in 7-Zip Extra on https://7-zip.org/download.html
But your version works too.

[CSF] Aleksey Belkov
Message 59368 - Posted: 29 Sep 2022 | 21:01:20 UTC - in response to Message 59366.


the provided 7za.exe has version 9.20 from 2010. The last version on 7-zip.org is 22.01


7z.exe calls the dll, 7za.exe stands alone. You find it in 7-Zip Extra on https://7-zip.org/download.html

All this has already been discussed by several posts above.
If you had read before writing...

so you also should provide cmd.exe.

I think this is not a good idea.
Some antiviruses may perceive an attempt to launch cmd.exe not from the system directory as suspicious/malicious activity.

abouh
Message 59369 - Posted: 30 Sep 2022 | 5:44:38 UTC
Last modified: 30 Sep 2022 | 5:47:12 UTC

I added the discussed changes and deployed them to the PythonGPUbeta app. More specifically:

1. I changed the 7za.exe executable to (I believe) the latest version. A much newer one than the one previously used in any case.

2. I now compress the conda-environment files to .txz. I use the default --compress-level (4), because I tried with 9 and the compressed file size was the same.

As Aleksey mentioned, the unpacking still needs to be done in 2 steps, but at least now the sent files are smaller due to a more efficient compression.

Did anyone catch any of the PythonGPUbeta jobs? They seemed to work

Regarding what bibi mentioned, /Windows/System32/cmd.exe seems to be present on all Windows machines so far, or at least I have not seen any job failing because of this. I have sent 64 test jobs in total.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59370 - Posted: 30 Sep 2022 | 5:48:53 UTC - in response to Message 59369.

No, I haven't been lucky enough yet to snag any of the beta tasks.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59371 - Posted: 30 Sep 2022 | 7:38:17 UTC
Last modified: 30 Sep 2022 | 7:42:38 UTC

One of my Linux machines has just crashed two tasks in succession with

UnboundLocalError: local variable 'features' referenced before assignment

https://www.gpugrid.net/results.php?hostid=508381

Edit - make that three. And a fourth looks to be heading in the same direction - many other users have tried it already.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59372 - Posted: 30 Sep 2022 | 8:25:10 UTC - in response to Message 59371.

Thanks for the warning, Richard. I have just fixed the error; it should no longer appear in jobs starting a few minutes from now.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59373 - Posted: 30 Sep 2022 | 9:11:53 UTC - in response to Message 59372.

Yes, the next one has got well into the work zone - 1.99%. Thank you.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59374 - Posted: 30 Sep 2022 | 9:28:33 UTC
Last modified: 30 Sep 2022 | 9:29:46 UTC

Just an observation.
BOINC does not seem to count a GPUGrid task as a task. Yesterday my finger brushed against Moo's "allow new WUs" and it promptly downloaded 12 WUs. All 12 were running while the GPUGrid task was also running. I have never seen that before. I took remedial action. None errored out.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59375 - Posted: 30 Sep 2022 | 10:37:03 UTC

I tried to run 1 Python on a second BOINC instance.
So far, they have run on the "regular" instance, 1 task each on 2 RTX 3070s, without problems. Runtime was about 22-23 hours.

On the "regular" instance I now run 2 Primegrid tasks, the kind that use only the GPU and no CPU.
Hence, running Pythons in addition would be a nice supplement, since they use a lot of CPU and only part of the GPU.

After I started a Python on the second BOINC instance, everything ran normally for a short while: CPU usage climbed close to 100%, VRAM usage was close to 4 GB, and system RAM some 8 GB.
However, after a few minutes, CPU usage for the Python went down to about 15%. RAM and VRAM usage stayed at the same level as before.
The progress bar in the BOINC Manager showed some 2.980% after about 3 hours. So it was clear that something was going wrong, and I aborted the task.
Stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430

I then started another task, just to rule out that the earlier problem was a one-off. However, the same problem occurred again.

What's going wrong?

FYI, recently I ran altogether 3 Pythons on 2 RTX3070, which means on one of the RTX two Pythons were crunched simultaneously. No problem at all, the total runtime for each of the two tasks was just a little longer than for 1 task per GPU.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59376 - Posted: 30 Sep 2022 | 14:54:53 UTC

My question is, how can 13 tasks run on a 12-thread machine? Is it a good idea to run other tasks? Also, why was Boinc not taking into account the GPUGrid task?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59377 - Posted: 30 Sep 2022 | 15:16:27 UTC - in response to Message 59376.

If the 13th task is assessed - by the project and BOINC in conjunction - to require less than 1.0000 of a CPU, it will be allowed to run in parallel with a fully occupied CPU. For a GPU task, it will run at a slightly higher CPU priority, so it will steal CPU cycles from the pure CPU tasks - but on a modern multitasking OS, they won't notice the difference.
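
A toy model of that rule, for illustration only. It is a deliberate simplification of what is described above, not the BOINC client's actual scheduling code, and the 0.99 budget for the GPU task is a made-up value.

```python
def cpus_committed(task_cpu_budgets):
    """Count only tasks budgeted at one full CPU or more."""
    return sum(budget for budget in task_cpu_budgets if budget >= 1.0)


def fits(task_cpu_budgets, n_threads):
    """True if the counted CPU budget does not exceed the thread count."""
    return cpus_committed(task_cpu_budgets) <= n_threads


# 12 single-CPU tasks plus one GPU task budgeted below 1.0 CPU (hypothetical 0.99).
print(fits([1.0] * 12 + [0.99], 12))  # True: the 13th task is not counted

# The same GPU task budgeted at 3 CPUs no longer fits alongside 12 CPU tasks.
print(fits([1.0] * 12 + [3.0], 12))  # False
```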

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59379 - Posted: 30 Sep 2022 | 16:22:35 UTC - in response to Message 59375.

I tried to run 1 Python on a second BOINC instance.
So far, they have run on the "regular" instance, 1 task each on 2 RTX 3070s, without problems. Runtime was about 22-23 hours.

On the "regular" instance I now run 2 Primegrid tasks, the kind that use only the GPU and no CPU.
Hence, running Pythons in addition would be a nice supplement, since they use a lot of CPU and only part of the GPU.

After I started a Python on the second BOINC instance, everything ran normally for a short while: CPU usage climbed close to 100%, VRAM usage was close to 4 GB, and system RAM some 8 GB.
However, after a few minutes, CPU usage for the Python went down to about 15%. RAM and VRAM usage stayed at the same level as before.
The progress bar in the BOINC Manager showed some 2.980% after about 3 hours. So it was clear that something was going wrong, and I aborted the task.
Stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430

I then started another task, just to rule out that the earlier problem was a one-off. However, the same problem occurred again.

What's going wrong?

FYI, recently I ran altogether 3 Pythons on 2 RTX3070, which means on one of the RTX two Pythons were crunched simultaneously. No problem at all, the total runtime for each of the two tasks was just a little longer than for 1 task per GPU.



i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).
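
The "effective speed" figure is just wall-clock time divided by the number of tasks running side by side. A quick sketch of that arithmetic (the function name is only for illustration):

```python
def effective_hours_per_task(wall_hours, concurrent_tasks):
    """Wall-clock hours per finished task when several tasks run at once."""
    return wall_hours / concurrent_tasks


print(round(effective_hours_per_task(13, 3), 1))  # 4.3
```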

if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.

switch to Linux for even better performance.


jjch
Joined: 10 Nov 13
Posts: 101
Credit: 15,740,982,209
RAC: 1,059,532
Message 59380 - Posted: 1 Oct 2022 | 1:04:27 UTC - in response to Message 59379.

Erich56

For the first two tasks I checked, you didn't let them finish extracting. The others look a bit inconclusive; however, you restarted the tasks, so that could be it.

Leave them alone and let them run. If they stall at 2% for an extended time check the stderr file to see if there is an error that should be addressed.

Look to see if they are actually running or not before you abort. If it's working, it should get to the "Created Learner." step and continue running from there.

There are some jobs that just fail with an unknown cause but these haven't gotten that far yet.

8 GB of system memory is on the low side to run Python apps successfully. It can be done, but you really shouldn't be running anything else.

Also, the Python apps need up to 48 GB of swap space configured on Windows systems. If you haven't already done so, I would suggest increasing it.

Simplify your troubleshooting and cut down on the variables. Run only one Boinc instance and one Python task. See how that goes first.

After you confirm that's working you can possibly run an additional Python task or maybe a different GPU project at the same time.

While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59381 - Posted: 1 Oct 2022 | 6:13:19 UTC - in response to Message 59379.

Ian&Steve C. wrote:

i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).

if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.

switch to Linux for even better performance.

I agree, at the moment it may be "too much at once" :-)

FYI, I recently bought another PC with 2 CPUs (8-c/8-HT each) and 1 GPU, I upgraded the RAM from 128GB to 256GB and created a 128GB Ramdisk;
and on an existing PC with a 10-c/10-HT CPU plus 2 RTX3070 I upgraded the RAM from 64GB to 128GB (=maximum possible on this MoBo).

So no surprise that now I am just testing what's possible. And by doing this, I keep finding out, of course, that sometimes I am expecting too much.

Regarding the (low) speed of my two RTX 3070s: I have always been very conservative about GPU temperatures, which means I run them at about 60-61°C, not higher.
With two such GPUs inside the same box, heat is of course an issue. Despite good airflow, I need to throttle them down to about 50-65% (different for each GPU) to keep them at that temperature. This explains the longer runtimes of the Pythons.
If I had two boxes with 1 RTX 3070 inside each, I am sure that there would be no need for throttling.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59382 - Posted: 1 Oct 2022 | 7:35:58 UTC - in response to Message 59380.

jjch wrote:

Erich56

For the first two tasks I checked, you didn't let them finish extracting. The others look a bit inconclusive; however, you restarted the tasks, so that could be it.

Leave them alone and let them run. If they stall at 2% for an extended time check the stderr file to see if there is an error that should be addressed.

Look to see if they are actually running or not before you abort. If it's working, it should get to the "Created Learner." step and continue running from there.

There are some jobs that just fail with an unknown cause but these haven't gotten that far yet.

8 GB of system memory is on the low side to run Python apps successfully. It can be done, but you really shouldn't be running anything else.

Also, the Python apps need up to 48 GB of swap space configured on Windows systems. If you haven't already done so, I would suggest increasing it.

Simplify your troubleshooting and cut down on the variables. Run only one Boinc instance and one Python task. See how that goes first.

After you confirm that's working you can possibly run an additional Python task or maybe a different GPU project at the same time.

While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either.


Thanks for taking the time to look into my problem.

Well, by now it has become clear to me what caused the failure:
evidently, running a Primegrid GPU task and a Python on the same GPU does not work for the Python. After a Primegrid task finished, I started another Python, and it runs well.

Regarding memory, you may have misunderstood: when I mentioned the 8 GB, I meant that the Windows Task Manager showed Python using 8 GB. Total RAM on this machine is 64 GB, so more than enough.

As for the swap space: I had set this manually to 100 GB min. and 150 GB max., so also more than enough.

Again, the cause has been found anyway. Whereas I had no problem running two Pythons on the same GPU (even 3 might work), it is NOT possible to run a Python alongside a Primegrid task.
So for me, this was a good learning process :-)

Again, thanks anyway for your time investigating my failed tasks.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59383 - Posted: 1 Oct 2022 | 7:56:44 UTC

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, with only 1 Python running, when I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though: the server status page shows some 550 tasks available for download, and I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs inside.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I have now set "store at least" to 10 days of work, and "store up to an additional" also to 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59386 - Posted: 2 Oct 2022 | 10:41:35 UTC - in response to Message 59383.

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, with only 1 Python running, when I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though: the server status page shows some 550 tasks available for download, and I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs inside.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I have now set "store at least" to 10 days of work, and "store up to an additional" also to 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?


Meanwhile, the problem has become even worse:

After downloading 1 Python, it starts, and the BOINC Manager shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine.

Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).

when I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts" which is not correct, there are some 150 available for download at the moment.

Can anyone give me advice how to get this problem solved?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59387 - Posted: 2 Oct 2022 | 17:30:49 UTC

It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.

The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline.
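
As a rough cross-check on the displayed figure, one can extrapolate from progress alone. The sketch below assumes progress advances roughly linearly, which is an assumption on my part and not necessarily how the BOINC client arrives at its own estimate.

```python
def remaining_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of remaining runtime from elapsed time and progress."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1 - fraction_done) / fraction_done


# 45% done after about 10 hours, as reported above.
print(round(remaining_hours(10, 0.45), 1))  # 12.2
```

That is about 12 hours rather than 34 days, which agrees with the tasks actually finishing within a day.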

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59388 - Posted: 2 Oct 2022 | 18:10:14 UTC - in response to Message 59387.

It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.

The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline.

But how come that on three of my other systems, on which I have been running Pythons for a while, the "remaining runtimes" are shown pretty correctly (+/- 24 hours)?

And also on the machine in question, until recently the time was shown correctly.
Something must have happened yesterday, but I do not know what.

If your assumption were right, no BOINC instance could run more than 1 Python in parallel.
Didn't you say somewhere here in the forum that you are running 3 Pythons in parallel? How can a second and a third task be downloaded if the first one shows a remaining runtime of 30 or 60 days?
What are the remaining runtimes shown for your Pythons once they get started?

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,757,047,630
RAC: 1,071,191
Message 59389 - Posted: 2 Oct 2022 | 22:03:50 UTC - in response to Message 59386.
Last modified: 2 Oct 2022 | 22:04:41 UTC

Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to set my Resource Share for GPUGrid much, much higher for it to share effectively with other projects. I originally had Resource Shares of 160 for GPUGrid vs 10 for Einstein and 40 for TN-Grid. Since the Python tasks 'use' so much CPU time in particular (at least reported CPU time), it seems to affect the Resource Share calculations as well. I had to move my Resource Share for GPUGrid (for example) to 2,000 to get it both to do two at once and to get BOINC to share with Einstein and TN-Grid roughly the way I wanted. (Nothing magic about my Resource Share ratios; just providing an example of how extreme I went to get it to balance the way I wanted.)
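
Resource Share values are relative weights, so what matters is each project's fraction of the total. A small sketch of that arithmetic with the numbers above (the helper name is just for illustration):

```python
def share_fractions(shares):
    """Convert relative Resource Share weights into fractions of the total."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


before = share_fractions({"GPUGrid": 160, "Einstein": 10, "TN-Grid": 40})
after = share_fractions({"GPUGrid": 2000, "Einstein": 10, "TN-Grid": 40})

print(round(before["GPUGrid"], 2))  # 0.76
print(round(after["GPUGrid"], 2))   # 0.98
```

So the change moved GPUGrid's nominal share from roughly three quarters to nearly all of the host.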

Regarding the estimated time to completion, I have not seen it correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and now start at 23 days. Interesting to hear some of yours are showing correct values! What setup are you using on the hosts showing correct times?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59390 - Posted: 3 Oct 2022 | 0:49:02 UTC - in response to Message 59388.

No, that was my teammate who is running 3X concurrent on his gpus.

He runs nothing but GPUGrid on those hosts.

I OTOH run multiple projects at the same time on my hosts. So the GPUGrid tasks have to share resources. That is a balancing act.

I run a custom client that allows me to get around the normal BOINC client and project limitations. I can ask for as much or as little work as I want on any host.

Currently, I am running a single task on half a gpu in each host. I tried to run 2X on the gpu but I don't have enough resources to support 2 tasks on the host and run all my other projects at the same time. But the task runs well sharing the gpu with my other gpu projects. Keeps the gpu utilization much higher than if running only the Python task.

The GPUGrid tasks start up with multiple hundreds of days expected before completion. That drops down to only a couple of days once the task gets over 90% completion.

This is what BoincTasks is showing for the 5 tasks I am currently running on my hosts for estimated completion times.

GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00014a06316-ABOU_rnd_ppod_expand_demos25-0-1-RND9172_3 01:05:30 (02:57:04) 90.11 3.970 157d,17:33:34 10/7/2022 4:27:00 PM 3C + 0.5NV (d1) Running High P. Darksider
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00005a00032-ABOU_rnd_ppod_expand_demos25_2-0-1-RND9669_0 13:30:26 (04d,00:21:21) 237.79 34.660 27d,12:31:49 10/7/2022 4:02:16 AM 3C + 0.5NV (d2) Running High P. Numbskull
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00012a04847-ABOU_rnd_ppod_expand_demos25-0-1-RND2344_4 13:27:51 (01d,09:45:50) 83.59 48.520 10d,20:41:45 10/7/2022 4:05:00 AM 3C + 0.5NV (d1) Running High P. Pipsqueek
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00015a05913-ABOU_rnd_ppod_expand_demos25-0-1-RND9942_0 21:04:49 (05d,14:22:40) 212.49 39.610 28d,03:53:33 10/6/2022 8:04:45 PM 3C + 0.5NV (d2) Running High P. Rocinante
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00008a00044-ABOU_rnd_ppod_expand_demos25_2-0-1-RND2891_2 01:23:31 (02:53:39) 69.30 3.970 22d,07:56:42 10/7/2022 4:09:00 PM 3C + 0.5NV (d0) Running High P. Serenity

I'll finish all of the tasks before 24 hours on the high clocked hosts for maximum credit awards. I'll miss out on the 24 hour bonus by a half hour or so on the server hosts because of their slower clocks.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59391 - Posted: 3 Oct 2022 | 6:34:19 UTC - in response to Message 59389.

Regarding the estimated time to completion, I have not seen it correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and now start at 23 days. Interesting to hear some of yours are showing correct values! What setup are you using on the hosts showing correct times?

On one of my hosts a new Python started some 25 minutes ago. "Remaining time" is shown as 13 hrs.
No particular setup. In the past years, this host had crunched numerous ACEMD tasks. Since a few weeks ago, it's crunching Pythons. GTX980Ti. Besides, 2 "Theory" tasks from LHC are running.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59392 - Posted: 3 Oct 2022 | 10:53:17 UTC - in response to Message 59389.

kksplace wrote:

Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to change my Resource Share much, much higher for GPUGrid to effectively share other projects. ...

well, my target on this machine, in fact, is not to share Pythons with other projects.
It would simply make me happy if I could run 2 (or perhaps 3) Pythons simultaneously. The hardware requirements should be sufficient.

That said, I guess in this case the resource share would not play any role.

BTW: as mentioned before, until some time early last week I did run two Pythons simultaneously on this PC. I have no idea, though, what the indicated remaining runtimes were. Most probably not as high as now, otherwise I could not have downloaded and started two Pythons in parallel.

So any idea what I can do to make this machine run at least 2 Pythons (if not 3) ???

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,757,047,630
RAC: 1,071,191
Message 59393 - Posted: 3 Oct 2022 | 17:05:04 UTC - in response to Message 59392.

My technical knowledge is limited, and I can only speak to how I got mine to work with 2 tasks. Sorry I can't help any more. As to getting 3 tasks, my understanding from other posts and my own attempt is that you can't without a custom client or some other behind-the-scenes work. The '2 tasks at one time' limit is a GPUGrid restriction somewhere.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59394 - Posted: 3 Oct 2022 | 17:48:02 UTC - in response to Message 59393.

Yes, the project has a max 2 tasks per gpu limit with project max of 16 tasks.
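
Put as arithmetic, using only the two limits stated here (the function and parameter names are made up for illustration):

```python
def max_tasks(n_gpus, per_gpu=2, project_cap=16):
    """Task allowance: 2 per detected GPU, capped at 16 for the project."""
    return min(n_gpus * per_gpu, project_cap)


print(max_tasks(1))   # 2
print(max_tasks(2))   # 4
print(max_tasks(10))  # 16
```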

You would normally just add an app_config.xml file to get two tasks running concurrently on a gpu.

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

That has been the same quota since project inception. The only way to get around it is to spoof the gpu count via locking down the coproc_info.xml file in the BOINC folder.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59395 - Posted: 3 Oct 2022 | 19:19:15 UTC - in response to Message 59394.

...
<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>
...

Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59396 - Posted: 3 Oct 2022 | 19:33:31 UTC - in response to Message 59395.
Last modified: 3 Oct 2022 | 19:34:47 UTC

...
<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>
...

Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?




Exactly what I said in my previous message.

adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.


What Keith suggested would tell BOINC to reserve 3 whole CPU threads for each running PythonGPU task.
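
To see what those two settings add up to on one host, here is a small sketch. It treats "tasks per GPU" as the whole number of times gpu_usage fits into 1, which is a simplification for illustration; the function name is made up.

```python
def reserved_threads(gpu_usage, cpu_usage, n_gpus=1):
    """CPU threads BOINC would budget for GPU tasks under these settings."""
    tasks_per_gpu = int(1 / gpu_usage)  # 0.5 -> 2 tasks share one GPU
    return tasks_per_gpu * n_gpus * cpu_usage


# gpu_usage 0.5 and cpu_usage 3.0 on a single GPU:
print(reserved_threads(0.5, 3.0))  # 6.0
```

With both tasks running, BOINC would hold back 6 threads from other projects' CPU work.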

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59397 - Posted: 4 Oct 2022 | 7:09:44 UTC

Hello!

Today I will deploy the changes tested last week in PythonGPUbeta to the PythonGPU app. The changes only affect Windows machines, and should result in smaller initial downloads and slightly lower memory requirements.

As we discussed, for now the initial data unpacking still needs to be done in two steps, but using a more recent version of 7za.exe.

I did not detect any errors in the PythonGPUbeta tasks, so hopefully this change will not affect jobs in PythonGPU either.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59398 - Posted: 4 Oct 2022 | 7:24:25 UTC - in response to Message 59395.


Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.

That tells BOINC to not commit resources to other projects that it doesn't have so that you aren't running the cpu overcommitted.

It is only for BOINC scheduling of available resources. It does not directly affect the running of the Python task in any way. Only the scientific application itself determines how much cpu the task and application will use.

You should never run a cpu in overcommitted state because that means that EVERY application including internal housekeeping is constantly fighting for available resources and NONE are running optimally. IOW's . . . . slooooowwwly.

You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.

If you have a cpu that has 16 cores/32 threads available to the OS, you should strive to use only up to 32 threads over the averaging periods.

The uptime command besides printing out how long the system has been up and running also prints out the 1 minute / 5 minute / 15 minute system average loadings.

As an example on this AMD 5950X cpu in this daily driver this is my uptime report.

keith@Pipsqueek:~$ uptime
00:15:16 up 7 days, 14:41, 1 user, load average: 30.16, 31.76, 32.03

The cpu is right at the limit of maximum utilization of its 32 threads.
So I am running it at 100% utilization most of the time.

If the averages were higher than 32, then that shows that the cpu is overcommitted and trying to do too much all the time and not running applications efficiently.
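
The same comparison can be scripted. This is a minimal sketch, not part of BOINC: the helper name is made up, and os.getloadavg() is only available on Unix-like systems, so the live reading is guarded.

```python
import os


def load_ratios(load_averages, logical_cpus):
    """Return each load average as a fraction of the logical CPU count."""
    return [round(load / logical_cpus, 2) for load in load_averages]


# Figures from the uptime report above: 32 threads on the 5950X.
print(load_ratios((30.16, 31.76, 32.03), 32))  # [0.94, 0.99, 1.0]

# On Linux or macOS the live 1/5/15 minute values can be read directly.
if hasattr(os, "getloadavg"):
    print(load_ratios(os.getloadavg(), os.cpu_count() or 1))
```

Ratios that stay above 1.0 correspond to the overcommitted state described above.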

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59399 - Posted: 4 Oct 2022 | 7:28:03 UTC - in response to Message 59397.

Thanks for the notice, abouh. Should make the Windows users a bit happier with the experience of crunching your work.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59400 - Posted: 4 Oct 2022 | 9:41:13 UTC - in response to Message 59398.


Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.

...

You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.
...

thanks, Keith, for the thorough explanation. Now everything is clear to me.
Regarding CPU loading/utilization, so far I have been looking at the Windows Task Manager, which shows a (rough?) percentage at the top of the "CPU" column.

However, for me the question still is how I could get my host with the vast hardware resources (as described here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59383) to run at least 2 Pythons concurrently, as was already the case before.

Isn't there a way to get these much too high "remaining time" figures back to realistic values?
Or any other way to get more than 1 Python downloaded despite these high figures?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59403 - Posted: 4 Oct 2022 | 16:50:27 UTC - in response to Message 59400.
Last modified: 4 Oct 2022 | 16:53:15 UTC


Isn't there a way to get these much too high "remaining time" figures back to realistic values?
Or any other way to get more than 1 Python downloaded despite these high figures?


There isn't any way to get the estimated time remaining down to reasonable values as far as we know without a complete rewrite of the BOINC client code.

Or ask @kksplace how he managed to do it.

Try increasing your work cache to 10 days and see if you pick up the second task.

Are you running with 0.5 gpu_usage via the app_config.xml file example I posted?

You can spoof 2 gpus being detected by BOINC which would automatically increase your gpu task allowance to 4 tasks. You need to modify the coproc_info.xml file and then lock it down to immutable state so BOINC can't rewrite it.

Search for "spoofing GPUs" in the Seti and BOINC forums to find out how to do that.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59404 - Posted: 4 Oct 2022 | 17:21:36 UTC - in response to Message 59403.

Try increasing your work cache to 10 days and see if you pick up the second task.


Counterintuitively, this can actually cause the opposite reaction on a lot of projects.

if you ask for "too much" work, some projects will just shut you out and tell you that no work is available, even when it is. I don't know why, I just know it happens. this is probably why he can't download work.

I would actually recommend keeping this value no larger than 2 days.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59405 - Posted: 4 Oct 2022 | 19:52:23 UTC - in response to Message 59404.

I was assuming that GPUGrid was the only project on his host.

I agree that increasing the value with more than one single project on the host is often deleterious.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59406 - Posted: 4 Oct 2022 | 20:01:57 UTC - in response to Message 59405.

I think GPUGRID is one of the projects that reacts negatively to having the value too high.

but no, based on his daily contributions for this host via FreeDC, he's contributing to several projects.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59407 - Posted: 4 Oct 2022 | 20:35:18 UTC - in response to Message 59405.

I was assuming that GPUGrid was the only project on his host.

at the time I was trying to download and crunch 2 Pythons: YES - no other projects running at that time.

Meanwhile, until the problem gets solved, I am running 1 CPU and 1 GPU project on this host.

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59408 - Posted: 4 Oct 2022 | 21:25:02 UTC - in response to Message 59397.
Last modified: 4 Oct 2022 | 21:34:41 UTC

Today I will deploy the changes tested last week in PythonGPUbeta to the PythonGPU app. The changes only affect Windows machines, and should result in smaller initial downloads and slightly lower memory requirements.

Thank you, abouh!
Let's try the new tasks :)

The disk space requirements for PythonGPU tasks probably need to be adjusted now, don't they?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59409 - Posted: 4 Oct 2022 | 21:56:21 UTC - in response to Message 59407.

I was assuming that GPUGrid was the only project on his host.

at the time I was trying to download and crunch 2 Pythons: YES - no other projects running at that time.

Meanwhile, until the problem gets solved, I am running 1 CPU and 1 GPU project on this host.


even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59410 - Posted: 5 Oct 2022 | 3:13:48 UTC - in response to Message 59409.

even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.

this is what I did anyway

jjch
Send message
Joined: 10 Nov 13
Posts: 101
Credit: 15,740,982,209
RAC: 1,059,532
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59414 - Posted: 9 Oct 2022 | 17:10:30 UTC

Good news since the recent changes to the Windows environment. I have seen a great increase of successful tasks. Seems that others have too as my ranking has dropped a bit.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59416 - Posted: 10 Oct 2022 | 6:01:45 UTC - in response to Message 59414.

So good to hear that!
____________

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59417 - Posted: 10 Oct 2022 | 10:36:38 UTC

When I paused a workunit and restarted BOINC, BOINC copied the pythongpu_windows_x86_64__cuda1131.txz file into the slot directory.
The file had already been extracted to pythongpu_windows_x86_64__cuda1131.tar and deleted.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59418 - Posted: 10 Oct 2022 | 11:24:20 UTC - in response to Message 59409.

Ian&Steve C. wrote:

even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.

as said before, I had done this change in the app_config.xml.

After a few days of running other projects on this host, I tried GPUGRID again.
In the end, I got 2 tasks downloaded (although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs; so obviously this tweak has no effect, for whatever reason).

Then, the next disappointment:
although 2 Pythons were downloaded, only one started, the other one stayed in "ready to start" status.
A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>".

In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".

FYI, the complete app_config reads as follows:

<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>


What could be the reason why neither the above mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml have the expected effect?

I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU - when crunching WCG OPNG tasks) in various projects - it always worked.
Why does it not work with GPUGRID on this particular host?
This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side).

Can anyone give me hints as to what I could do?

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59419 - Posted: 10 Oct 2022 | 11:28:53 UTC
Last modified: 10 Oct 2022 | 11:29:20 UTC

You can reduce the hard drive requirement by 1.93 GB by removing these files from E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib once windows_fix.py has finished disabling ASLR and making .nv_fatb sections read-only.
05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak
Can you distribute these DLLs already patched with the Python environment, or does the NVIDIA license agreement forbid it?
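
As a sanity check on the 1.93 GB figure, the sizes in the listing above can simply be added up. The sketch below does that, and also includes a small helper, `bak_total_bytes` (a hypothetical name, not part of the GPUGRID app), that sums the `*.dll_bak` files directly inside a given directory so the saving can be checked on a live slot before deleting anything.

```python
from pathlib import Path

# Sizes in bytes, copied from the directory listing above.
LISTED_SIZES = [
    70_403_584,   # cudnn_ops_train64_8.dll_bak
    88_405_504,   # cudnn_ops_infer64_8.dll_bak
    1_329_664,    # torch_cuda_cpp.dll_bak
    81_487_360,   # cudnn_cnn_train64_8.dll_bak
    129_872_896,  # cudnn_adv_infer64_8.dll_bak
    97_293_824,   # cudnn_adv_train64_8.dll_bak
    871_934_464,  # torch_cuda_cu.dll_bak
    736_718_848,  # cudnn_cnn_infer64_8.dll_bak
]


def bak_total_bytes(directory):
    """Sum the sizes of all *.dll_bak files directly inside `directory`."""
    return sum(p.stat().st_size for p in Path(directory).glob("*.dll_bak"))


total = sum(LISTED_SIZES)
print(f"{total} bytes = {total / 2**30:.2f} GiB")
```

The listed sizes add up to 2,077,446,144 bytes, about 1.93 GiB, which matches the figure quoted above.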

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59420 - Posted: 10 Oct 2022 | 11:50:45 UTC - in response to Message 59386.

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, while only 1 Python is running and I push the update button on the BOINC manager for fetching another Python, the BOINC event log tells me that no Pythons are available. Which is not the case though, as the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC.
BTW: the Python tasks uses only some 50% of the processor - which seems logical with 2 CPUs inside.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I now set the "store at least work" to 10 days, and under "store up to an additional" also 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: on the one running Python, which right now shows 45% progress after some 10 hours, it shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?


Meanwhile, the problem has become even worse:

After downloading 1 Python, it starts and in the BOINC manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine.

Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).

when I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts" which is not correct, there are some 150 available for download at the moment.

Can anyone give me advice how to get this problem solved?


You can add <fraction_done_exact/> to your app_config.xml

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59421 - Posted: 10 Oct 2022 | 12:20:24 UTC - in response to Message 59418.

Ian&Steve C. wrote:
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.

as said before, I had done this change in the app_config.xml.

After a few days of running other projects on this host, I tried GPUGRID again.
In the end, I got 2 tasks downloaded (although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs; so obviously this tweak has no effect, for whatever reason).

Then, the next disappointment:
although 2 Pythons were downloaded, only one started, the other one stayed in "ready to start" status.
A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>".

In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".

FYI, the complete app_config reads as follows:

<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>


What could be the reason why neither the above mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml have the expected effect?

I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU - when crunching WCG OPNG tasks) in various projects - it always worked.
Why does it not work with GPUGRID on this particular host?
This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side).

Can anyone give me hints as to what I could do?



several things.

first. after changing your app_config file to gpu_usage to 0.5, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but it will be treated as a 0.5 internally. to see this reflected in the task labeling you need to restart boinc.

next this line:
<max_concurrent>2</max_concurrent>

this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59422 - Posted: 10 Oct 2022 | 12:46:51 UTC - in response to Message 59421.

several things.

first. after changing your app_config file to gpu_usage to 0.5, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but it will be treated as a 0.5 internally. to see this reflected in the task labeling you need to restart boinc.

next this line:
<max_concurrent>2</max_concurrent>

this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.


after changing an app_config file, I always click "read config files" in the Options toolbar menu. As said before, I have worked with app_config.xml files very often for several years, so I am for sure doing it correctly.

I know that tasks downloaded as 1.0 GPU will keep this label.
Here, this is not the question though. Because I had set the 0.5 GPU even before I started downloading Pythons. Since then, 5 Pythons were downloaded (3 of them finished and uploaded, 1 active, another one waiting to start), all of them show 1.0 GPU, for unknown reason.

I know the meaning of
<max_concurrent>2</max_concurrent>
thanks for the hint anyway.

So, as said before: it's totally unclear to me why in this case the app_config does not work. I see this problem for the first time in all the years :-(
What I could still try, once the currently running Python has finished, is to restart BOINC. Maybe this helps; however, I doubt it.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59423 - Posted: 10 Oct 2022 | 12:49:28 UTC - in response to Message 59422.
Last modified: 10 Oct 2022 | 13:01:27 UTC

what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?

or check that the file is properly named as 'app_config.xml' and that there's no typo and located in your gpugrid project folder
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59424 - Posted: 10 Oct 2022 | 13:39:51 UTC - in response to Message 59423.
Last modified: 10 Oct 2022 | 13:40:38 UTC

what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?

or check that the file is properly named as 'app_config.xml' and that there's no typo and located in your gpugrid project folder

I now double- and triple-checked everything you mentioned above.
Also, no error/warning/complaint after clicking read config files.
So this really is a huge conundrum :-(

What I did now was spoof the GPU count info in the coproc_info.xml, which caused a total of 4 Pythons to download, but only 2 are running (okay, I want to be modest: 2 is better than 1).
However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.

So, the bottom line: I have no idea what else I can do to get this app_config to work the way it's supposed to.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59425 - Posted: 10 Oct 2022 | 13:45:00 UTC - in response to Message 59424.

but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location.

please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59426 - Posted: 10 Oct 2022 | 13:49:58 UTC - in response to Message 59424.
Last modified: 10 Oct 2022 | 13:51:50 UTC



What I did now was spoof the GPU count info in the coproc_info.xml, which caused a total of 4 Pythons to download, but only 2 are running (okay, I want to be modest: 2 is better than 1).
However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.

So, the bottom line: I have no idea what else I can do to get this app_config to work the way it's supposed to.


this is exactly what I would expect with the config you've described.

2x GPU spoofed = 4 tasks can download. if you have 2 running on a single GPU, then it's properly using 0.5 per GPU. the only way 2x can run on a single GPU is if the value 0.5 is being used. and only 2 running because of your max_concurrent statement (which you need for the spoofed GPU setup, otherwise it will try to run on the nonexistent second GPU and cause errors).

if you want to run 3x on a single GPU now, leave the GPU spoofing in place, change app_config to max_concurrent of 3, and change gpu_usage to 0.33

unless you know how to edit BOINC code and recompile a custom client, you will need to spoof the GPUs to get more tasks to download since the project enforces 2x tasks per GPU. there's no other solution.
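
The arithmetic described above can be written down as a small sketch. This is a simplified model of the behaviour described in this post, not BOINC's actual scheduler code; the function names are hypothetical, and the limit of 2 tasks per detected GPU is taken from the statement above that the project enforces 2x tasks per GPU.

```python
import math


def tasks_per_gpu(gpu_usage):
    """Number of tasks that fit on one GPU for a given <gpu_usage> value."""
    # The small epsilon guards against floating-point rounding in 1 / gpu_usage.
    return math.floor(1.0 / gpu_usage + 1e-9)


def expected_tasks(detected_gpus, gpu_usage, max_concurrent, per_gpu_limit=2):
    """Return (tasks that can be downloaded, tasks that will run at once).

    detected_gpus  -- GPUs the client reports (real plus spoofed)
    gpu_usage      -- <gpu_usage> from app_config.xml
    max_concurrent -- <max_concurrent> from app_config.xml
    per_gpu_limit  -- assumed project-side limit of tasks per detected GPU
    """
    download_limit = per_gpu_limit * detected_gpus
    running = min(max_concurrent,
                  detected_gpus * tasks_per_gpu(gpu_usage),
                  download_limit)
    return download_limit, running


# 2 detected GPUs (one spoofed), gpu_usage 0.5, max_concurrent 2
print(expected_tasks(2, 0.5, 2))
# same host, gpu_usage 0.33 and max_concurrent 3
print(expected_tasks(2, 0.33, 3))
```

With one real GPU and one spoofed GPU, max_concurrent should not exceed `tasks_per_gpu(gpu_usage)`, otherwise the client would try to start work on the nonexistent second GPU.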
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59427 - Posted: 10 Oct 2022 | 14:06:14 UTC - in response to Message 59425.

but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location.

please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"


sorry I had goofed before. The event log does complain, indeed:

10.10.2022 15:49:42 | GPUGRID | Found app_config.xml
10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml

however, this does not make any sense, because </app> is not missing, is it?

<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact>
<max_concurrent>3</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>

(I had added the <fraction_done_exact> meanwhile)
As already said, this is exactly the same app which I use on another host, and there it works. I copied it.

And yes, the file is contained in the GPUGRID project folder.


Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59428 - Posted: 10 Oct 2022 | 14:11:54 UTC - in response to Message 59427.

the line <fraction_done_exact> is not right. that's breaking your file.

it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59429 - Posted: 10 Oct 2022 | 14:29:55 UTC - in response to Message 59428.

the line <fraction_done_exact> is not right. that's breaking your file.

it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag

OMG, shame on me :-(

Many thanks for your valuable help.

What I am wondering is how this error could arise when the file was copied from another host (on which everything works fine).
Of course, it would have helped if the entry in the event log had been a little clearer; it was referring to something else.

But anyway, the mistake was clearly on my side, and thanks again for your patience :-)

BTW, now 3 Pythons are running concurrently. Still, the load on the Quadro P5000 is moderate, the load on the 2 Xeon E5 is 100% each.
I will have to observe whether it wouldn't make more sense to run only 2 Pythons.

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59430 - Posted: 10 Oct 2022 | 18:19:58 UTC - in response to Message 59397.
Last modified: 10 Oct 2022 | 18:23:58 UTC

Good day, abouh
I still see that unpacking is done in 2 steps:


".\7za.exe" x pythongpu_windows_x86_64__cuda1131.txz -y

".\7za.exe" x pythongpu_windows_x86_64__cuda1131.tar -y


Is there any problem with implementing a pipelined unpacking process?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59431 - Posted: 10 Oct 2022 | 18:35:03 UTC - in response to Message 59429.

The app_config.xml code you posted is not valid according to the XML validator.


An error has been found!

Click on to jump to the error. In the document, you can point at with your mouse to see the error message.
Errors in the XML document:
10: 3 The element type "fraction_done_exact" must be terminated by the matching end-tag "</fraction_done_exact>".

XML document:
1 <app_config>
2 <app>
3 <name>PythonGPU</name>
4 <fraction_done_exact>
5 <max_concurrent>3</max_concurrent>
6 <gpu_versions>
7 <gpu_usage>0.5</gpu_usage>
8 <cpu_usage>1.0</cpu_usage>
9 </gpu_versions>
10 </
app>
11 </app_config>

You should always check the syntax of your XML files with the validator.

https://www.xmlvalidation.com/index.php
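
The same kind of check can also be done locally. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `xml_error` is hypothetical. It only tests whether the text is well-formed XML, not whether BOINC understands every element in it.

```python
import xml.etree.ElementTree as ET


def xml_error(text):
    """Return None if `text` is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The file as posted, with the unclosed <fraction_done_exact> tag.
BROKEN = """<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact>
<max_concurrent>3</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>"""

# The corrected file uses the self-closing form of the tag.
FIXED = BROKEN.replace("<fraction_done_exact>", "<fraction_done_exact/>")

print("broken:", xml_error(BROKEN))
print("fixed: ", xml_error(FIXED))
```

To check a real file, read it first, for example `xml_error(open("app_config.xml").read())`. Like the online validator, the parser reports the error at the closing </app> tag, which is where the unclosed element is finally detected, rather than at the line that actually contains the mistake.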

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59432 - Posted: 10 Oct 2022 | 18:44:15 UTC - in response to Message 59431.

And you shouldn't have a mid-line break, as shown in line 10.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59435 - Posted: 11 Oct 2022 | 4:44:15 UTC

We "Boincers" are like cows. If there are no WUs, we move on to greener pastures. Forget about running several WUs on one GPU; give my GPUs something to run.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59436 - Posted: 11 Oct 2022 | 5:58:26 UTC - in response to Message 59431.

You should always check the syntax of your XML files with the validator.

https://www.xmlvalidation.com/index.php

Thanks, Keith, for the link. To be frank, I didn't know that such a validator existed.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59437 - Posted: 11 Oct 2022 | 6:16:40 UTC - in response to Message 59436.

It has been around and published since the early Seti days, when we all had to do a lot of XML writing for custom app_info's and app_config's.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59438 - Posted: 11 Oct 2022 | 12:39:14 UTC - in response to Message 59435.

You can run something like this

cd e:\Program Files\BOINC
e:
:loop
TIMEOUT /T 10
boinccmd.exe --project https://www.gpugrid.net update

TIMEOUT /T 120
goto loop

or write something like that for bash.
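
For hosts where neither batch nor bash is convenient, the same loop can be sketched in Python. This assumes `boinccmd` is on the PATH (or that the script is started from the BOINC directory) and reuses the `--project <URL> update` call from the batch file above; the function names and the 130-second default (the sum of the two TIMEOUT values) are only illustrative.

```python
import subprocess
import time


def build_update_command(project_url, boinccmd="boinccmd"):
    """Build the boinccmd call that asks the client to contact the project."""
    return [boinccmd, "--project", project_url, "update"]


def update_loop(project_url, interval_s=130, iterations=None, boinccmd="boinccmd"):
    """Request a project update every `interval_s` seconds.

    iterations=None keeps looping until the script is interrupted.
    """
    done = 0
    while iterations is None or done < iterations:
        # check=False: a failed request should not stop the loop.
        subprocess.run(build_update_command(project_url, boinccmd), check=False)
        done += 1
        time.sleep(interval_s)


# Example (runs until interrupted with Ctrl+C):
# update_loop("https://www.gpugrid.net")
```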

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59439 - Posted: 11 Oct 2022 | 14:06:07 UTC

hey abouh,

I've noticed some new task names containing 'demos25_2-0-1' this differs from the majority of the previous tasks labelled as just 'demos25-0-1'.

can you briefly explain what is different about these tasks? also, the past few days (and mostly with these _2 tasks) the majority of the tasks have been either "early ending" or pre-coded to run a smaller number of iterations leading to very short runtimes (on the order of minutes instead of hours).

Thanks :)
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59440 - Posted: 12 Oct 2022 | 5:05:45 UTC

I notice a big difference in VRAM use between various Python tasks and/or systems, eg:

- GPU with running 3 tasks simultaneously: 5.250 MB
- GPU with running 2 tasks simultaneously: 5.012 MB
- GPU with running 2 tasks simultaneously: 8.055 MB

with the third one cited above I was lucky, VRAM of the GPU is 8.142 MB

(FYI, all values including a few hundred MB for the monitor).

Has anyone else had the same experience?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59441 - Posted: 12 Oct 2022 | 10:38:02 UTC - in response to Message 59430.

Hello Aleksey,

Yes, I struggled a bit with the single command solution. A BOINC job requires tasks to be specified in the following way:

<task>
<application>XXXXXX.exe</application>
<command_line>XXXXXXXXXXXXX</command_line>
</task>


And this is the command that should work, right?

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.txz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar



Isn't it actually using 7za 2 times? After some testing, the conclusion I arrived at is that in principle it actually requires 2 BOINC tasks to do it, because 7za decompresses .txz to .tar, and then .tar to plain files. The only way to do it in one task would be to compress the files into a format that 7za can decompress in a single call (like zip, but we already discussed that zipped files are too big).

Does anyone know if that reasoning is correct? Can BOINC wrappers execute commands like the one Aleksey suggested?
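
For illustration only, here is what a single-pass unpack looks like outside of 7za. The sketch uses Python's standard `tarfile` module in streaming mode, which decompresses the xz layer and unpacks the tar layer in one go without writing an intermediate .tar file. It does not answer whether the BOINC wrapper can run a shell pipeline, and it assumes a Python interpreter is available before the archive is unpacked, which is not the case in the current setup where Python itself ships inside the archive. The function name is hypothetical.

```python
import tarfile


def extract_txz(archive_path, dest_dir):
    """Decompress and unpack a .txz (xz-compressed tar) in one streaming pass."""
    # "r|xz" reads the archive as a forward-only stream, so no
    # intermediate .tar file is ever written to disk.
    with tarfile.open(archive_path, mode="r|xz") as tar:
        tar.extractall(path=dest_dir)
```

This is the same idea as the piped 7za command above: the second stage consumes the output of the first directly instead of reading it back from disk.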
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59442 - Posted: 12 Oct 2022 | 10:54:24 UTC - in response to Message 59439.
Last modified: 12 Oct 2022 | 14:21:04 UTC

Hello, of course, let me explain.

The task names "demos25" and "demos25_2" belong to 2 different variants of the same experiment. In particular, the selection of the agents sent to GPUGrid is different.

In both experiments the AI agents sent to GPUGrid learn using Reinforcement Learning, a machine learning technique that allows them to learn specific behaviours from interactions with their simulated environment (actually to make it faster they interact with 32 copies of the environment at the same time, the famous 32 threads). Also in both cases, when the agents "discover" something relevant, the job finishes and the info is sent back to be shared with the rest of the population.

The difference between "demos25" and "demos25_2" experiments is that in "demos25_2" I am experimenting with a more careful selection of the environment regions each agent is targeted to explore. I try to direct each agent to explore a different region of the environment (or with little overlap with the rest). The result is that agents in "demos25_2" are more likely to find something relevant that the rest of the population has not found yet and therefore more likely to finish earlier. The "demos25" experiment, contrarily, uses a more "brute force" approach, and as the population grows it becomes more difficult for new agents to discover new things.

I hope the explanation makes sense. Let me know if you have any other questions; I will try to answer them as well. There is also an experiment "demos25_3" in progress, which is similar to "demos25_2".
____________

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59443 - Posted: 12 Oct 2022 | 11:33:22 UTC - in response to Message 59442.
Last modified: 12 Oct 2022 | 11:34:44 UTC

Each task patches several dlls to disable ASLR and make .nv_fatb sections read-only and leaves 1.93 GB of backup files.
05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak
Can patched dlls be included in pythongpu_windows_x86_64__cuda1131.txz?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59444 - Posted: 12 Oct 2022 | 12:22:35 UTC - in response to Message 59440.

I notice a big difference in VRAM use between various Python tasks and/or systems, eg:

- GPU with running 3 tasks simultaneously: 5.250 MB
- GPU with running 2 tasks simultaneously: 5.012 MB
- GPU with running 2 tasks simultaneously: 8.055 MB

with the third one cited above I was lucky, VRAM of the GPU is 8.142 MB

(FYI, all values including a few hundred MB for the monitor).

Has anyone else had the same experience?


more powerful GPUs will use more VRAM than less powerful GPUs, it scales roughly with core count of the GPU. so a 3090 would use more VRAM than say a 1050Ti on the same exact task. it's just the way it works when the GPU sets up the task, if the task has to scale to 10,000 cores instead of 2,000, it needs to use more memory.

____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59445 - Posted: 12 Oct 2022 | 14:02:02 UTC - in response to Message 59444.

more powerful GPUs will use more VRAM than less powerful GPUs, it scales roughly with core count of the GPU.

okay, I see. Many thanks for explaining :-)

One thing that's a pity here is that the GPU with the largest VRAM (Quadro P5000: 16GB) has the lowest number of cores (2.560) :-(

But, as so many times: one cannot have everything in life :-)

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Message 59446 - Posted: 12 Oct 2022 | 15:14:53 UTC - in response to Message 59445.

Is there anyone here with an NVIDIA A100 80GB?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 59447 - Posted: 12 Oct 2022 | 16:06:25 UTC - in response to Message 59446.

Is there anyone here with an NVIDIA A100 80GB?


only those with $10,000 to spare to use for free on DC. so likely no one ;) lol

faster GPUs don't provide much benefit for these tasks since they are so CPU bound. sure there's a lot of VRAM on this card, and maybe you could theoretically spin up 10-15 tasks on a single card, but unless you have A LOT of CPU power and bandwidth to feed it, you're gonna hit another bottleneck before you can hope to benefit from running that many tasks.

just 6x tasks maxes out my EPYC 7443P 48 threads @ 3.9GHz.

maybe in the future the project can get these tasks to the point where they lean more on the GPU tensor cores and a more GPU only environment, but for now it's mostly a CPU environment with a small contribution by the GPU.
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59449 - Posted: 13 Oct 2022 | 5:57:44 UTC
Last modified: 13 Oct 2022 | 6:03:37 UTC

just wanted to download another Python task, but BOINC event log tells me the following:

13.10.2022 07:49:38 | GPUGRID | Message from server: Python apps for GPU hosts needs 1296.10MB more disk space. You currently have 32082.50 MB available and it needs 33378.60 MB.

I wonder why a Python needs 33.378 MB free disk space.
Experience has shown that a Python takes some 8 GB disk space when being processed. So how come it says it needs 33GB ?

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Message 59450 - Posted: 13 Oct 2022 | 11:42:16 UTC - in response to Message 59449.
Last modified: 13 Oct 2022 | 11:46:15 UTC


Experience has shown that a Python takes some 8 GB disk space when being processed. So how come it says it needs 33GB ?

Check my previous post about disk space usage during the PythonGPU startup stage.
Previously: tar.gz >> slotX (2,66 GiB) >> tar (5,48 GiB) >> app files (~8,13 GiB) = 16,27 GiB (since the archives (tar.gz and tar) were not deleted).
Now, after some improvements were implemented, peak consumption is about 13,61 GiB, dropping to ~8,13 GiB after the startup stage.
In any case, the requirement seems to need adjustment.
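
To make the arithmetic above easier to follow, here is a small Python sketch that tracks disk usage step by step. The sizes are the ones quoted in this post; the step lists and the "piped" flow (unpacking straight from the compressed archive, then deleting it) are my own assumptions about the order of operations, not something taken from the project's job definition.

```python
def peak_disk_usage(steps):
    """Return (peak, final) usage in GiB for a list of (label, delta_gib) steps."""
    used = 0.0
    peak = 0.0
    for _label, delta in steps:
        used += delta          # positive: file written, negative: file deleted
        peak = max(peak, used)
    return round(peak, 2), round(used, 2)


# Old behaviour: both archives stay on disk next to the unpacked files.
old_flow = [("copy archive", 2.66), ("unpack to .tar", 5.48), ("unpack files", 8.13)]

# Current behaviour: each archive is deleted once it has been unpacked.
new_flow = [
    ("copy archive", 2.66),
    ("unpack to .tar", 5.48),
    ("delete archive", -2.66),
    ("unpack files", 8.13),
    ("delete .tar", -5.48),
]

# Assumed piped variant: no intermediate .tar is ever written.
piped_flow = [("copy archive", 2.66), ("unpack files", 8.13), ("delete archive", -2.66)]

for name, flow in [("old", old_flow), ("new", new_flow), ("piped", piped_flow)]:
    print(name, peak_disk_usage(flow))
```

Under these assumptions the old flow peaks at 16,27 GiB, the current one at 13,61 GiB, and a piped extraction would peak at about 10,79 GiB.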

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59451 - Posted: 13 Oct 2022 | 12:04:58 UTC - in response to Message 59450.
Last modified: 13 Oct 2022 | 12:05:21 UTC

In any case, it seems to require adjustment.

I agree

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,291,281,825
RAC: 265,143
Level
Met
Message 59452 - Posted: 13 Oct 2022 | 12:10:16 UTC - in response to Message 59441.
Last modified: 13 Oct 2022 | 12:14:24 UTC


Isn't it actually using 7za 2 times? After some testing, the conclusion I arrived to is that in principle it actually requires 2 BOINC tasks to do it

Yeah, it seems you are right.

Try using this:

<task>
<application>C:\Windows\System32\cmd.exe</application>
<command_line>/C ".\7za.exe x pythongpu_windows_x86_64__cuda1131.txz -so | .\7za.exe x -aoa -si -ttar"</command_line>
</task>
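
For comparison, the same single-pass idea can be sketched in Python. This is only an illustration of streaming extraction, not what the BOINC wrapper runs: it assumes the archive is an xz-compressed tar and uses the standard library's `tarfile` stream mode, so no intermediate .tar is written to disk. The `demo` helper and its file names are made up for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def stream_extract(archive_path, dest_dir):
    """Unpack a .tar.xz (.txz) archive in one pass, with no intermediate .tar."""
    # Mode "r|xz" reads the file as a forward-only stream and decompresses on the fly.
    with tarfile.open(archive_path, mode="r|xz") as archive:
        archive.extractall(path=dest_dir)


def demo():
    """Build a tiny .txz, stream-extract it, and return the extracted text."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        payload = tmp / "hello.txt"
        payload.write_text("hello")
        archive = tmp / "sample.txz"
        with tarfile.open(archive, mode="w:xz") as tar:
            tar.add(payload, arcname="hello.txt")
        out_dir = tmp / "out"
        out_dir.mkdir()
        stream_extract(archive, out_dir)
        return (out_dir / "hello.txt").read_text()


print(demo())
```

The trade-off is the same as with the piped 7za command: less peak disk usage, at the cost of not being able to seek inside the archive.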

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 59453 - Posted: 14 Oct 2022 | 9:21:33 UTC - in response to Message 59443.

Patching seemed to be required to run as many threads with pytorchrl as these jobs do. Otherwise Windows used a lot of memory for every new thread. The script that does the patching is relatively fast, so doing it locally would not save a lot of time.

However, are you saying that after the patching some files could be deleted to further optimise memory use? If so, I can look into it. Do you mean these .dll_bak files? I am not very familiar with Windows...

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 59454 - Posted: 14 Oct 2022 | 9:27:18 UTC - in response to Message 59449.

Does anyone know if these requirements are estimated by BOINC and adjusted over time like completion time? or if manual adjustment is required?
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 59455 - Posted: 14 Oct 2022 | 12:33:28 UTC - in response to Message 59454.

my runtime estimates have come down to basically reasonable and real levels now. so i think it will adjust on its own over time.
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Message 59456 - Posted: 14 Oct 2022 | 14:53:44 UTC - in response to Message 59455.

abouh's message 59454 was in response to a question about disk storage requirements. No, they won't adjust themselves over time: the amount of disk space required by the task is set by the server, and the amount available to the client is calculated from readings taken of the current state of the host computer. They will only change if the user adjusts the hardware or BOINC client options, or the project staff adjust the job specifications passed to the workunit generator.

On the subject of runtimes: the (calculated) runtime estimation relies on just three things:
The job speed (sent by the server in the <app_version> specification).
The job size (again set on the server)
and the Duration Correction Factor (dynamically adjusted by the client)

SPEED seems to have fallen by approaching a half over the last month, but I haven't currently got a job I can verify that for.
SIZE has remained the same while I've been monitoring it.
DCF will have fallen dramatically - mine is now below 1
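
As a rough illustration of how those three values combine, here is a minimal Python sketch. It assumes the usual form of the estimate (size divided by speed, scaled by DCF); the function name and the example numbers are hypothetical, not values read from a real task.

```python
def estimated_runtime_seconds(job_size_fpops, job_speed_flops, dcf=1.0):
    """Estimate runtime as size / speed, scaled by the Duration Correction Factor."""
    if job_speed_flops <= 0:
        raise ValueError("job speed must be positive")
    return job_size_fpops / job_speed_flops * dcf


# Hypothetical numbers, chosen only to show how each factor moves the estimate.
size = 1e18    # job size set on the server (floating point operations)
speed = 1e13   # job speed sent by the server (operations per second)

print(estimated_runtime_seconds(size, speed))           # baseline estimate
print(estimated_runtime_seconds(size, speed / 2))       # halved SPEED doubles it
print(estimated_runtime_seconds(size, speed, dcf=0.5))  # DCF below 1 shrinks it
```

So if SPEED really has dropped by about half while SIZE stayed the same, the raw estimate would roughly double, and a falling DCF pulls it back down.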

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Message 59457 - Posted: 15 Oct 2022 | 19:13:21 UTC

What can this output mean?
e00003a00008-ABOU_rnd_ppod_expand_demos25_9-0-1-RND2053

Update 464, num samples collected 118784, FPS 344
Algorithm: loss 0.1224, value_loss 0.0002, ivalue_loss 0.0113, rnd_loss 0.0307, action_loss 0.0846, entropy_loss 0.0043, mean_intrinsic_rewards 0.0421, min_intrinsic_rewards 0.0084, max_intrinsic_rewards 0.1857, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 360.6000, t 649.8340, UnclippedReward 0.0000, VisitedRooms 1.0000

REWARD DEMOS 25, INTRINSIC DEMOS 25, RHO 0.05, PHI 0.05, REWARD THRESHOLD 0.0, MAX DEMO REWARD -inf, INTRINSIC THRESHOLD 1000

FRAMES TO AVOID: 0

Update 465, num samples collected 122880, FPS 347
Algorithm: loss 0.1329, value_loss 0.0002, ivalue_loss 0.0098, rnd_loss 0.0317, action_loss 0.0955, entropy_loss 0.0043, mean_intrinsic_rewards 0.0414, min_intrinsic_rewards 0.0082, max_intrinsic_rewards 0.1516, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 341.3529, t 658.7952, UnclippedReward 0.0000, VisitedRooms 1.00000

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 59458 - Posted: 15 Oct 2022 | 22:06:06 UTC - in response to Message 59457.

Nothing of any meaning or consequence for you. Pertinent only to the researcher.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 59459 - Posted: 16 Oct 2022 | 7:34:24 UTC - in response to Message 59457.

These are just the logs of the algorithm, printing out the relevant metrics during agent training.
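
For anyone who wants to follow these numbers on their own host, a minimal Python sketch for reading the log lines is below. The function names are made up for illustration and the parsing only assumes the line layout shown in the post above; it is not part of the application.

```python
import re

UPDATE_RE = re.compile(r"Update (\d+), num samples collected (\d+), FPS (\d+)")


def parse_update(line):
    """Parse an 'Update N, num samples collected M, FPS K' line, or return None."""
    match = UPDATE_RE.search(line)
    if match is None:
        return None
    update, samples, fps = (int(group) for group in match.groups())
    return {"update": update, "samples": samples, "fps": fps}


def parse_metrics(line):
    """Parse 'Prefix: name value, name value, ...' into a dict of floats."""
    _prefix, _, body = line.partition(":")
    metrics = {}
    for item in body.split(","):
        name, value = item.split()
        metrics[name] = float(value)
    return metrics


first = parse_update("Update 464, num samples collected 118784, FPS 344")
second = parse_update("Update 465, num samples collected 122880, FPS 347")
print(second["samples"] - first["samples"])  # samples gathered between two updates
print(parse_metrics("Episodes: TrainReward 0.0000, l 360.6000, t 649.8340"))
```

In the two updates quoted above, the sample counter advances by 4096 per update.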
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59460 - Posted: 17 Oct 2022 | 16:23:51 UTC

I now have had 5 tasks in a row which failed after some 2.100 secs, one after the other, within about half an hour.

https://www.gpugrid.net/result.php?resultid=33098926
https://www.gpugrid.net/result.php?resultid=33100629
https://www.gpugrid.net/result.php?resultid=33100675
https://www.gpugrid.net/result.php?resultid=33100715
https://www.gpugrid.net/result.php?resultid=33100745

anyone any idea what is the problem?

On the same host, another task has been running for 22 hours now, but I have stopped download of new tasks until it's clear what's going on.

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 114,775,136
RAC: 15,420
Level
Cys
Message 59461 - Posted: 17 Oct 2022 | 16:36:51 UTC
Last modified: 17 Oct 2022 | 16:37:31 UTC

I have seen continuously failing tasks starting today. According to the stderr_txt file, I reckon there might be at least two, possibly related, errors.

File "C:\ProgramData\BOINC\slots\5\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\5\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\5\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'


  • KeyError: 'StateEmbeddings'
  • AttributeError: 'GWorker' object has no attribute 'batches'
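
One possible way the two errors relate, sketched in Python. This is a guess based only on the traceback (the full stderr later in the thread reports the second error as raised "during handling" of the first); the `Worker` class below is a hypothetical stand-in, not pytorchrl code. The idea: the missing `batches` attribute is an expected condition on the first call and is caught, and the error that actually stops the job is the KeyError raised while collecting data.

```python
class Worker:
    """Hypothetical worker that creates its batch iterator lazily."""

    def __init__(self, samples):
        self.samples = samples  # note: 'batches' is deliberately not set here

    def collect(self):
        # Fails like the reported jobs if the expected key is missing.
        return iter([info["StateEmbeddings"] for info in self.samples])

    def get_data(self):
        try:
            return next(self.batches)
        except (AttributeError, StopIteration):
            # No iterator yet (or it is exhausted): collect data and build one.
            self.batches = self.collect()
            return next(self.batches)


def failure_chain(worker):
    """Return the exception type names for a failed call, last raised first."""
    try:
        worker.get_data()
    except Exception as exc:
        chain = []
        while exc is not None:
            chain.append(type(exc).__name__)
            exc = exc.__context__  # the exception that was being handled
        return chain
    return []


print(failure_chain(Worker([{"Other": 1}])))
print(failure_chain(Worker([{"StateEmbeddings": [0.0]}])))
```

If that reading is right, the AttributeError is a harmless side effect and the KeyError points at the actual problem in the job's data.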

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59462 - Posted: 17 Oct 2022 | 17:00:22 UTC - in response to Message 59461.

*KeyError: 'StateEmbeddings'
*AttributeError: 'GWorker' object has no attribute 'batches'

exactly the same thing I notice on all my failed tasks.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Level
Ala
Message 59463 - Posted: 17 Oct 2022 | 17:12:31 UTC

Same here.

AttributeError: 'GWorker' object has no attribute 'batches'

mrchips
Joined: 9 May 21
Posts: 16
Credit: 1,412,539,259
RAC: 47,170
Level
Met
Message 59464 - Posted: 17 Oct 2022 | 17:39:44 UTC

my latest WU ended with a computation error
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 59465 - Posted: 17 Oct 2022 | 17:49:13 UTC - in response to Message 59460.

Your first task link shows 4 attempts at retrieving the necessary Python libraries and failing.

But instead of just stopping right there, it looks like it tried to compute anyway with the missing 'batches' attribute, and all the subsequent tasks also failed because of the missing batches element.

It seems the error handling does not branch to a proper halt early enough in the task to stop the computation before it wastes any more time.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 59466 - Posted: 17 Oct 2022 | 17:57:48 UTC
Last modified: 17 Oct 2022 | 18:11:29 UTC

Six tasks, all in a row. Errored out. Seven now and another in the works.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59467 - Posted: 17 Oct 2022 | 18:16:48 UTC

now the same problem on another host :-(
https://www.gpugrid.net/result.php?resultid=33101249

so, as seen by other members, too: all tasks which were downloaded within the past several hours seem to be faulty.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Level
Ala
Message 59468 - Posted: 17 Oct 2022 | 19:31:40 UTC

I joined yesterday and have had 13 tasks fail in a row, all with the
AttributeError: 'GWorker' object has no attribute 'batches'.

Is this a failed installation? Should I try to reinstall this BOINC project from scratch?

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59469 - Posted: 17 Oct 2022 | 19:39:44 UTC - in response to Message 59468.


Is this a failed installation? Should I try to reinstall this BOINC project from scratch?

in view of what was said above, the current tasks are probably faulty.
No need to reinstall, I guess

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Message 59470 - Posted: 17 Oct 2022 | 21:09:31 UTC

Yes - just received and returned result 33101290, on a machine which regularly returns good results.

That was replication _6 of a WU which everyone else had failed - a sure sign that the problem was with the workunit, not the host processing it.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 59471 - Posted: 18 Oct 2022 | 5:33:13 UTC

Forty-six failed WUs? Please stop sending them until the problem is resolved.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59472 - Posted: 18 Oct 2022 | 6:40:14 UTC - in response to Message 59471.

Forty-six failed WUs? Please stop sending them until the problem is resolved.

+ 1

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 59473 - Posted: 18 Oct 2022 | 7:14:37 UTC - in response to Message 59472.

Forty-six failed WUs? Please stop sending them until the problem is resolved.

+ 1


Sorry. After writing the post I looked at the other computer and it had downloaded another. It lasted three minutes or so. It was still in the unzipping process. I cannot understand the txt files, so could someone who can please check them to see what is going on?

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Level
Ala
Message 59474 - Posted: 18 Oct 2022 | 7:26:12 UTC - in response to Message 59472.

+1

33 fails in a row. I'll set this project to inactive and wait for a solution.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59475 - Posted: 18 Oct 2022 | 7:28:58 UTC - in response to Message 59473.

I cannot understand the txt files, so could someone who can please check them to see what is going on?

the tasks are wrongly configured. Don't download them for the time being.
I guess we will get some kind of "go ahead" here once the problem is solved on the project-side.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 59477 - Posted: 18 Oct 2022 | 7:44:11 UTC
Last modified: 18 Oct 2022 | 7:54:40 UTC

Hello, thank you for reporting the job errors. Sorry to all, there was an error on my side in setting up a batch of experiment agents. The error is due to the specific Python script of this batch and is not related to the application itself. I have just fixed it, and the new jobs should run correctly. Unfortunately, some already submitted jobs are bound to fail… I apologise for the inconvenience. They will fail shortly after starting, as reported, so not a lot of compute will be wasted.
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59478 - Posted: 18 Oct 2022 | 9:00:27 UTC

abouh,

could you also please make an adjustment (downwards) to the free disk space requirement of 33GB when downloading a Python task?

see my above Message 59449.

Many thanks :-)

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 59479 - Posted: 18 Oct 2022 | 9:21:50 UTC - in response to Message 59478.

Hello! I have checked and the disk space used by the jobs is set to 35e9 bytes.

<rsc_disk_bound>35e9</rsc_disk_bound>


I will change it to 20e9 first; let me know if it helps. I can decrease it further in the future if necessary.
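
As a quick sanity check, the 35e9 bytes setting lines up with the 33378.60 MB in the server message quoted earlier in the thread if the client's "MB" are taken as 1024 × 1024 bytes. That unit is my assumption, and the helper names below are only for illustration:

```python
MIB = 1024 * 1024  # assumed size of the "MB" shown in the BOINC event log


def disk_bound_mib(rsc_disk_bound_bytes):
    """Convert an rsc_disk_bound value in bytes to MiB."""
    return rsc_disk_bound_bytes / MIB


def shortfall_mib(rsc_disk_bound_bytes, available_mib):
    """Return how much disk space is missing, or 0.0 if there is enough."""
    return max(0.0, disk_bound_mib(rsc_disk_bound_bytes) - available_mib)


available = 32082.50  # MB available, as reported in the event log message

print(round(disk_bound_mib(35e9), 2))            # requirement with the old setting
print(round(shortfall_mib(35e9, available), 2))  # missing space with the old setting
print(round(disk_bound_mib(20e9), 2))            # requirement with the new setting
print(shortfall_mib(20e9, available))            # no shortfall on that host
```

With 20e9 the requirement comes out at roughly 19073 MB, which still leaves headroom over the ~13,61 GiB peak reported above.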


____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59480 - Posted: 18 Oct 2022 | 9:36:09 UTC - in response to Message 59479.

Hello! I have checked and the disk space used by the jobs is set to 35e9 bytes.

<rsc_disk_bound>35e9</rsc_disk_bound>


I will change it to 20e9 first; let me know if it helps. I can decrease it further in the future if necessary.


Thanks, Abouh, for your quick reaction. The change will definitely help - at least in my case with limited disk space due to Ramdisk.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Level
Ala
Message 59481 - Posted: 18 Oct 2022 | 19:52:56 UTC - in response to Message 59477.

Hello, thank you for reporting the job errors. Sorry to all, there was an error [...] I have just fixed it, and the new jobs should be running correctly. Unfortunately, some already submitted jobs are bound to fail…


The problem is not fixed, I still get tasks that fail:
AttributeError: 'GWorker' object has no attribute 'batches'

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Message 59482 - Posted: 18 Oct 2022 | 21:03:38 UTC

I have received my first new working task.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 59483 - Posted: 19 Oct 2022 | 5:15:25 UTC

I wish I could get a sniff also.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Level
Ala
Message 59484 - Posted: 19 Oct 2022 | 8:28:42 UTC

I got another one this morning, still no luck: the task failed like all the others before. Is there something that I have to change on my side?
This is the log file:

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
09:56:38 (11564): wrapper (7.9.26016): starting
09:56:38 (11564): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688

Everything is Ok

Size: 6410311680
Compressed: 1976180228
09:58:33 (11564): .\7za.exe exited; CPU time 111.125000
09:58:33 (11564): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
09:58:34 (11564): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
09:58:34 (11564): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII

Everything is Ok

Files: 38141
Size: 6380353601
Compressed: 6410311680
10:01:10 (11564): .\7za.exe exited; CPU time 41.140625
10:01:10 (11564): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
10:01:11 (11564): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
10:01:11 (11564): wrapper: running python.exe (run.py)
Starting!!
Windows fix!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
10:05:44 (11564): python.exe exited; CPU time 2660.984375
10:05:44 (11564): app exit status: 0x1
10:05:44 (11564): called boinc_finish(195)
0 bytes in 0 Free Blocks.
442 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 6550134 bytes.
Dumping objects ->
{10837} normal block at 0x0000024DEACAF4D0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{10796} normal block at 0x0000024DEACAF310, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{10785} normal block at 0x0000024DEACAEA50, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10774} normal block at 0x0000024DEACAEF20, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{10763} normal block at 0x0000024DEACAEB30, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10752} normal block at 0x0000024DEACAF3F0, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{10671} normal block at 0x0000024DEAC990A0, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {10668} normal block at 0x0000024DEACB0A60, 8 bytes long.
Data: < {&#236;M > 00 00 7B EC 4D 02 00 00
{10030} normal block at 0x0000024DEAC9B890, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{9426} normal block at 0x0000024DEACB0600, 8 bytes long.
Data: < &#199;&#202;&#234;M > 80 C7 CA EA 4D 02 00 00
..\zip\boinc_zip.cpp(122) : {545} normal block at 0x0000024DEACB12E0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{532} normal block at 0x0000024DEACA99C0, 32 bytes long.
Data: <&#208;&#225;&#202;&#234;M &#192;&#229;&#202;&#234;M > D0 E1 CA EA 4D 02 00 00 C0 E5 CA EA 4D 02 00 00
{531} normal block at 0x0000024DEACAE4E0, 52 bytes long.
Data: < r &#205;&#205; > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x0000024DEACAE080, 43 bytes long.
Data: < p &#205;&#205; > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{521} normal block at 0x0000024DEACAE5C0, 44 bytes long.
Data: < &#205;&#205;&#225;&#229;&#202;&#234;M > 01 00 00 00 00 00 CD CD E1 E5 CA EA 4D 02 00 00
{516} normal block at 0x0000024DEACAE1D0, 44 bytes long.
Data: < &#205;&#205;&#241;&#225;&#202;&#234;M > 01 00 00 00 00 00 CD CD F1 E1 CA EA 4D 02 00 00
{506} normal block at 0x0000024DEACB39A0, 16 bytes long.
Data: < &#227;&#202;&#234;M > 20 E3 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{505} normal block at 0x0000024DEACAE320, 40 bytes long.
Data: <&#160;9&#203;&#234;M input.zi> A0 39 CB EA 4D 02 00 00 69 6E 70 75 74 2E 7A 69
{498} normal block at 0x0000024DEACB3950, 16 bytes long.
Data: <&#232;)&#203;&#234;M > E8 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{497} normal block at 0x0000024DEACB3450, 16 bytes long.
Data: <&#192;)&#203;&#234;M > C0 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x0000024DEACB3770, 16 bytes long.
Data: < )&#203;&#234;M > 98 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x0000024DEACB37C0, 16 bytes long.
Data: <p)&#203;&#234;M > 70 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{494} normal block at 0x0000024DEACB3900, 16 bytes long.
Data: <H)&#203;&#234;M > 48 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x0000024DEACB3A40, 16 bytes long.
Data: < )&#203;&#234;M > 20 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{491} normal block at 0x0000024DEACB35E0, 16 bytes long.
Data: <8&#250;&#202;&#234;M > 38 FA CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{490} normal block at 0x0000024DEACAA6E0, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{489} normal block at 0x0000024DEACB2E60, 16 bytes long.
Data: < &#250;&#202;&#234;M > 10 FA CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{488} normal block at 0x0000024DEAC9A460, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{487} normal block at 0x0000024DEACB31D0, 16 bytes long.
Data: <&#232;&#249;&#202;&#234;M > E8 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{486} normal block at 0x0000024DEACAA3E0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{485} normal block at 0x0000024DEACB3180, 16 bytes long.
Data: <&#192;&#249;&#202;&#234;M > C0 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{484} normal block at 0x0000024DEACB3A90, 16 bytes long.
Data: < &#249;&#202;&#234;M > 98 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{483} normal block at 0x0000024DEACB2DC0, 16 bytes long.
Data: <p&#249;&#202;&#234;M > 70 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x0000024DEACB3720, 16 bytes long.
Data: <H&#249;&#202;&#234;M > 48 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{481} normal block at 0x0000024DEACB3040, 16 bytes long.
Data: < &#249;&#202;&#234;M > 20 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x0000024DEACB36D0, 16 bytes long.
Data: <&#248;&#248;&#202;&#234;M > F8 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{479} normal block at 0x0000024DEACA9DE0, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{478} normal block at 0x0000024DEACB3C70, 16 bytes long.
Data: <&#208;&#248;&#202;&#234;M > D0 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{477} normal block at 0x0000024DEACA9F00, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{476} normal block at 0x0000024DEACB39F0, 16 bytes long.
Data: <&#168;&#248;&#202;&#234;M > A8 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{475} normal block at 0x0000024DEACAA2C0, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{474} normal block at 0x0000024DEACB3B80, 16 bytes long.
Data: < &#248;&#202;&#234;M > 80 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{473} normal block at 0x0000024DEACAF880, 480 bytes long.
Data: < ;&#203;&#234;M &#192;&#162;&#202;&#234;M > 80 3B CB EA 4D 02 00 00 C0 A2 CA EA 4D 02 00 00
{472} normal block at 0x0000024DEACB3AE0, 16 bytes long.
Data: < )&#203;&#234;M > 00 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{471} normal block at 0x0000024DEACB3310, 16 bytes long.
Data: <&#216;(&#203;&#234;M > D8 28 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x0000024DEACB3590, 16 bytes long.
Data: <&#176;(&#203;&#234;M > B0 28 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x0000024DEACAE160, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{468} normal block at 0x0000024DEACB3630, 16 bytes long.
Data: <&#248;'&#203;&#234;M > F8 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{467} normal block at 0x0000024DEACB2FF0, 16 bytes long.
Data: <&#208;'&#203;&#234;M > D0 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x0000024DEACB3B30, 16 bytes long.
Data: <&#168;'&#203;&#234;M > A8 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x0000024DEACB3400, 16 bytes long.
Data: < '&#203;&#234;M > 80 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{464} normal block at 0x0000024DEACB34F0, 16 bytes long.
Data: <X'&#203;&#234;M > 58 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x0000024DEACB38B0, 16 bytes long.
Data: <0'&#203;&#234;M > 30 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{462} normal block at 0x0000024DEACB3220, 16 bytes long.
Data: < '&#203;&#234;M > 10 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x0000024DEACB32C0, 16 bytes long.
Data: <&#232;&&#203;&#234;M > E8 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x0000024DEACAA500, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{459} normal block at 0x0000024DEACB3130, 16 bytes long.
Data: <&#192;&&#203;&#234;M > C0 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{458} normal block at 0x0000024DEACAE0F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{457} normal block at 0x0000024DEACB3270, 16 bytes long.
Data: < &&#203;&#234;M > 08 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{456} normal block at 0x0000024DEACB3BD0, 16 bytes long.
Data: <&#224;%&#203;&#234;M > E0 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x0000024DEACB3860, 16 bytes long.
Data: <&#184;%&#203;&#234;M > B8 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x0000024DEACB3540, 16 bytes long.
Data: < %&#203;&#234;M > 90 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x0000024DEACB2D20, 16 bytes long.
Data: <h%&#203;&#234;M > 68 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{452} normal block at 0x0000024DEACB2F50, 16 bytes long.
Data: <@%&#203;&#234;M > 40 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x0000024DEACB2FA0, 16 bytes long.
Data: < %&#203;&#234;M > 20 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x0000024DEACB3680, 16 bytes long.
Data: <&#248;$&#203;&#234;M > F8 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x0000024DEACB3810, 16 bytes long.
Data: <&#208;$&#203;&#234;M > D0 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x0000024DEACAE780, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{447} normal block at 0x0000024DEACB2E10, 16 bytes long.
Data: < $&#203;&#234;M > 18 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{446} normal block at 0x0000024DEACB2F00, 16 bytes long.
Data: <&#240;#&#203;&#234;M > F0 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x0000024DEACB2D70, 16 bytes long.
Data: <&#200;#&#203;&#234;M > C8 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x0000024DEACB33B0, 16 bytes long.
Data: <&#160;#&#203;&#234;M > A0 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{443} normal block at 0x0000024DEACB3360, 16 bytes long.
Data: <x#&#203;&#234;M > 78 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x0000024DEACB34A0, 16 bytes long.
Data: <P#&#203;&#234;M > 50 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{441} normal block at 0x0000024DEACB04C0, 16 bytes long.
Data: <0#&#203;&#234;M > 30 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x0000024DEACB08D0, 16 bytes long.
Data: < #&#203;&#234;M > 08 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x0000024DEACAA380, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{438} normal block at 0x0000024DEACB02E0, 16 bytes long.
Data: <&#224;"&#203;&#234;M > E0 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{437} normal block at 0x0000024DEACAE710, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{436} normal block at 0x0000024DEACB0010, 16 bytes long.
Data: <("&#203;&#234;M > 28 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{435} normal block at 0x0000024DEACAFF20, 16 bytes long.
Data: < "&#203;&#234;M > 00 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x0000024DEACB0880, 16 bytes long.
Data: <&#216;!&#203;&#234;M > D8 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x0000024DEACB01A0, 16 bytes long.
Data: <&#176;!&#203;&#234;M > B0 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x0000024DEACB0970, 16 bytes long.
Data: < !&#203;&#234;M > 88 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{431} normal block at 0x0000024DEACB0150, 16 bytes long.
Data: <`!&#203;&#234;M > 60 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{430} normal block at 0x0000024DEACB0E70, 16 bytes long.
Data: <@!&#203;&#234;M > 40 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{429} normal block at 0x0000024DEACB06A0, 16 bytes long.
Data: < !&#203;&#234;M > 18 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{428} normal block at 0x0000024DEACB0E20, 16 bytes long.
Data: <&#240; &#203;&#234;M > F0 20 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{427} normal block at 0x0000024DEACB20F0, 2976 bytes long.
Data: < &#203;&#234;M .\7za.ex> 20 0E CB EA 4D 02 00 00 2E 5C 37 7A 61 2E 65 78
{66} normal block at 0x0000024DEACA3AB0, 16 bytes long.
Data: < &#234;&#187;&#164;&#246; > 80 EA BB A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x0000024DEACA42D0, 16 bytes long.
Data: <@&#233;&#187;&#164;&#246; > 40 E9 BB A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000024DEACA3B50, 16 bytes long.
Data: <&#248;W&#184;&#164;&#246; > F8 57 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000024DEACA4460, 16 bytes long.
Data: <&#216;W&#184;&#164;&#246; > D8 57 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000024DEACA46E0, 16 bytes long.
Data: <P &#184;&#164;&#246; > 50 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000024DEACA4280, 16 bytes long.
Data: <0 &#184;&#164;&#246; > 30 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000024DEACA3A60, 16 bytes long.
Data: <&#224; &#184;&#164;&#246; > E0 02 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000024DEACA4140, 16 bytes long.
Data: < &#184;&#164;&#246; > 10 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000024DEACA3CE0, 16 bytes long.
Data: <p &#184;&#164;&#246; > 70 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000024DEACA4690, 16 bytes long.
Data: < &#192;&#182;&#164;&#246; > 18 C0 B6 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59485 - Posted: 19 Oct 2022 | 8:45:38 UTC

An example of that: workunit 27329338 has failed for everyone, mine after about 10%.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59486 - Posted: 19 Oct 2022 | 9:08:02 UTC - in response to Message 59481.

I am sorry, old batch jobs are still being mixed with new ones that do run successfully (I have been monitoring them). BOINC will eventually run out of bad jobs; the problem is that it attempts to run them 8 times...
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59487 - Posted: 19 Oct 2022 | 9:19:00 UTC - in response to Message 59486.

the problem is that it attempts to run them 8 times...

Look at that last workunit link. Above the list, it says:

max # of error/total/success tasks 7, 10, 6

That's configurable by the project, I think at the application level. You might be able to reduce it a bit?
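
A rough way to see how the limits quoted above could add up to 8 attempts is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: the server gives up on a workunit only once the number of failed tasks exceeds max_error_results, and it never creates more than max_total_results tasks. The function name is made up for illustration.

```python
def failing_attempts(max_error_results: int, max_total_results: int) -> int:
    """Count how many tasks of an always-failing workunit get sent out.

    Assumption (hypothetical model, not taken from the server code):
    the workunit is abandoned when errors EXCEED max_error_results,
    and at most max_total_results tasks are ever created.
    """
    errors = 0
    sent = 0
    while sent < max_total_results:
        sent += 1
        errors += 1  # in this scenario every attempt fails
        if errors > max_error_results:
            break
    return sent


# Limits shown on the workunit page: error/total/success = 7, 10, 6
print(failing_attempts(7, 10))
# Documented defaults: max_error_results 3, max_total_results 10
print(failing_attempts(3, 10))
```

Under that assumption, a limit of 7 errors gives 8 attempts, which would match what is being observed, while the documented default of 3 would give only 4.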

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59488 - Posted: 20 Oct 2022 | 5:51:53 UTC - in response to Message 59487.

Yesterday I was unable to find the specific parameter that defines the number of job attempts. I will ask the main admin. Maybe it is set for all applications.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59489 - Posted: 20 Oct 2022 | 7:43:13 UTC - in response to Message 59488.

From looking at the server code in the create_work.cpp module, the parameter is pulled from the work unit template file.

You need to change the input (infile1, infile2 ...) file that feeds into the wu template file. Or directly change the wu template file.

Refer to these documents.

https://boinc.berkeley.edu/trac/wiki/JobSubmission

https://boinc.berkeley.edu/trac/wiki/JobTemplates#Inputtemplates

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59490 - Posted: 20 Oct 2022 | 7:43:50 UTC - in response to Message 59488.

Found some documentation: in https://boinc.berkeley.edu/trac/wiki/JobSubmission

The following job parameters may be passed in the input template, or as command-line arguments to create_work; the input template has precedence. If not specified, the given defaults will be used.

--target_nresults x
default 2
--max_error_results x
default 3
--max_total_results x
default 10
--max_success_results x
default 6

I can't find any similar detail for Local web-based job submission or Remote job submission, but it must be buried somewhere in there. You're not using the stated default values, so somebody at GPUGrid must have found it at least once!

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59491 - Posted: 20 Oct 2022 | 9:23:24 UTC - in response to Message 59477.

Abouh wrote:

Hello, thank you for reporting the job errors. Sorry to all, there was an error on my side setting up a batch of experiment agents. ... They will fail briefly after starting as reported, so not a lot of compute will be wasted.

well, whatever "they will fail briefly after starting" means :-)

Mine are failing after 3.780 - 8.597 seconds :-(

Is there no way to call them back or delete them from the server?

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59492 - Posted: 20 Oct 2022 | 13:46:39 UTC - in response to Message 59487.
Last modified: 20 Oct 2022 | 13:51:25 UTC

I see these values can be set in the app workunit template as mentioned


--max_error_results x
default 3
--max_total_results x
default 10
--max_success_results x
default 6


I have checked, and for the PythonGPU and PythonGPU apps the parameters are not specified, so the default values should apply (which is also consistent with the info previously posted).

However, the number of times the server attempts to solve a task by sending it to a GPUGrid machine before giving up is 8. So it does not seem like it is specified by these parameters to me (shouldn't it be 3 according to the default value?).

I have asked the server admin for help; maybe the parameters are overwritten somewhere else. Even if it does not help this time, it will be useful to know for solving future issues like this one. Sorry again for the problems.
____________

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59493 - Posted: 20 Oct 2022 | 16:59:01 UTC - in response to Message 59491.

Abouh wrote:
Hello, thank you for reporting the job errors. Sorry to all, there was an error on my side setting up a batch of experiment agents. ... They will fail briefly after starting as reported, so not a lot of compute will be wasted.

well, whatever "they will fail briefly after starting" means :-)

Mine are failing after 3.780 - 8.597 seconds :-(

Is there no way to call them back or delete them from the server?


Not anymore. Anyway, after 9.45 UTC something seems to have changed. I have two Wu's (fingers crossed and touch wood) that have reached 35% in six hours.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59496 - Posted: 20 Oct 2022 | 19:04:05 UTC

can someone give me advice with regard to the following dilemma:

Until last week, on my host with 2 RTX3070 inside I could process 2 Pythons concurrently on each GPU, i.e. 4 Pythons at a time.
On device_0 VRAM became rather tight - it comes with 8.192MB, about 300-400MB were used for the monitor, and with the two Pythons the total VRAM usage was at around 8.112MB (as said: tight, but it worked fine).
On device_1 it was not that tight, since no VRAM usage for the monitor.

Since yesterday I notice that device_0 uses about 1.400MB for the monitor - no idea why. So no way to process 2 Pythons concurrently.
And no way for device_1 to run 2 Pythons either, because any additional Python beyond the one running on device_0 and the one running on device_1 would automatically start on device_0.
Hence, my question to the experts here: is there a way to tell the third Python to run on device_1, instead of device_0 ?
Or, any idea how I could lower the VRAM usage for the monitor on device_0? As said, it was much less before; all of a sudden it jumped up (I was naive enough to connect the monitor cable to device_1 - which did, of course, not work).
Or any other ideas?

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59497 - Posted: 20 Oct 2022 | 19:44:56 UTC

Finally, WU #38 worked and was completed within two hours. Thanks, earned my first points here.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59498 - Posted: 20 Oct 2022 | 21:39:09 UTC - in response to Message 59496.

reboot the system and free up the VRAM maybe.


____________

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59499 - Posted: 20 Oct 2022 | 23:54:05 UTC

Browser tabs are notorious RAM eaters. Both in the cpu and gpu if you have hardware acceleration enabled in the browser.

You can use gpu_exclude statement in the cc_config.xml file to keep a gpu task off specific gpus. I do that for keeping the tasks off my fastest gpus which run other projects.

But that is permanent for the BOINC session that is booted. You would have to edit cc_config files for different sessions and boot what you need as necessary to get around this issue. Doable but cumbersome.
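
For anyone who wants to try this, here is a minimal sketch that writes out such an exclusion block. In the BOINC client configuration the element is spelled `<exclude_gpu>` and sits inside `<options>` in cc_config.xml. The project URL and the app name `PythonGPU` are taken from this thread and are assumptions about what your client expects; the helper name `build_exclude_gpu` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu(project_url, device_num, app=None, gpu_type="NVIDIA"):
    """Return cc_config.xml text that keeps a project (or one app) off one GPU."""
    cc_config = ET.Element("cc_config")
    options = ET.SubElement(cc_config, "options")
    excl = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(excl, "url").text = project_url
    ET.SubElement(excl, "device_num").text = str(device_num)
    ET.SubElement(excl, "type").text = gpu_type
    if app is not None:
        # Optional: restrict the exclusion to a single application
        ET.SubElement(excl, "app").text = app
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(cc_config)
    return ET.tostring(cc_config, encoding="unicode")


# Assumed values: project URL and app name as they appear in this thread
xml_text = build_exclude_gpu("https://www.gpugrid.net/", 0, app="PythonGPU")
print(xml_text)
```

Merge the printed block into an existing cc_config.xml rather than overwriting the file, since it may already contain other options.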

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59500 - Posted: 21 Oct 2022 | 1:28:32 UTC - in response to Message 59499.

Browser tabs are notorious RAM eaters. Both in the cpu and gpu if you have hardware acceleration enabled in the browser.


good call. forgot the browser can use some GPU resources. that's a good thing to check.
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59501 - Posted: 21 Oct 2022 | 6:09:14 UTC

many thanks, folks, for your replies regarding my VRAM problem.

I have rebooted, and the VRAM usage of device_0 was almost 2GB. No browser open, no other apps either (except GPU-Z, the MSI Afterburner, MemInfo, DUMeter, and the Windows Task Manager - these apps had been present before, too).

Now, with processing 1 Python on each GPU, the VRAM situation is as follows:
device_0: 6.034MB
device_1: 3.932MB

hence, a second Python could be run on device_1.

I know about the "gpu_exclude" thing in the cc_config.xml, but for sure this is a very cumbersome method; and I am not even sure whether in Windows a running Python survives a BOINC reboot (I think I did that once before, for a different reason, and the Python was gone).

The only thing I could try again is to open the second instance of BOINC which I had configured some time ago, with the "gpu_exclude" provision for device_0.
However, when I tried this out, everything crashed after a short while (1 or 2 hours). I did not find out why. Perhaps it was simply a coincidence and would not happen again?

It's really a pity that with all these various configuration possibilities via cc_config.xml (and also app_config.xml) there is no configuration available which would solve my problem :-(

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59502 - Posted: 21 Oct 2022 | 6:39:35 UTC

I think you may have to accept the tasks are what they are. Variable because of the different parameter sets. Some may use little RAM and some may use a lot.
So you may not always be able to run doubles on your 8GB cards.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59503 - Posted: 21 Oct 2022 | 8:32:19 UTC - in response to Message 59502.
Last modified: 21 Oct 2022 | 8:39:11 UTC

I think you may have to accept the tasks are what they are. Variable because of the different parameter sets. Some may use little RAM and some may use a lot.
So you may not always be able to run doubles on your 8GB cards.

yes, meanwhile I noticed on the other two hosts which are running Pythons ATM: the amount of VRAM used varies.

No problem of course on the host with the Quadro P5000 which comes with 16GB. Out of which only some 7.5GB are being used even with 4 tasks in parallel, due to the lower number of CUDA cores of this GPU.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59504 - Posted: 21 Oct 2022 | 14:30:14 UTC

are newer tasks using more VRAM? or is there something on your system using more VRAM?

what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59505 - Posted: 21 Oct 2022 | 14:45:56 UTC - in response to Message 59504.

what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram

hm, I will have to find a tool that tells me :-)
Any recommendation?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59506 - Posted: 21 Oct 2022 | 15:43:50 UTC - in response to Message 59505.

what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram

hm, I will have to find a tool that tells me :-)
Any recommendation?

nvidia-smi in the Terminal does nicely.
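
If you would rather collect the numbers in a script, the sketch below parses the per-process list from nvidia-smi. It assumes nvidia-smi is on the PATH and uses its `--query-compute-apps` CSV output; the function names are made up for illustration. Two caveats, both hedged: on Windows the per-process memory may be reported as not available, in which case the sketch returns None for that process, and this query only lists compute processes, so desktop/monitor usage may only show up in the plain `nvidia-smi` table.

```python
import subprocess

# Per-process query; memory values are in MiB when "nounits" is used
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV lines into (pid, process_name, mem_mib_or_None) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed lines
        # pid is the first field, memory the last; the name may contain commas
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        first = mem.split()[0] if mem else ""
        mem_mib = int(first) if first.isdigit() else None
        rows.append((int(pid.strip()), name.strip(), mem_mib))
    return rows


def query_compute_apps():
    """Run nvidia-smi and return the parsed per-process list."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling `query_compute_apps()` on the host and sorting the result by the memory column should show which process is holding the VRAM.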

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59507 - Posted: 21 Oct 2022 | 16:18:26 UTC

check here for nvidia-smi use on Windows. it's easy on Linux, but less intuitive on Windows

https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59510 - Posted: 22 Oct 2022 | 7:16:46 UTC
Last modified: 22 Oct 2022 | 7:43:42 UTC

my hosts still keep receiving faulty tasks which are totally "fresh", no re-submitted ones.
So there must be tons of those still in the bucket :-(

Just noticed that a task failed after >19 hours. This is not nice :-(

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59511 - Posted: 22 Oct 2022 | 11:40:15 UTC - in response to Message 59510.

my hosts still keep receiving faulty tasks which are totally "fresh", no re-submitted ones.
So there must be tons of those still in the bucket :-(

Just noticed that a task failed after >19 hours. This is not nice :-(

I was out for a few hours, and when I came back, I noticed 2 more failed tasks (both ran for almost 3 hours before they crashed).

At the beginning of the problem the tasks failed within a short time, as Abouh also noted, so not much was wasted. Now these tasks fail only after several hours.

Within the past 24 hours, my hosts' total computation time of all the failing tasks was 104.526 seconds = 29 hours!

I am very much willing to support the science with my time, my equipment and my permanently increasing electricity bill as long as it makes sense (and as long as I can afford it).
FYI, my electricity costs have more than tripled since the beginning of the year, for known reasons. That's significant!

I simply cannot believe that all these faulty tasks in the big download bucket cannot be stopped, retrieved, cancelled or whatever else. It makes absolutely no sense to leave them in there and send them out to us for the next several weeks.

If the GPUGRID people cannot confirm that they are quickly finding a way to stop these faulty tasks, I have no other choice, as sorry as I am, but to switch to other projects :-(

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 114,775,136
RAC: 15,420
Message 59512 - Posted: 22 Oct 2022 | 11:45:46 UTC

For the time being, I have suspended receiving new tasks and reverted to E@H & F@H until this situation with faulty tasks has been sorted out.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59513 - Posted: 22 Oct 2022 | 13:26:11 UTC
Last modified: 22 Oct 2022 | 13:29:49 UTC

Most peculiar, I have had no failed task. Seven so far.
I wish with internet problems we could also get a standby task.
Maybe they are sending these tasks to those multiple WUs crunching machines who can quickly clear up the backlog :)

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59514 - Posted: 22 Oct 2022 | 15:29:36 UTC - in response to Message 59510.

I have reviewed your recent tasks and there is a mix of faulty and successful tasks. The successful ones are newer and are the only ones being submitted now.

I could not figure out how to cancel the faulty tasks earlier. However, they should be almost all if not all crunched by now.

Maybe other hosts can confirm if they are still getting tasks that crash, but I expect the problem to be solved now. For the last 2-3 days only good tasks have been sent.

____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59515 - Posted: 22 Oct 2022 | 15:51:18 UTC - in response to Message 59514.

@ Erich56: you have to look into the history and the reason for the crashes. I got one of the last replications from workunit 27327972 last night - but that's one that was created on 16 October, almost a week ago. It's just that the first owner hung on to it for five days and did nothing. That's not the project's fault, even if the initial error was.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59516 - Posted: 22 Oct 2022 | 15:55:21 UTC - in response to Message 59514.

For the last 2-3 days only good tasks have been sent.

thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.

Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:

https://www.gpugrid.net/result.php?resultid=33112434

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59517 - Posted: 22 Oct 2022 | 15:58:18 UTC

Likewise. Since I posted, I've received another one which is likely to go the same way, from workunit 27328975. Another 5-day no-show by a Science United user.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59518 - Posted: 22 Oct 2022 | 16:17:52 UTC

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59519 - Posted: 22 Oct 2022 | 16:37:02 UTC - in response to Message 59518.

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.


Mine are set to ten plus ten days but I still get one. This is not the reason.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59520 - Posted: 22 Oct 2022 | 16:52:27 UTC - in response to Message 59516.

For the last 2-3 days only good tasks have been sent.

thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.

Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:

https://www.gpugrid.net/result.php?resultid=33112434


When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59521 - Posted: 22 Oct 2022 | 16:57:30 UTC - in response to Message 59518.

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.


Ideally, when the task approaches the deadline it should jump into high priority mode and jump to the front of the line for task priority. But the process doesn’t always work ideally with BOINC.

But there are also many people who blindly download tasks then shut off their computer for extended periods of time.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59522 - Posted: 22 Oct 2022 | 17:10:49 UTC - in response to Message 59516.
Last modified: 22 Oct 2022 | 17:12:02 UTC

Yes, I meant fresh tasks, which would be sent out for the first time out of 8 possible attempts.

Yes, repetitions are an issue. I understand why it was set to a relatively high value. Many machines in the network have limited GPU memory (e.g. 2 GB) or configuration problems and inevitably fail with these tasks. The high value gave the experiments some error tolerance.

However, ideally I would like to be able to modify it temporarily, just for the python apps, in cases like this one. I could set it to 1 for a few hours so all bad tasks are processed quickly and then go back to 8.
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59523 - Posted: 22 Oct 2022 | 18:10:21 UTC - in response to Message 59520.

Ian&Steve C. wrote:

When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.

Well, not a bad idea, if I had the time to babysit my hosts 24/7 :-)

However, this would end up with a problem rather quickly: isn't it still the case that once a certain number of downloaded tasks has been deleted, no further ones will be sent within the following 24 hours?
In fact, I remember that this was even true for failing tasks in the past, based on the assumption that there is something wrong with the host. So, in view of the many failed tasks now, I am surprised that I still get new ones within the mentioned 24-hour ban.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59524 - Posted: 22 Oct 2022 | 18:20:24 UTC

Depends on how they have set up the server software.

There are BOINC configs so that "bad actors" are put into timeout mode when they return a large number of bad results in a short time period. The 24 hour timeout you mentioned.

Once a host starts returning valid results, they are given increasing amounts of work on each scheduler request.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59525 - Posted: 23 Oct 2022 | 16:16:10 UTC

Crazy, I had another task which failed after more than 20 hours :-(

I could live with the situation when a task fails after say 20 minutes or half an hour, once in a while.
There was another task yesterday which failed after almost 20 hours.
And there were numerous tasks in addition which failed after less than one hour but also after much more than one hour.

My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks.
I am sorry but I no longer can live with this waste, particularly with what electricity here costs by now (and getting even more expensive soon).

So I put GPUGRID on NNT and will crunch other projects. As sorry as I am for this step :-(

What I hope is that one day BOINC will develop a mechanism for calling back faulty batches. And I don't understand why this is not possible so far.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59526 - Posted: 23 Oct 2022 | 16:56:20 UTC - in response to Message 59525.

Must be a Windows thing. None of my badly formatted tasks run longer than about 40 minutes before failing out.

Yes, there are many flaws with BOINC, but unless you can develop a better solution, you will have to use what we have.

Sorry to have you leave the project.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59527 - Posted: 23 Oct 2022 | 17:57:15 UTC - in response to Message 59525.

Crazy, I had another task which failed after more than 20 hours :-(

I could live with the situation when a task fails after say 20 minutes or half an hour, once in a while.
There was another task yesterday which failed after almost 20 hours.
And there were numerous tasks in addition which failed after less than one hour but also after much more than one hour.

My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks.
I am sorry but I no longer can live with this waste, particularly with what electricity here costs by now (and getting even more expensive soon).

So I put GPUGRID on NNT and will crunch other projects. As sorry as I am for this step :-(

What I hope is that one day BOINC will develop a mechanism for calling back faulty batches. And I don't understand why this is not possible so far.



The tasks that were failing were taking around three minutes not twenty hours.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59528 - Posted: 24 Oct 2022 | 5:59:55 UTC - in response to Message 59527.


The tasks that were failing were taking around three minutes not twenty hours.

for sure NOT 3 minutes. Example here:

20 Oct 2022 | 1:19:26 UTC 20 Oct 2022 | 2:57:36 UTC Error while computing 3,780.66 3,780.66 --- Python apps for GPU hosts v4.04 (cuda1131)

so, in above example, the task failed after 1 Hr 38 mins.


20 Oct 2022 | 1:44:50 UTC 20 Oct 2022 | 3:08:40 UTC Error while computing 5,195.80 5,195.80 --- Python apps for GPU hosts v4.04 (cuda1131)
here, the task failed after 1 hr 23 mins.

but, interestingly enough, here the relation is quite different:
22 Oct 2022 | 6:41:59 UTC 22 Oct 2022 | 7:07:44 UTC Error while computing 70,694.64 70,694.64 --- Python apps for GPU hosts v4.04 (cuda1131)
the task obviously failed after 25 minutes, although runtime and CPU time as indicated would suggest >19 hrs.

These indications are somewhat unclear (to me).


Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59529 - Posted: 24 Oct 2022 | 6:41:25 UTC

You MUST absolutely ignore any reported times for cpu_time and run_time for the Python tasks.

The numbers are meaningless. BOINC is unable to correctly calculate the times because of the dual cpu-gpu nature of the tasks.

If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.

The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.

The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time.

Since I only carry a single task at any time, I report one task and receive its replacement on the same scheduler connection so I know my elapsed time is pretty close to the actual difference between sent time and reported time.
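
As a rough illustration of that sent-to-reported calculation, here is a minimal Python sketch. The function name `turnaround` and the format string are my own assumptions, based on how the task list on the website displays timestamps; it also assumes English month abbreviations (the default locale).

```python
from datetime import datetime

# Format used in the task list on the website, e.g. "20 Oct 2022 | 1:19:26 UTC".
# (Assumed from the rows quoted in this thread.)
STAMP_FORMAT = "%d %b %Y | %H:%M:%S UTC"


def turnaround(sent: str, reported: str):
    """Return reported - sent as a timedelta (an upper bound on real run time)."""
    start = datetime.strptime(sent, STAMP_FORMAT)
    end = datetime.strptime(reported, STAMP_FORMAT)
    return end - start


# Sent/reported pairs quoted earlier in this thread.
pairs = [
    ("20 Oct 2022 | 1:19:26 UTC", "20 Oct 2022 | 2:57:36 UTC"),
    ("20 Oct 2022 | 1:44:50 UTC", "20 Oct 2022 | 3:08:40 UTC"),
    ("22 Oct 2022 | 6:41:59 UTC", "22 Oct 2022 | 7:07:44 UTC"),
]
for sent, reported in pairs:
    print(sent, "->", turnaround(sent, reported))
```

Keep in mind the result is only an upper bound: it includes any time the task sat in the cache before starting, and any delay before the result was reported.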

KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59530 - Posted: 24 Oct 2022 | 7:20:40 UTC

I get one task at a time also.
Anyway, I got one failure today: task 33115748. It has failed seven times already, with one timed out. It is waiting to go to someone once more.

Stderr output
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
06:36:47 (12932): wrapper (7.9.26016): starting
06:36:47 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688

Everything is Ok

Size: 6410311680
Compressed: 1976180228
06:38:33 (12932): .\7za.exe exited; CPU time 100.578125
06:38:33 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
06:38:34 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
06:38:34 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII

Everything is Ok

Files: 38141
Size: 6380353601
Compressed: 6410311680
06:39:39 (12932): .\7za.exe exited; CPU time 21.781250
06:39:39 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
06:39:40 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
06:39:40 (12932): wrapper: running python.exe (run.py)
Starting!!
Windows fix!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
06:44:10 (12932): python.exe exited; CPU time 1673.984375
06:44:10 (12932): app exit status: 0x1
06:44:10 (12932): called boinc_finish(195)
0 bytes in 0 Free Blocks.
554 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 4443701 bytes.
Dumping objects ->
{11071} normal block at 0x000002340B7911E0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{11030} normal block at 0x000002340B791090, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{11019} normal block at 0x000002340B791170, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{11008} normal block at 0x000002340B790FB0, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{10997} normal block at 0x000002340B790D80, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10986} normal block at 0x000002340B791020, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{10905} normal block at 0x0000023409C90AB0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {10902} normal block at 0x0000023409C8E2D0, 8 bytes long.
Data: < _ 4 > 00 00 5F 0B 34 02 00 00
{10127} normal block at 0x0000023409C909E0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{9380} normal block at 0x0000023409C8E550, 8 bytes long.
Data: < &#202;&#203; 4 > 80 CA CB 09 34 02 00 00
..\zip\boinc_zip.cpp(122) : {544} normal block at 0x0000023409C90B80, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{531} normal block at 0x0000023409C8A430, 32 bytes long.
Data: <0&#139;&#200; 4 &#208;&#134;&#200; 4 > 30 8B C8 09 34 02 00 00 D0 86 C8 09 34 02 00 00
{530} normal block at 0x0000023409C88A50, 52 bytes long.
Data: < r &#205;&#205; > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{525} normal block at 0x0000023409C88580, 43 bytes long.
Data: < p &#205;&#205; > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{520} normal block at 0x0000023409C886D0, 44 bytes long.
Data: < &#205;&#205;&#241;&#134;&#200; 4 > 01 00 00 00 00 00 CD CD F1 86 C8 09 34 02 00 00
{515} normal block at 0x0000023409C88B30, 44 bytes long.
Data: < &#205;&#205;Q&#139;&#200; 4 > 01 00 00 00 00 00 CD CD 51 8B C8 09 34 02 00 00
{505} normal block at 0x0000023409C910C0, 16 bytes long.
Data: < &#133;&#200; 4 > 10 85 C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{504} normal block at 0x0000023409C88510, 40 bytes long.
Data: <&#192; &#201; 4 input.zi> C0 10 C9 09 34 02 00 00 69 6E 70 75 74 2E 7A 69
{497} normal block at 0x0000023409C90EE0, 16 bytes long.
Data: < &&#201; 4 > 08 26 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x0000023409C91610, 16 bytes long.
Data: <&#224;%&#201; 4 > E0 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x0000023409C91C00, 16 bytes long.
Data: <&#184;%&#201; 4 > B8 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{494} normal block at 0x0000023409C90DA0, 16 bytes long.
Data: < %&#201; 4 > 90 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x0000023409C918E0, 16 bytes long.
Data: <h%&#201; 4 > 68 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{492} normal block at 0x0000023409C90D50, 16 bytes long.
Data: <@%&#201; 4 > 40 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{490} normal block at 0x0000023409C912F0, 16 bytes long.
Data: < &#201; 4 > 88 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{489} normal block at 0x0000023409C89BF0, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{488} normal block at 0x0000023409C90E40, 16 bytes long.
Data: <` &#201; 4 > 60 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{487} normal block at 0x0000023409C75300, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{486} normal block at 0x0000023409C912A0, 16 bytes long.
Data: <8 &#201; 4 > 38 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{485} normal block at 0x0000023409C8A3D0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{484} normal block at 0x0000023409C91CA0, 16 bytes long.
Data: < &#201; 4 > 10 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{483} normal block at 0x0000023409C91200, 16 bytes long.
Data: <&#232;&#255;&#200; 4 > E8 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x0000023409C91C50, 16 bytes long.
Data: <&#192;&#255;&#200; 4 > C0 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{481} normal block at 0x0000023409C91110, 16 bytes long.
Data: < &#255;&#200; 4 > 98 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x0000023409C91BB0, 16 bytes long.
Data: <p&#255;&#200; 4 > 70 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{479} normal block at 0x0000023409C91520, 16 bytes long.
Data: <H&#255;&#200; 4 > 48 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{478} normal block at 0x0000023409C8A790, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{477} normal block at 0x0000023409C90FD0, 16 bytes long.
Data: < &#255;&#200; 4 > 20 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{476} normal block at 0x0000023409C8A310, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{475} normal block at 0x0000023409C913E0, 16 bytes long.
Data: <&#248;&#254;&#200; 4 > F8 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{474} normal block at 0x0000023409C89FB0, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{473} normal block at 0x0000023409C91070, 16 bytes long.
Data: <&#208;&#254;&#200; 4 > D0 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{472} normal block at 0x0000023409C8FED0, 480 bytes long.
Data: <p &#201; 4 &#176;&#159;&#200; 4 > 70 10 C9 09 34 02 00 00 B0 9F C8 09 34 02 00 00
{471} normal block at 0x0000023409C91B10, 16 bytes long.
Data: < %&#201; 4 > 20 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x0000023409C90F80, 16 bytes long.
Data: <&#248;$&#201; 4 > F8 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x0000023409C91AC0, 16 bytes long.
Data: <&#208;$&#201; 4 > D0 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{468} normal block at 0x0000023409C88820, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{467} normal block at 0x0000023409C91660, 16 bytes long.
Data: < $&#201; 4 > 18 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x0000023409C914D0, 16 bytes long.
Data: <&#240;#&#201; 4 > F0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x0000023409C91890, 16 bytes long.
Data: <&#200;#&#201; 4 > C8 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{464} normal block at 0x0000023409C91A70, 16 bytes long.
Data: <&#160;#&#201; 4 > A0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x0000023409C90E90, 16 bytes long.
Data: <x#&#201; 4 > 78 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{462} normal block at 0x0000023409C91570, 16 bytes long.
Data: <P#&#201; 4 > 50 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x0000023409C8E960, 16 bytes long.
Data: <0#&#201; 4 > 30 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x0000023409C8E910, 16 bytes long.
Data: < #&#201; 4 > 08 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{459} normal block at 0x0000023409C89A10, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{458} normal block at 0x0000023409C8E8C0, 16 bytes long.
Data: <&#224;"&#201; 4 > E0 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{457} normal block at 0x0000023409C889E0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{456} normal block at 0x0000023409C8E7D0, 16 bytes long.
Data: <("&#201; 4 > 28 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x0000023409C8E4B0, 16 bytes long.
Data: < "&#201; 4 > 00 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x0000023409C8E820, 16 bytes long.
Data: <&#216;!&#201; 4 > D8 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x0000023409C8E780, 16 bytes long.
Data: <&#176;!&#201; 4 > B0 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{452} normal block at 0x0000023409C8E460, 16 bytes long.
Data: < !&#201; 4 > 88 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x0000023409C8E500, 16 bytes long.
Data: <`!&#201; 4 > 60 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x0000023409C8EA00, 16 bytes long.
Data: <@!&#201; 4 > 40 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x0000023409C8E5F0, 16 bytes long.
Data: < !&#201; 4 > 18 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x0000023409C8E730, 16 bytes long.
Data: <&#240; &#201; 4 > F0 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{447} normal block at 0x0000023409C884A0, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{446} normal block at 0x0000023409C8E9B0, 16 bytes long.
Data: <8 &#201; 4 > 38 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x0000023409C863C0, 16 bytes long.
Data: < &#201; 4 > 10 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x0000023409C85BF0, 16 bytes long.
Data: <&#232; &#201; 4 > E8 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{443} normal block at 0x0000023409C85A60, 16 bytes long.
Data: <&#192; &#201; 4 > C0 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x0000023409C86370, 16 bytes long.
Data: < &#201; 4 > 98 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{441} normal block at 0x0000023409C86460, 16 bytes long.
Data: <p &#201; 4 > 70 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x0000023409C862D0, 16 bytes long.
Data: <P &#201; 4 > 50 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x0000023409C859C0, 16 bytes long.
Data: <( &#201; 4 > 28 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{438} normal block at 0x0000023409C8A370, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{437} normal block at 0x0000023409C86320, 16 bytes long.
Data: < &#201; 4 > 00 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{436} normal block at 0x0000023409C885F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{435} normal block at 0x0000023409C86410, 16 bytes long.
Data: <H &#201; 4 > 48 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x0000023409C85FB0, 16 bytes long.
Data: < &#201; 4 > 20 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x0000023409C85970, 16 bytes long.
Data: <&#248; &#201; 4 > F8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x0000023409C85880, 16 bytes long.
Data: <&#208; &#201; 4 > D0 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{431} normal block at 0x0000023409C866E0, 16 bytes long.
Data: <&#168; &#201; 4 > A8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{430} normal block at 0x0000023409C86690, 16 bytes long.
Data: < &#201; 4 > 80 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{429} normal block at 0x0000023409C85F60, 16 bytes long.
Data: <` &#201; 4 > 60 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{428} normal block at 0x0000023409C858D0, 16 bytes long.
Data: <8 &#201; 4 > 38 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{427} normal block at 0x0000023409C85830, 16 bytes long.
Data: < &#201; 4 > 10 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{426} normal block at 0x0000023409C91D10, 2976 bytes long.
Data: <0X&#200; 4 .\7za.ex> 30 58 C8 09 34 02 00 00 2E 5C 37 7A 61 2E 65 78
{65} normal block at 0x0000023409C86550, 16 bytes long.
Data: < &#234;&#215;W&#247; > 80 EA D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000023409C85920, 16 bytes long.
Data: <@&#233;&#215;W&#247; > 40 E9 D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000023409C860F0, 16 bytes long.
Data: <&#248;W&#212;W&#247; > F8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000023409C85C90, 16 bytes long.
Data: <&#216;W&#212;W&#247; > D8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000023409C85B50, 16 bytes long.
Data: <P &#212;W&#247; > 50 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000023409C85DD0, 16 bytes long.
Data: <0 &#212;W&#247; > 30 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000023409C86230, 16 bytes long.
Data: <&#224; &#212;W&#247; > E0 02 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000023409C85B00, 16 bytes long.
Data: < &#212;W&#247; > 10 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000023409C860A0, 16 bytes long.
Data: <p &#212;W&#247; > 70 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000023409C85C40, 16 bytes long.
Data: < &#192;&#210;W&#247; > 18 C0 D2 57 F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>
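
For what it's worth, my reading of the traceback above (not of the pytorchrl source, which I have not checked): the first AttributeError looks like a normal first-call path where no `batches` attribute exists yet, and the real failure is the KeyError raised while collecting data, because an info dict lacks the 'StateEmbeddings' key. A minimal Python sketch of that pattern follows. The class names `Collector` and `Worker` and the 'Reward' key are hypothetical stand-ins, not the project's code.

```python
class Collector:
    """Hypothetical stand-in for the data-collection step."""

    def __init__(self, infos):
        self.infos = infos

    def step(self):
        # Same shape as the list comprehension shown in the log:
        # every info dict is expected to carry the key.
        return [info["StateEmbeddings"] for info in self.infos]


class Worker:
    """Hypothetical stand-in for the worker that asks for the next batch."""

    def __init__(self, collector):
        self.collector = collector

    def get_data(self):
        try:
            # On the first call there is no 'batches' attribute yet.
            return next(self.batches)
        except AttributeError:
            # Fallback path: collect fresh data. An error raised in here is
            # reported as "During handling of the above exception, ...".
            self.batches = iter(self.collector.step())
            return next(self.batches)


good = Worker(Collector([{"StateEmbeddings": [0.1, 0.2]}]))
print(good.get_data())

bad = Worker(Collector([{"Reward": 1.0}]))  # info dict without the key
try:
    bad.get_data()
except KeyError as err:
    print("KeyError:", err, "| raised while handling:", type(err.__context__).__name__)
```

If that reading is right, the AttributeError is harmless noise and the task fails because of the data it was sent, which would fit the same workunit failing on every host.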

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59531 - Posted: 24 Oct 2022 | 8:00:24 UTC
Last modified: 24 Oct 2022 | 8:01:04 UTC

@ Erich56, @ KAMasud

Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.

It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye.

Profile [AF] fansyl
Send message
Joined: 26 Sep 13
Posts: 20
Credit: 1,714,356,441
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59532 - Posted: 24 Oct 2022 | 8:32:36 UTC
Last modified: 24 Oct 2022 | 8:33:19 UTC

Hello,

all my tasks behave in the same way: they advance to 4% and then have no activity. I have to cancel them after several hours of idle time.

Example: https://www.gpugrid.net/result.php?resultid=33109419

The machine is equipped with a GTX1080, 32GB of RAM and 16GB of swap.

Thank you for your help
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59533 - Posted: 24 Oct 2022 | 8:40:18 UTC - in response to Message 59532.

Example: https://www.gpugrid.net/result.php?resultid=33109419

OSError: [WinError 1455] The paging file is too small for this operation to complete. (Original French message: "Le fichier de pagination est insuffisant pour terminer cette opération.") Error loading "D:\BOINC\slots\3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

Your page file still isn't large enough.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59534 - Posted: 24 Oct 2022 | 10:28:55 UTC - in response to Message 59531.

@ Erich56, @ KAMasud

Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.

It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye.

Hi Richard,

I do know how to put a hyperlink into my texts. In my previous posting, my main intention was to show the time the task was received and later on sent back after failure. So I didn't deem it necessary to hyperlink the task itself.
But you are right: there may be more details which could be of interest to you guys, no doubt. So in the future, whenever referring to a given task, I'll hyperlink it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59535 - Posted: 24 Oct 2022 | 10:36:56 UTC - in response to Message 59529.

Keith wrote:

The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time.

Since I only carry a single task at any time, I report one task and receive its replacement on the same scheduler connection so I know my elapsed time is pretty close to the actual difference between sent time and reported time.


What you say in the last paragraph is also true for my hosts.

I agree with what you wrote in the paragraph before. That's why, in my posting, I cited the times at which the tasks were received and then reported back after failure. These were the actual runtimes, with no "sitting" time included.


KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59536 - Posted: 24 Oct 2022 | 12:09:09 UTC - in response to Message 59531.
Last modified: 24 Oct 2022 | 12:16:05 UTC

@ Erich56, @ KAMasud

Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.

It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye.


Richard, could you please make a different thread and teach us all the tricks? We would be very grateful.
I looked it up on Wikipedia and did not find much. There should be some page on Boinc itself, can you give the link?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59537 - Posted: 24 Oct 2022 | 12:33:48 UTC - in response to Message 59536.

There should be some page on Boinc itself, can you give the link?

There is. To the top left of the text entry box where you type a message (just below the word 'Author' on the grey divider line), there's a link:

Use BBCode tags to format your text

That opens in a separate browser window (or tab), so you can refer to it while composing your message. Use the 'quote' button below this message to see how I've made the link work here.
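
If you post a lot of task references, a small helper can build the link text for you. This is only a sketch: the function name `task_link` is made up for illustration, the URL pattern is copied from the result links posted in this thread, and it assumes the forum accepts the usual `[url=...]...[/url]` BBCode form (check the BBCode help page mentioned above).

```python
def task_link(result_id: int, label: str = "") -> str:
    """Build a BBCode link to a GPUGRID result page (hypothetical helper)."""
    url = f"https://www.gpugrid.net/result.php?resultid={result_id}"
    text = label or f"task {result_id}"
    return f"[url={url}]{text}[/url]"


print(task_link(33109419))
```

Paste the printed text into the message box and the task number becomes a clickable link to the full record.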

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59538 - Posted: 24 Oct 2022 | 14:23:15 UTC - in response to Message 59528.

Erich, you still misunderstand. With these Python tasks you can't just rely on the sent and reported times of the task, since it looks like your system sat on these tasks for some time before reporting them. you also can't rely on the runtime counters, since it's been known for a long time that they are incorrect due to the multithreaded nature of the tasks (more cores = more reported runtime), and the amount by which they are incorrect will vary from system to system. the ONLY accurate way to check is to look at the timestamps in the stderr output.


for sure NOT 3 minutes. Example here:

20 Oct 2022 | 1:19:26 UTC 20 Oct 2022 | 2:57:36 UTC Error while computing 3,780.66 3,780.66 --- Python apps for GPU hosts v4.04 (cuda1131)

so, in above example, the task failed after 1 Hr 38 mins.


link to this one: http://www.gpugrid.net/result.php?resultid=33105596

from the stderr:
04:45:25 (5200): wrapper (7.9.26016): starting
04:45:25 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)
04:48:28 (5200): .\7za.exe exited; CPU time 179.609375
04:48:28 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
04:48:29 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
04:48:29 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)
04:49:00 (5200): .\7za.exe exited; CPU time 30.109375
04:49:00 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
04:49:02 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
04:49:02 (5200): wrapper: running python.exe (run.py)
Starting!!
...
[lots of traceback errors here]
[then..]
04:55:55 (5200): python.exe exited; CPU time 3570.937500
04:55:55 (5200): app exit status: 0x1
04:55:55 (5200): called boinc_finish(195)


just look at the timestamps. you started processing the task at 4:45 and boinc finished it at 4:55. it only actually ran for 10 mins. you either waited ~1hr before starting this task, or waited ~1hr before reporting it. it is very common behavior for the BOINC client to extend your project communication time when it detects a computation error.


20 Oct 2022 | 1:44:50 UTC 20 Oct 2022 | 3:08:40 UTC Error while computing 5,195.80 5,195.80 --- Python apps for GPU hosts v4.04 (cuda1131)
here, the task failed after 1 hr 23 mins.


this task here: http://www.gpugrid.net/result.php?resultid=33105606

04:56:11 (9280): wrapper (7.9.26016): starting
...
05:06:33 (9280): called boinc_finish(195)


same story here, only ran for 10 minutes.

but, interestingly enough, here the relation is quite different:
22 Oct 2022 | 6:41:59 UTC 22 Oct 2022 | 7:07:44 UTC Error while computing 70,694.64 70,694.64 --- Python apps for GPU hosts v4.04 (cuda1131)
the task obviously failed after 25 minutes, although runtime and CPU time as indicated would suggest >19 hrs.

These indications are somewhat unclear (to me).


this task here: http://www.gpugrid.net/result.php?resultid=33111849

08:42:24 (6280): wrapper (7.9.26016): starting
...
09:05:40 (6280): called boinc_finish(195)


this one ran for about 23mins. there was less of a delay in starting or reporting this one.

I hope this clarifies what you should be looking at to make accurate determinations about run time.
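
As a rough illustration of that check, here is a minimal Python sketch that subtracts the first wrapper timestamp from the last one. The function name `wrapper_elapsed` is made up for this example, and it assumes every stderr line starts with an HH:MM:SS stamp as in the excerpts above. The log carries no date, so a finish time earlier than the start is treated as the next day.

```python
from datetime import datetime, timedelta


def wrapper_elapsed(start_line: str, finish_line: str) -> timedelta:
    """Return the wall-clock time between two wrapper stderr lines.

    Hypothetical helper: each line is assumed to begin with HH:MM:SS.
    """
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_line.split()[0], fmt)
    finish = datetime.strptime(finish_line.split()[0], fmt)
    if finish < start:
        # No date in the log, so assume the task ran past midnight.
        finish += timedelta(days=1)
    return finish - start


elapsed = wrapper_elapsed(
    "04:45:25 (5200): wrapper (7.9.26016): starting",
    "04:55:55 (5200): called boinc_finish(195)",
)
print(elapsed)  # 0:10:30
```

This only measures the time between wrapper start and boinc_finish; any delay before starting or reporting the task is not part of it.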

____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59539 - Posted: 24 Oct 2022 | 15:06:02 UTC

BOINC itself makes it even easier to check the numbers. In the root of the BOINC data folder, you'll find a plain text file called

job_log_www.gpugrid.net.txt

It contains one line for each successful task, newest at the bottom.

Here's one of my recent shorties - task 33104232

1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 et 541.083257 es 0

That's very dense, but we're only interested in two numbers:

ct 1290.400000
et 541.083257

That's "CPU time" and "elapsed time", respectively. You'll see that both of those have been converted to 1,290.40 in the online report.
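
If you would rather not pick the numbers out by eye, a short Python sketch along these lines could do it. It assumes the layout shown above (a Unix timestamp followed by alternating short keys and values); the function name is hypothetical and nothing here is part of BOINC itself.

```python
def parse_job_log_line(line: str) -> dict:
    """Split one job_log line into a dict of its fields.

    Assumes: first token is a Unix timestamp, the rest are key/value pairs.
    """
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    # Remaining tokens alternate between a short key (ue, ct, fe, nm, et, es)
    # and its value.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value
    return record


line = (
    "1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 "
    "nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 "
    "et 541.083257 es 0"
)
rec = parse_job_log_line(line)
cpu_time = float(rec["ct"])      # "CPU time" in seconds
elapsed_time = float(rec["et"])  # "elapsed time" in seconds
print(f"CPU time {cpu_time:.2f} s, elapsed time {elapsed_time:.2f} s")
```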

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59540 - Posted: 24 Oct 2022 | 15:45:49 UTC

ok guys, many thanks for clarification :-) I now got it :-)

So, as it seems, none of my tasks were running for 23 hours or so before they failed; which is very good news!

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59541 - Posted: 24 Oct 2022 | 16:56:24 UTC - in response to Message 59537.
Last modified: 24 Oct 2022 | 17:07:42 UTC

There should be some page on Boinc itself, can you give the link?

There is. To the top left of the text entry box where you type a message (just below the word 'Author' on the grey divider line), there's a link:

Use BBCode tags to format your text

That opens in a separate browser window (or tab), so you can refer to it while composing your message. Use the 'quote' button below this message to see how I've made the link work here.



Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?
[quote]27329068[quote]
I do not think it will work though.
Forget that I even asked.
[list]27329068[list]
Yuck. How do I get that WU number to pop up?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59542 - Posted: 24 Oct 2022 | 18:13:15 UTC - in response to Message 59541.

Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?

OK, let's go through it step-by-step. This is how my seventy-year-old brain breaks it down. We'll use the most recent one I linked.

I've got it open in another tab. The address bar in that tab is showing the full url:

https://www.gpugrid.net/result.php?resultid=33104232

First, I type the word task into the message.
task

Then, I swipe across that word (all four letters) to highlight it, and click the URL button above the message:

{url}task{/url}

Then, I put an equals sign in the first bracket, and add that address from the other tab:

{url=https://www.gpugrid.net/result.php?resultid=33104232}task{/url}

Finally, I double-click on the number, copy it, and paste it in the central section:

{url=https://www.gpugrid.net/result.php?resultid=33104232}task 33104232{/url}

I've been changing the square brackets into braces, so they can be seen. Changing them back, the finished result is:

task 33104232

In summary:
The first bracket contains the page on the website you want to take people to.
Between the brackets, you can put anything you like - a simple description.
The final bracket simply tidies things up neatly.
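
The same steps can be sketched in a few lines of Python, purely as an illustration of how the three parts fit together. The helper name `bbcode_link` is invented for this example; the result is just the text you would otherwise type by hand.

```python
def bbcode_link(url: str, text: str) -> str:
    """Wrap a description in a BBCode url tag pointing at the given page."""
    return f"[url={url}]{text}[/url]"


task_id = 33104232
link = bbcode_link(
    f"https://www.gpugrid.net/result.php?resultid={task_id}",
    f"task {task_id}",
)
print(link)
```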

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59543 - Posted: 25 Oct 2022 | 11:00:02 UTC - in response to Message 59542.
Last modified: 25 Oct 2022 | 11:06:59 UTC


At least our brains are at par. Maybe the steamships I worked on.

task 27329068

Let us give it a try.
I re-edited. :)

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59544 - Posted: 25 Oct 2022 | 11:13:00 UTC - in response to Message 59543.


Anyway, as you can all read, the txt files being generated get confused about completion time. I watch the Task Manager instead. As soon as the sawtooth pattern stops, I know the task has finished.
It took three minutes.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59545 - Posted: 25 Oct 2022 | 12:16:07 UTC - in response to Message 59544.

This has been reported and explained many times in this thread. These tasks report CPU time as elapsed time. That’s why it’s so far off. Since these tasks are multithreaded, CPU time gets greatly inflated.

A normal GPU task might use 100% of a single core, in that case CPU time matches pretty closely to elapsed time. That’s what we are used to seeing.

However, these tasks are multithreaded, using 32 threads or more for processing (or fewer, if constrained by your physical hardware). When a task is multithreaded, CPU time is equal to the SUM of the CPU time from all the threads that processed that WU. As a simplistic example, say you have a 4-thread CPU and the task used all threads at 75% utilization for 5 minutes. CPU time would be 4*0.75*300 = 900 seconds. Now you can see how adding more cores can greatly increase this number.

Looking at the start and stop timestamps of your task, it ran for about 5 mins.
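
The simplistic example above can be written out as a tiny Python sketch. The function name is hypothetical, and it assumes every thread runs at the same average utilization for the whole run, which real tasks will not do exactly.

```python
def summed_cpu_time(threads: int, utilization: float, elapsed_seconds: float) -> float:
    """CPU time as the sum over all threads, assuming uniform utilization."""
    return threads * utilization * elapsed_seconds


# 4-thread CPU, 75% utilization, 5 minutes of wall-clock time
print(summed_cpu_time(4, 0.75, 300))  # 900.0
```

With the same 5 minutes of elapsed time, a host with more threads simply multiplies the total, which is why the reported figure grows so quickly on large CPUs.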
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59546 - Posted: 25 Oct 2022 | 12:33:46 UTC - in response to Message 59545.

These tasks report CPU time as elapsed time.

Actually, that's not quite right.

The report (made in sched_request_www.gpugrid.net.xml) is accurate - it's after it lands in the server that it's filed in the wrong pocket.

I've got a couple of tasks finishing in the next hour / 90 minutes - I'll try to catch the report for one of them.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59547 - Posted: 25 Oct 2022 | 12:44:47 UTC - in response to Message 59546.

It’s correct. You just misinterpreted my perspective.

I was talking about what the website reports to us. Not what we report to the server.
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59548 - Posted: 25 Oct 2022 | 13:31:48 UTC - in response to Message 59547.

Anyway, I caught one just to clarify my perspective.

<result>
<name>e00021a01361-ABOU_rnd_ppod_expand_demos25_20-0-1-RND2109_0</name>
<final_cpu_time>151352.900000</final_cpu_time>
<final_elapsed_time>54305.405065</final_elapsed_time>
<exit_status>0</exit_status>
<state>5</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>403</version_num>
<plan_class>cuda1131</plan_class>
<final_peak_working_set_size>4950069248</final_peak_working_set_size>
<final_peak_swap_size>17198002176</final_peak_swap_size>
<final_peak_disk_usage>10656485468</final_peak_disk_usage>
<app_version_num>403</app_version_num>

That matches what it says in the job log:

ct 151352.900000 et 54305.405065

But not what it says on the website:

task 33116901

I'm going on about it, because if it was a problem in the client, we could patch the code and fix it. But because it happens on the server, it's not even worth trying. Precision in language matters.
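
For anyone who wants to read the two timing fields out of a captured report, here is a minimal Python sketch. It uses only the two timing elements quoted above, and the closing tag is added by me so that the fragment parses on its own. Reading the ratio of the two as "average number of busy cores" is my own interpretation, not something the client reports.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the report: only the timing fields, with a closing
# </result> tag added so the fragment is well-formed.
fragment = """
<result>
    <final_cpu_time>151352.900000</final_cpu_time>
    <final_elapsed_time>54305.405065</final_elapsed_time>
</result>
"""

result = ET.fromstring(fragment)
cpu_time = float(result.findtext("final_cpu_time"))
elapsed_time = float(result.findtext("final_elapsed_time"))

# CPU seconds per wall-clock second: roughly how many cores were busy on average.
avg_busy_cores = cpu_time / elapsed_time
print(f"cpu {cpu_time:.1f} s, elapsed {elapsed_time:.1f} s, ratio {avg_busy_cores:.2f}")
```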

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59549 - Posted: 25 Oct 2022 | 17:26:09 UTC - in response to Message 59529.


If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.

The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.


I have a question: Currently, I'm running a Python task with 1 core and one GPU.
Would the crunching time decrease if I allocate more cores to this task? 2 cores equals 50%, 4 cores equals 25%?
I know how to tweak the app_config.xml, but I want to ask before I waste time with tinkering.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59550 - Posted: 25 Oct 2022 | 17:41:45 UTC - in response to Message 59549.


If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.

The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.


I have a question: Currently, I'm running a Python task with 1 core and one GPU.
Would the crunching time decrease if I allocate more cores to this task? 2 cores equals 50%, 4 cores equals 25%?
I know how to tweak the app_config.xml, but I want to ask before I waste time with tinkering.


I assume you're talking about the app_config settings when you say "allocate". as a reminder, these settings do not change how much CPU is used by the app. the app uses whatever it needs no matter what settings you choose (up to physical constraints). the only way you can constrain CPU use is to do something like run a virtual machine with fewer cores allocated to it than the host has. otherwise the app still has full access to all your cores, and if you monitor cpu use by the various processes you'll observe this.

if you're not running any other tasks (other CPU projects) at the same time, then changing the CPU allocation likely won't have any impact on your completion times since it's already using all of your cores.
____________

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59551 - Posted: 25 Oct 2022 | 17:53:52 UTC
Last modified: 25 Oct 2022 | 17:54:22 UTC

Thanks for the fast reply. I'm running MCM from WCG on my machine in parallel. I will do a short test and suspend all other tasks. The question is: Will Python add more cores to this task if the other cores become available?

My system: Ryzen 9 5950X, NVidia RTX 3060 Ti, 64 GB RAM, WIN 10

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Message 59552 - Posted: 25 Oct 2022 | 18:02:57 UTC - in response to Message 59551.
Last modified: 25 Oct 2022 | 18:03:40 UTC

don't think of it in that sense.

these tasks will spawn 32+ processes no matter how many cores you have or how much you allocate in BOINC. these processes need to be serviced by the CPU. if you have many processes and not enough threads to service them all, they will need to wait in the priority queue against all other processes.

increasing the BOINC CPU allocation for the Python tasks will stop processing by other competing BOINC CPU tasks, leaving more free available resources to the Python processes. so they will get the opportunity to use more CPU in a shorter amount of time, but probably not much different total CPU time. meaning the tasks should run faster since they aren't competing with the other CPU work.
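
for anyone who wants to try this, here is a hedged Python sketch that builds a minimal app_config.xml. the element layout (app_config / app / name / gpu_versions / gpu_usage / cpu_usage) follows BOINC's documented app_config format. the app name "PythonGPU" and the value 8 are assumptions for illustration only, so check the real app name in your client_state.xml first. this only changes what BOINC budgets for the task, not what the app actually uses.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cpu_usage: float, gpu_usage: float = 1.0) -> str:
    """Return the text of a minimal app_config.xml for one GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    # Number of CPU cores BOINC should budget for each task of this app.
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "PythonGPU" and 8.0 are placeholder values, not confirmed settings.
xml_text = build_app_config("PythonGPU", cpu_usage=8.0)
print(xml_text)
```

the resulting text would go into app_config.xml in the project's folder under the BOINC data directory, after which the client has to re-read its config files.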
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59553 - Posted: 25 Oct 2022 | 18:21:40 UTC - in response to Message 59550.

...the only way you can constrain CPU use is to do something like run a virtual machine with fewer cores allocated to it than the host has. otherwise the app still has full access to all your cores, and if you monitor cpu use by the various processes you'll observe this.

if you're not running any other tasks (other CPU projects) at the same time, then changing the CPU allocation likely won't have any impact on your completion times since it's already using all of your cores.

however, you guys recently stated that the best way is not to run any other projects while processing Python tasks.
I can confirm. A week ago, I ran one LHC-ATLAS task, 2-core (virtual machine) together with 2 Pythons (1 each per GPU), and after a while the system crashed.
Since then, only Pythons are being processed - no crashes so far.

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59554 - Posted: 25 Oct 2022 | 19:03:43 UTC

Well,
CPU load was 100 % before with 30 MCM tasks running in parallel. Now, only the Python task is running and the CPU load is between 40 and 75 %. GPU load has not changed and is between 18 and 22 % like before.

Looks like it is progressing faster than before ;-)

GS
Joined: 16 Oct 22
Posts: 12
Credit: 1,382,500
RAC: 0
Message 59555 - Posted: 25 Oct 2022 | 20:03:23 UTC

Found a nice balance between MCM and Python tasks. Now I run 7 MCM tasks and 1 Python task, and the CPU load is about 99 %.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59556 - Posted: 26 Oct 2022 | 7:25:39 UTC

there was a task which ran for about 20 hours and yielded a credit of 45.000

https://www.gpugrid.net/result.php?resultid=33117861

how come ?

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59557 - Posted: 26 Oct 2022 | 8:38:39 UTC - in response to Message 59556.

Currently, credits are not defined by execution time, but by the maximum possible compute effort. In particular, for these AI experiments, which consist of training AI agents, a maximum number of learning steps for the AI agents is defined as a target. That means that the agent interacts with its simulated environment and then learns from these interactions a certain number of times.

However, if some condition is met earlier, the task ends. There is a certain amount of randomness in the learning process, but the amount of credits is defined by the upper bound of training steps, independently of whether the task finished earlier or not. That is the number of learning steps that the agent would do if the early stopping condition were never met.

In general, the condition is met more often by earlier RL agents in the populations than by later ones. It can also vary from experiment to experiment. Locally, the tasks last 10-14 h on average.
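
To make the credit rule concrete, here is a toy Python sketch. It is not the project's actual training code. The credit rate, the stop probability and all names are hypothetical; the only point is that the credit depends on the step target, not on how many steps were really run.

```python
import random

CREDIT_PER_STEP = 0.001  # hypothetical rate, for illustration only


def run_task(target_steps: int, stop_probability: float, rng: random.Random):
    """Simulate a task that may stop early; return (steps_done, credit)."""
    steps_done = 0
    while steps_done < target_steps:
        steps_done += 1
        # Stand-in for the real early stopping condition, which is not
        # described in detail here.
        if rng.random() < stop_probability:
            break
    # Credit is defined by the upper bound of steps, not by steps_done.
    credit = target_steps * CREDIT_PER_STEP
    return steps_done, credit


early_steps, early_credit = run_task(1000, 0.01, random.Random(0))
full_steps, full_credit = run_task(1000, 0.0, random.Random(0))
print(early_steps, early_credit, full_steps, full_credit)
```

Both runs get the same credit, even though the first one is likely to stop long before the target.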
____________

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59558 - Posted: 26 Oct 2022 | 13:08:43 UTC - in response to Message 59552.

don't think of it in that sense.

these tasks will spawn 32+ processes no matter how many cores you have or how much you allocate in BOINC. these processes need to be serviced by the CPU. if you have many processes and not enough threads to service them all, they will need to wait in the priority queue against all other processes.

increasing the BOINC CPU allocation for the Python tasks will stop processing by other competing BOINC CPU tasks, leaving more free available resources to the Python processes. so they will get the opportunity to use more CPU in a shorter amount of time, but probably not much different total CPU time. meaning the tasks should run faster since they aren't competing with the other CPU work.


I have a question also. Maybe Richard might understand better. I run CPDN tasks also which are very few and far between. So I gave zero resources to Moo Wrapper and ran it in parallel. No CPDN task then Moo would send me WUs.
Now with GPUgrid tasks, this is not the case. These tasks do not register in Boinc as a task for some reason. If I am crunching a GPUgrid task then I SHOULD not get a Moo task. That is the correct procedure but what happened when I shifted from CPDN to here, I was running one GPUgrid(on all cores) task as well as twelve Moo tasks. That is thirteen tasks. I am not worried about if it can be done but why is this happening?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Message 59559 - Posted: 26 Oct 2022 | 13:24:54 UTC - in response to Message 59558.

Without having full details of how your copy of BOINC is configured, and how the tasks from each project are configured to run (in particular, the resource assignment for each task type) it's impossible to say.

This may help:



That machine has six CPU cores, but it's only running five tasks. That's because BOINC has committed 3+1+0.5+0.5+1 = 6 cores, and there are none left. If one of the GPU applications had been configured to require 2.99 CPUs, or 0.49 CPUs, the total core allocation would have fallen "below six", and BOINC's rules say that another task can be started.
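
A heavily simplified Python sketch of that rule, for illustration only: the real BOINC scheduler considers much more than this (GPUs, memory, deadlines, project shares). The function name and the extra sixth task asking for 1 CPU are assumptions of mine.

```python
def tasks_started(cpu_requests, n_cores):
    """Count how many queued tasks start under the simplified rule:
    another task may start as long as the committed CPU total is below n_cores.
    """
    committed = 0.0
    started = 0
    for request in cpu_requests:
        if committed >= n_cores:
            break  # all cores are committed, nothing more can start
        committed += request
        started += 1
    return started


# Five tasks commit exactly 6 cores, so a sixth task has to wait.
print(tasks_started([3, 1, 0.5, 0.5, 1, 1], 6))     # 5
# With 2.99 instead of 3 the total stays below six and a sixth task starts.
print(tasks_started([2.99, 1, 0.5, 0.5, 1, 1], 6))  # 6
```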

[AF] fansyl
Joined: 26 Sep 13
Posts: 20
Credit: 1,714,356,441
RAC: 0
Message 59560 - Posted: 26 Oct 2022 | 14:44:56 UTC - in response to Message 59533.

Example: https://www.gpugrid.net/result.php?resultid=33109419

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINC\slots\3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

Your page file still isn't large enough.



I needed to push the swap file size up to 32 GB, but now it's OK.

Even if the GPU activity rate is low and the Python task does not respect the number of threads allocated to it... no problem, go ahead science !

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59563 - Posted: 29 Oct 2022 | 8:04:38 UTC - in response to Message 59559.
Last modified: 29 Oct 2022 | 8:08:29 UTC

Without having full details of how your copy of BOINC is configured, and how the tasks from each project are configured to run (in particular, the resource assignment for each task type) it's impossible to say.

This may help:



That machine has six CPU cores, but it's only running five tasks. That's because BOINC has committed 3+1+0.5+0.5+1 = 6 cores, and there are none left. If one of the GPU applications had been configured to require 2.99 CPUs or 0.49 CPUs, the total core allocation would have fallen "below six", and BOINC's rules say that another task can be started.

Boinc version 7.20.2. Stock, out of the box. If there is a thread where I can learn mischief let me know.
It is stock Boinc and I have allocated 100% of resources to GPUGrid plus 0% resources to Moo Wrapper. In case of no task from GPUGrid, I can get Moo tasks.
I am in a hot, arid part of South Asia so I have to keep an eye on Temperatures. I don't want a puddle of plastic. Having too many cores is not an advantage in my case.

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 59573 - Posted: 9 Nov 2022 | 23:32:44 UTC

According to my work in progress listings, I received this WU listed in progress: https://www.gpugrid.net/result.php?resultid=33134063 but it is non existent on the computer. Since it doesn't exist, I can't abort it or anything so the project will have to remove it from my queue and reassign it.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59576 - Posted: 11 Nov 2022 | 15:32:21 UTC
Last modified: 11 Nov 2022 | 15:34:27 UTC

on one of my hosts a Python has now been running for almost 3 times as long as all the "long" ones before.
There is CPU activity, also GPU activity + VRAM usage in the usual range. Also RAM.
The slot in the project folder is also filled with some 8,25GB.

Still, I am not sure whether this task has maybe hung in some way.
Could this still be a valid task, or should I terminate it?

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59577 - Posted: 11 Nov 2022 | 16:58:30 UTC - in response to Message 59576.

on one of my hosts a Python has now been running for almost 3 times as long as all the "long" ones before.
There is CPU activity, also GPU activity + VRAM usage in the usual range. Also RAM.
The slot in the project folder is also filled with some 8,25GB.

Still, I am not sure whether this task has maybe hung in some way.
Could this still be a valid task, or should I terminate it?

I now looked up the task history - it failed on 7 other hosts.
So I'd better cancel it :-)

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Message 59578 - Posted: 11 Nov 2022 | 19:38:42 UTC - in response to Message 59577.
Last modified: 11 Nov 2022 | 19:38:56 UTC

Can you check whether wrapper_run.out is still changing, and how many samples have been collected?
There should be a config file in the slot directory that contains the start sample number and the end sample number. You can subtract one from the other to determine the target number of samples.

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Message 59579 - Posted: 12 Nov 2022 | 1:12:40 UTC - in response to Message 59578.
Last modified: 12 Nov 2022 | 1:12:54 UTC

File name is conf.yaml
parameters are
start_env_steps and target_env_steps.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59580 - Posted: 12 Nov 2022 | 6:37:03 UTC - in response to Message 59579.
Last modified: 12 Nov 2022 | 6:59:46 UTC

File name is conf.yaml
parameters are
start_env_steps and target_env_steps.

I had already aborted the task mentioned above by the time I read your posting.

But I looked up the figures in a task which is in process right now. It says:

32start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000

so what exactly do the figures mean: in this case, about half of the task has been processed?

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Message 59581 - Posted: 12 Nov 2022 | 11:31:19 UTC - in response to Message 59580.

I think it means that previous crunchers have already crunched up to 25000000 steps and your workunit will continue to 50000000.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59582 - Posted: 12 Nov 2022 | 17:11:52 UTC - in response to Message 59581.
Last modified: 12 Nov 2022 | 17:15:31 UTC

Yes this is exactly what it means. Most parameters in the config file define the specifics of the agent training process.

In this case these parameters specify that the initial AI agent will be loaded from a previous agent that already took 25_000_000 (25M) steps in its simulated environment, so it is not taking completely random actions. The agent will continue the process, interacting 25_000_000 more times with the environment and learning from its successes and failures.

Other parameters specify the type of algorithm used for learning, the number of copies of the environment used to speed up the interactions (32), and many other things.
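
As a small illustration of the subtraction mentioned earlier in the thread, here is a Python sketch that reads the two step parameters from the text of conf.yaml. It uses plain string handling rather than a YAML library, so it only understands simple "key: value" lines, and the function name is hypothetical.

```python
def read_step_range(conf_text: str):
    """Return (start_env_steps, target_env_steps) from simple key: value lines."""
    wanted = ("start_env_steps", "target_env_steps")
    values = {}
    for line in conf_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            values[key.strip()] = int(value.strip())
    return values["start_env_steps"], values["target_env_steps"]


# The three lines quoted from a running task earlier in the thread.
conf_text = """\
start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000
"""

start, target = read_step_range(conf_text)
print(target - start)  # 25000000
```

The difference is the number of environment steps this particular work unit is asked to add on top of what earlier work units already did.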
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59583 - Posted: 15 Nov 2022 | 20:17:08 UTC

what I noticed within the past few days is that the runtime of the Pythons has increased.
Whereas until a short time ago some tasks on all of my hosts finished in under 24 hrs, now every task takes > 24 hrs.

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Message 59584 - Posted: 15 Nov 2022 | 23:56:22 UTC - in response to Message 59583.

Try to reduce the number of simultaneously running workunits.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Message 59585 - Posted: 16 Nov 2022 | 2:48:12 UTC

I've rarely had a short runner in weeks. Now almost all tasks take more than 24 hours.

Missing by a few minutes usually which is disheartening.

But I won't be reducing the compute load since I only run a single Python task on each host along with multiple other projects work.

I just accept the lesser credit while still maintaining a full load of my other projects which aren't impacted too much by the single Python task.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59586 - Posted: 16 Nov 2022 | 7:16:47 UTC

What I am noticing is, my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free without companions to keep it company.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Message 59587 - Posted: 16 Nov 2022 | 12:59:30 UTC - in response to Message 59586.

What I am noticing is, my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free without companions to keep it company.

this is exactly my observation, too.

Asghan
Joined: 30 Oct 19
Posts: 6
Credit: 405,900
RAC: 0
Message 59589 - Posted: 22 Nov 2022 | 9:35:28 UTC
Last modified: 22 Nov 2022 | 9:36:04 UTC

The only thing I noticed:
The biggest lie for the new python tasks is "0.9 CPU".
My current task, and the one before, were/is using 20 out of my 24 cores on my 5900X...

Please support the Tensor Cores as soon as possible, my 4090 is getting bored :/

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Message 59590 - Posted: 22 Nov 2022 | 17:10:26 UTC - in response to Message 59589.

Some errored tasks crashed because someone was trying to run them on a GTX 680 with 2 GB of VRAM.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Message 59591 - Posted: 23 Nov 2022 | 8:08:36 UTC
Last modified: 23 Nov 2022 | 8:28:39 UTC

task 33145039
Example. Seven computers have crashed this work unit. Richard or someone else who can read the files can find out why.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59592 - Posted: 23 Nov 2022 | 9:42:09 UTC - in response to Message 59591.

Hello! I just checked the failed submissions of this job, and in each case it failed for a different reason.


1. ERROR: Cannot set length for output file : There is not enough space on the disk
2. DefaultCPUAllocator: not enough memory (GPU memory?)
3. RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (GPU not supported by cuda?)
4. Failed to establish a new connection (the connection failed while installing the only pip-installable dependency)
5. AssertionError. assert ports_found (some port configuration missing?)
6. BrokenPipeError: [WinError 232] The pipe is being closed (for some reason multiprocessing broke, I am guessing not enough memory since windows uses much more memory than linux when running multiprocessing)
7. lbzip2: Cannot exec: No such file or directory

It is quite unlikely for a job to fail 7 times, but because each machine has a different configuration, it is very difficult to cover all cases. That is the reason why jobs are resubmitted multiple times after failure: to be fault tolerant.
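
If you want to check your own stderr against the seven cases listed above, a Python sketch like the following could help. The signature strings are taken from that list. The short labels and the function name are my own assumptions, and a match is only a hint, not a diagnosis.

```python
# Signature text (from the list above) mapped to a short, assumed label.
KNOWN_FAILURES = {
    "There is not enough space on the disk": "disk full",
    "DefaultCPUAllocator: not enough memory": "out of memory",
    "Unable to find a valid cuDNN algorithm": "cuDNN could not run the convolution",
    "Failed to establish a new connection": "network problem while installing a dependency",
    "assert ports_found": "port configuration problem",
    "BrokenPipeError": "multiprocessing pipe closed",
    "lbzip2: Cannot exec": "lbzip2 not found",
}


def classify_stderr(stderr_text: str) -> list:
    """Return the labels of all known signatures found in the stderr text."""
    return [label for signature, label in KNOWN_FAILURES.items()
            if signature in stderr_text]


print(classify_stderr("lbzip2: Cannot exec: No such file or directory"))
```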
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59593 - Posted: 23 Nov 2022 | 10:05:58 UTC - in response to Message 59589.
Last modified: 23 Nov 2022 | 10:34:57 UTC

These tasks alternate between GPU usage and CPU usage, so would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

I was looking at pytorch documentation (the python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) for automatic-mixed-precision and it says:

(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59594 - Posted: 23 Nov 2022 | 13:15:31 UTC - in response to Message 59593.

you'd need to find a way to get the task loaded fully to the GPU. the environment training that you're doing on CPU, can you do that same processing on the GPU? probably.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59598 - Posted: 26 Nov 2022 | 7:39:46 UTC - in response to Message 59593.

These tasks alternate between GPU usage and CPU usage, so would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

I was looking at pytorch documentation (the python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) for automatic-mixed-precision and it says:

(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.

-----------------
Thank you.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59627 - Posted: 21 Dec 2022 | 3:43:17 UTC

I'm being curious here...

These Python apps don't seem to report their virtual memory usage accurately on my hosts. They show 7.4GB while my commit charge shows 52GB+ (with 16GB RAM).

They report more CPU time than the amount of time it actually took my hosts to finish them.

They're also causing the CPU usage to max out around 50% when there are no other CPU tasks running, no matter what my boinc manager CPU usage limit is.

Could anyone please explain this to a confused codger?
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59639 - Posted: 22 Dec 2022 | 7:50:45 UTC - in response to Message 59627.
Last modified: 22 Dec 2022 | 7:52:43 UTC

These tasks are a bit particular, because they use multiprocessing and also interleave stages of CPU utilisation with stages of GPU utilisation.

The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads). That, together with the fact that the tasks use a python library for machine learning called PyTorch, accounts for the large virtual memory (every thread commits virtual memory when the package is imported, even though it is not later used).

The switching between CPU and GPU phases could be causing the CPUs to sit at 50%.

Other users have found configurations that improve resource utilisation by running more than one task; some of these configurations are shared in this forum.
____________

gemini8
Avatar
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 2,234,559,169
RAC: 132,659
Level
Phe
Scientific publications
watwat
Message 59640 - Posted: 22 Dec 2022 | 8:14:50 UTC - in response to Message 59639.

The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads).

I don't think so.
The CPU-time should be correct, it's just that the overall runtime is faulty.
You can easily see that if you compare the runtime to the send and receive times.
____________
- - - - - - - - - -
Greetings, Jens

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59652 - Posted: 25 Dec 2022 | 20:27:03 UTC
Last modified: 25 Dec 2022 | 20:52:26 UTC

Feliz navidad, amigos!

Odd thing about Pythons using GPU. They seem incoherent about their time reportage.

I see them finish in around 10-12 hours but the cpu time is much greater. Often around 80 hrs.

Looking at the properties of a running task I see:


Application
Python apps for GPU hosts 4.04 (cuda1131)
Name
e00007a01485-ABOU_rnd_ppod_expand_demos30_2_test2-0-1-RND0975
State
Running
Received
12/25/2022 1:49:40 AM
Report deadline
12/30/2022 1:49:40 AM
Resources
0.988 CPUs + 1 NVIDIA GPU
Estimated computation size
1,000,000,000 GFLOPs
CPU time
3d 00:26:17
CPU time since checkpoint
00:04:01
Elapsed time
09:17:00
Estimated time remaining
07:04:18
Fraction done
96.160%
Virtual memory size
6.91 GB
Working set size
1.66 GB
Directory
slots/0
Process ID
16952
Progress rate
10.440% per hour
Executable
wrapper_6.1_windows_x86_64.exe
________________

Notice the cpu time vs the elapsed time.
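To put a number on that gap, here is a small sketch using only the two figures from the properties above. It assumes the reported CPU time is summed over every process the task spawns, which is how it has been explained in this thread; the helper name `to_seconds` is made up for the example.

```python
def to_seconds(text):
    """Convert a BOINC-style duration such as '3d 00:26:17' or '09:17:00' to seconds."""
    days = 0
    if "d " in text:
        day_part, text = text.split("d ")
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Figures copied from the task properties listed above.
cpu_seconds = to_seconds("3d 00:26:17")
elapsed_seconds = to_seconds("09:17:00")

# If CPU time is summed over all processes, this ratio is the average
# number of CPU threads the task kept busy while it ran.
busy_cores = cpu_seconds / elapsed_seconds
print(f"{busy_cores:.1f} threads busy on average")  # 7.8
```

Read that way, the task kept roughly eight threads busy on average, which is far from the 0.988 CPUs shown under Resources.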

I also see that the estimate of time remaining runs ridiculously high.

Though the wrapper claims to be 0.988 CPUs it is actually using up to 70% on machines with fewer cores when nothing else is running. The more cpu time slices it can get of any available threads the faster it seems to run up to the point of max usable by the wrapper. It also eats up as much as 50GB of commit charge (total memory) and more.

It seems to be immune to the BOINC manager limits on cpu usage, so it can easily peg your processor usage with other projects running. Setting max processor usage at 25-33% should ensure that the WUs finish within the 105,000 point bonus time frame if you are running other projects simultaneously.

Another observation I made was that dual graphics cards don't seem to work with this wrapper on my hosts. GPU1 always seemed to stay at 4% or so while GPU0's WU ran at a reduced speed.
This limitation is coincidentally identical to my experience on MLC@Home. In addition, I am seeing that very modest GPUs (1050s and such) are just as effective as the latest models at producing points when the cpu runs unconstrained. That was also noted at MLC from my experience.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59653 - Posted: 25 Dec 2022 | 21:43:07 UTC

This has been commented on extensively in this thread if you had read it.

The cpu_time is not calculated correctly because BOINC has no mechanism to deal with these one-of-a-kind tasks, which use machine learning and are of a dual cpu-gpu nature.

The tasks spawn 32 processes on your cpu and will use a significant amount of cpu resources and main and virtual memory.

They sporadically use your gpu in brief spurts of computation before passing computation back to the cpu.

As long as the gpu has 4 GB of VRAM, the tasks can be run on very moderate gpu hardware.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59655 - Posted: 26 Dec 2022 | 10:42:37 UTC - in response to Message 59652.
Last modified: 26 Dec 2022 | 10:43:03 UTC

You can create app_config.xml with

<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact/>
</app>
</app_config>

It should make it display more accurate time estimation.
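For reference, a rough sketch of the arithmetic this option is meant to switch on. It assumes the client then extrapolates the remaining time purely from elapsed time and the fraction done reported by the task; that is my reading of the option, not something checked against the client source. The function name is made up for the example, and the numbers are the ones from the task properties posted earlier.

```python
def remaining_seconds(elapsed_seconds, fraction_done):
    """Extrapolate the remaining run time from progress so far (linear assumption)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds * (1 - fraction_done) / fraction_done


# Elapsed time 09:17:00 and fraction done 96.160%, from the task properties above.
elapsed = 9 * 3600 + 17 * 60
estimate = remaining_seconds(elapsed, 0.96160)
print(f"about {estimate / 60:.0f} minutes remaining")  # about 22 minutes
```

That is a long way from the 07:04:18 the manager displayed for the same task, so the option is worth trying even if it does not help on every host.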

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59656 - Posted: 26 Dec 2022 | 11:21:30 UTC - in response to Message 59655.

You can create app_config.xml with
<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact/>
</app>
</app_config>

It should make it display more accurate time estimation.

for me, this worked well with all the ACEMD tasks. It does NOT work with the Pythons.
I am talking about Windows; maybe it works with Linux, no idea.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59657 - Posted: 26 Dec 2022 | 18:53:30 UTC
Last modified: 26 Dec 2022 | 18:54:20 UTC

As I mentioned BOINC has no idea how to display these tasks because they do not fit in ANY category that BOINC is coded for.

So no existing BOINC mechanism can properly display the cpu usage or get even close with time estimations.

Does not matter whether the host is Windows, Mac or Linux based. The OS has nothing to do with the issue.

The issue is BOINC.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59658 - Posted: 26 Dec 2022 | 22:51:48 UTC

Thanks for the tips guys. Sorry about being captain obvious there, I just rejoined this project and should have caught up on the thread before reporting my observations.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59676 - Posted: 4 Jan 2023 | 8:09:36 UTC

Right now: ~14,200 "unsent" Python tasks in the queue.

I guess it will take a while until they all are processed.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59677 - Posted: 4 Jan 2023 | 22:00:56 UTC
Last modified: 4 Jan 2023 | 22:39:03 UTC

Looks good but getting some bad WUs.

Had 3 errors in a row on the same host and thought it was something about the machine until I checked to see who else ran them. They were on their last chance runs.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59678 - Posted: 5 Jan 2023 | 7:00:23 UTC - in response to Message 59677.

Could you provide the name of the task? I will take a look at the errors.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59680 - Posted: 5 Jan 2023 | 12:37:46 UTC - in response to Message 59678.

Could you provide the name of the task? I will take a look at the errors.

In case you don't want to wait until he reads your post and replies, look here:

http://www.gpugrid.net/results.php?hostid=602606

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59681 - Posted: 5 Jan 2023 | 13:41:30 UTC - in response to Message 59680.

It seems some of them need more pagefile than usual

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59682 - Posted: 5 Jan 2023 | 18:00:53 UTC

I see the tasks that my host and some others crashed were successfully finished eventually. Sorry to have assumed before the fact.
I suspect my host had errors because it was running Mapping Cancer Markers concurrent with Python. Once I suspended WCG tasks it has run error free.

Thanks to Erich for providing the host link.
Sorry for the misinformation.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59683 - Posted: 5 Jan 2023 | 18:40:57 UTC

abouh,

can you confirm the section of code that the task spends the most time on?

is it here?

while not learner.done():

learner.step()


I'm still trying to track down why AMD systems use so much more CPU than Intel systems. I even went so far as to rebuild the numpy module against MKL (yours is using the default BLAS, not MKL or OpenBLAS) and inject it into the environment package, but again it made no difference, probably because it looks like numpy is barely used in the code anyway and not in the main loop.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59684 - Posted: 6 Jan 2023 | 17:48:38 UTC - in response to Message 59682.

Pop Piasa wrote:

I suspect my host had errors because it was running Mapping Cancer Markers concurrent with Python. Once I suspended WCG tasks it has run error free.

I had the same experience when I began crunching Pythons.
Best is not to run anything else.

theBigAl
Send message
Joined: 4 Oct 22
Posts: 4
Credit: 2,289,945,306
RAC: 85,635
Level
Phe
Scientific publications
wat
Message 59685 - Posted: 7 Jan 2023 | 1:05:39 UTC - in response to Message 59684.
Last modified: 7 Jan 2023 | 1:52:20 UTC

I've been running WCG (CPU only tasks though) and GPUGrid concurrently for the past few days and it's working out fine so far.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59686 - Posted: 7 Jan 2023 | 18:38:56 UTC - in response to Message 59685.

My Intel hosts seem to have no problems, only my Ryzen5-5600X. Same memory in all of them. That is indeed odd because theBigAl is using the exact same processor without errors. one difference is that theBigAl is running Windows 11 where I have Win 10 on my host.

Erich56 is spot-on that Python likes to have the machine (or virtual machine) to itself for these integrated gpu tasks. I have seen them drop from around 14 hrs. to under twelve hrs. completion time by stopping concurrent projects.

How does one get two or more of these to run with multiple gpus in a host?
I took a second card back out of one of my hosts because it just slowed it down running these.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59689 - Posted: 8 Jan 2023 | 0:35:56 UTC - in response to Message 59686.

Because of the unique issue with virtual memory on Windows compared to Linux, I don't know if running more than a single task is doable, let alone running multiple gpus.

And yes, it is possible to run more than a single gpu on these tasks in Linux.

My teammate Ian has been running 3X concurrently on his 2X 3060's and now 2X RTX A4000 gpus.

theBigAl
Send message
Joined: 4 Oct 22
Posts: 4
Credit: 2,289,945,306
RAC: 85,635
Level
Phe
Scientific publications
wat
Message 59690 - Posted: 8 Jan 2023 | 3:04:22 UTC - in response to Message 59686.

My Intel hosts seem to have no problems, only my Ryzen5-5600X. Same memory in all of them. That is indeed odd because theBigAl is using the exact same processor without errors. one difference is that theBigAl is running Windows 11 where I have Win 10 on my host.


I don't know if it'll help, but I have allocated 100 GB of virtual memory swap for the computer, which is probably overkill but doesn't hurt to try if you've got the space.

I'll up that to 140 GB when I receive my 3060ti tomorrow and test whether it can run multiple GPU tasks on Win11 (probably not, and even if it does it'll run a lot slower since it'll be CPU bound then)

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59691 - Posted: 8 Jan 2023 | 6:39:09 UTC - in response to Message 59689.

Keith Myers wrote:

Because of the unique issue with virtual memory on Windows compared to Linux, I don't know if running more than a single task is doable, let alone running multiple gpus.

On my host with 1 GTX980ti and 1 Intel i7-4930K I run 2 Pythons concurrently.
On my host with 2 RTX3070 and 1 i9-10900KF I run 4 Pythons concurrently.
On my host with 1 Quadro P5000 and 2 Xeon E5-2667 v4 I run 4 Pythons concurrently.

All Windows 10.
No problems at all (except that I don't make it below 24 hours with any task)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59693 - Posted: 8 Jan 2023 | 14:20:12 UTC
Last modified: 8 Jan 2023 | 14:30:40 UTC

abouh,

as a followup to my previous post, I think I've narrowed down the issue in your script/setup that causes unnecessarily high CPU use for newer and high core count hosts. I was able to reduce the CPU use from 100% to 40%, and speed up task execution at the same time (due to far fewer scheduling conflicts with so many running processes). I was able to connect with someone who understands these tools and they helped me figure out what's wrong. I'll paraphrase their comments and findings below.

the basic answer is that the thread pool isn't configured correctly for wandb. (It's only configured for the parser, so it's unlikely to be correctly limiting the number of threads, and there's likely a soft error somewhere.)

Line 447 & 448
spawns threads, but doesn't specify them anywhere.

Line 373
defines how many thread processes will be used, but it doesn't seem to work correctly. it's defined as 16, but changing this value does nothing, and on my 64-core system, 64 processes are spun up for each running task, in addition to the 32 agents spawned. a 16-core CPU will spin up 16+32 processes, and so on. trying to run 10 concurrent tasks on my 64-core system results in a staggering 960 processes being run, which seems to cripple the system and slows things down as a result.
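The process counts quoted above follow from simple arithmetic. A minimal sketch, assuming each task starts one pool process per CPU thread on the host plus the 32 agents (the function and constant names are made up for the example):

```python
AGENTS_PER_TASK = 32  # agent processes spawned by each task, per the figures above


def processes_spawned(cpu_threads, concurrent_tasks):
    """Total processes if every task starts one pool process per CPU thread plus its agents."""
    return concurrent_tasks * (cpu_threads + AGENTS_PER_TASK)


print(processes_spawned(16, 1))   # 48
print(processes_spawned(64, 10))  # 960
```

The count grows with both the host's thread count and the number of concurrent tasks, which is why high core count hosts are hit hardest.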

https://docs.wandb.ai/guides/track/advanced/distributed-training
(by end of the page, shows how they are configured correctly)

do you get the error log in the npz output? is this sent back with tasks? I tried to read this file but could not, it's compressed or encrypted. it may contain more information about what is set up wrong with the wandb mp pool.

I was able to work around this issue by setting environment variables to put hard limits on the number of processes used. i edited run.py at line 445 with:

NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


but it's not a proper fix. I added further workarounds to make this a little more persistent for myself, but it will need to be fixed by the project to fix for everyone. proper fix would be investigating what is the soft error in the error log file, with full access to the job (which we don't have - and we cannot implement proper mp without it).

you could band-aid fix with the same edits I have for run.py, but it might cause issues if you have fewer than 8 threads, I guess? or maybe it's fine since the script launches so many processes anyway. I'm still testing to see if there's a point where fewer threads in run.py actually slows the task down. on these fast CPUs I might be able to run as few as 4.
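One way to sidestep the "fewer than 8 threads" worry is to cap the limit at whatever the host actually has. This is only a sketch of that idea, not what run.py does: the helper name `limit_threads` is made up, and the variable names are copied as-is from the workaround above without checking which libraries read each one. Variables like these are generally read when a numeric library is first loaded, so the call would need to come before those imports.

```python
import os

# Variable names copied from the workaround above; which libraries honour
# each one has not been verified here.
THREAD_VARIABLES = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "CNN_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_threads(requested=8):
    """Set the thread-limit variables to min(requested, host threads), at least 1."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    limit = str(max(1, min(requested, available)))
    for name in THREAD_VARIABLES:
        os.environ[name] = limit
    return limit


# Call this before importing numpy, torch, or anything else that sizes a thread pool.
limit = limit_threads(8)
print("thread limit set to", limit)
```

On a 4-thread host this sets everything to 4 rather than 8; on a 64-thread host it behaves exactly like the hard-coded "8".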
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59695 - Posted: 8 Jan 2023 | 16:55:28 UTC

Thanks for continuing to dig into this high cpu usage bug, Ian. I missed the last convos on your other thread at STH I guess.

Hope that abouh can implement a proper fix. That should increase the return rate dramatically I think.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59697 - Posted: 9 Jan 2023 | 8:10:53 UTC - in response to Message 59683.
Last modified: 9 Jan 2023 | 10:56:29 UTC

Hello Ian,

learner.step()


is the line of code the task spends the most time on. This function first handles the collection of data (CPU intensive) and then takes one learning step (updating the weights of the agent neural networks, GPU intensive).

Regarding your findings with respect to wandb, I could remove the wandb dependency. I can simply make a run.py script that does not use wandb. It is nice to have a way to log extra training information, but not at the cost of reducing task efficiency. And I get a part of that information anyway when the task comes back. I understand that simply getting rid of wandb would be the best solution right?

Thanks a lot for your help!

If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 now) is processed
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59698 - Posted: 9 Jan 2023 | 13:39:07 UTC - in response to Message 59697.

Hello Ian,

learner.step()


is the line of code the task spends the most time on. This function first handles the collection of data (CPU intensive) and then takes one learning step (updating the weights of the agent neural networks, GPU intensive).

Regarding your findings with respect to wandb, I could remove the wandb dependency. I can simply make a run.py script that does not use wandb. It is nice to have a way to log extra training information, but not at the cost of reducing task efficiency. And I get a part of that information anyway when the task comes back. I understand that simply getting rid of wandb would be the best solution right?

Thanks a lot for your help!

If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 now) is processed


removing wandb could be a start, but it's also possible that it's not the sole cause of the problem.

are you able to see any soft errors in the logs from reported tasks?

do you have any higher core count (32+ cores) systems in your lab or available to test on?

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59699 - Posted: 10 Jan 2023 | 6:06:09 UTC - in response to Message 59693.
Last modified: 10 Jan 2023 | 6:07:30 UTC

ok, in that case I will start by removing wandb in the next batch of tasks. Let’s see if that improves performance. I will make a post to inform about the submission once it is done, will probably still take a few days since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags to see what happens.



NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


Unfortunately the error logs I get do not say much… at least I don’t see any soft errors. Is there any information which can be printed from the run.py script that would help?

Regarding full access to the job, the python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59701 - Posted: 10 Jan 2023 | 11:58:55 UTC

I think I may have encountered the Linux version of the Windows virtual memory problem.

I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.

It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.

I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Scientific publications
wat
Message 59702 - Posted: 10 Jan 2023 | 12:26:44 UTC - in response to Message 59701.

There are programs that can display what files use most space on disk. For example K4DirStat

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59703 - Posted: 10 Jan 2023 | 13:11:55 UTC - in response to Message 59699.

ok, in that case I will start by removing wandb in the next batch of tasks. Let’s see if that improves performance. I will make a post to inform about the submission once it is done, will probably still take a few days since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags to see what happens.



NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


Unfortunately the error logs I get do not say much… at least I don’t see any soft errors. Is there any information which can be printed from the run.py script that would help?

Regarding full access to the job, the python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).


i'm sure if you set those same env flags, you'll get the same result I have. less CPU use and threads used for python per task based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either. will need to run it a while longer to be sure.

let me get back to you if you could print some errors from within the run.py script.

and yeah, no worries about waiting for the batch to finish up. still over 9000 tasks to go.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59704 - Posted: 10 Jan 2023 | 13:31:49 UTC - in response to Message 59701.

I think I may have encountered the Linux version of the Windows virtual memory problem.

I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.

It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.

I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory.


probably need some more context about the system.

how much disk drive space does it have?
how much of that space have you allowed BOINC to use?
how many Python tasks are you running?
Do you have any other projects running that cause high disk use?

each expanded and running GPUGRID_Python slot looks to take up about 9GB. (the 2.7GB archive gets copied there, expanded to ~6.xGB, and the archive remains in place). so that's 9 GB per task running + ~5GB for the GPUGRID project folder, depending on whether you've cleaned up old apps/archives or not. if your project folder is carrying lots of the old apps, a project reset might be in order to clean it out.
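As a quick sanity check against those figures, something like the sketch below compares the free space on a drive with the estimate. The 9 GB and 5 GB constants are the rough numbers from this post, not measured values, and the function names are made up for the example.

```python
import shutil

SLOT_GB = 9     # approximate disk use per running Python task (figure from the post)
PROJECT_GB = 5  # approximate size of the GPUGRID project folder (figure from the post)


def disk_needed_gb(running_tasks):
    """Rough disk budget: project folder plus one expanded slot per running task."""
    return PROJECT_GB + SLOT_GB * running_tasks


def has_room(path, running_tasks):
    """True if the drive holding `path` has at least the estimated space free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= disk_needed_gb(running_tasks)


print(disk_needed_gb(1))  # 14
print(has_room(".", 1))   # point this at the BOINC data directory instead of "."
```

Note that this looks at what is free on the drive, not at what BOINC's own disk preferences allow it to use, so the client can still refuse work when this check passes.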


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59706 - Posted: 10 Jan 2023 | 14:35:35 UTC - in response to Message 59704.

how much disk drive space does it have?
how much of that space have you allowed BOINC to use?
how many Python tasks are you running?
Do you have any other projects running that cause high disk use?

This is what BOINC sees:



It's running on a single 512 GB M.2 SSD. Much of that 200 GB is used by the errant project, and is dormant until they get their new upload server fettled.
One Python task - the other GPU is excluded by cc_config.
Some Einstein GPU tasks are just finishing. Apart from that, just NumberFields (lightweight integer maths).

Within the next half hour, the Einstein tasks will vacate the machine. I'll try one Python, solo, as an experiment, and report back.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59707 - Posted: 10 Jan 2023 | 15:23:53 UTC - in response to Message 59706.

So it looks like you’ve set BOINC to be allowed to use the whole drive or so? Or only 50%?

The 234GB “used by other programs” seems odd. Are you using this system to store a large amount of personal files too? Do you know what is taking up nearly half of the drive that’s not BOINC related?

If you’re not aware of what’s taking up that space, check /var/log/. I’ve had large numbers of errors fill up the syslog and kern.log files, and with them the disk.
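
One way to check is to list the biggest files under the log directory. A minimal sketch using only the Python standard library; the function name is illustrative, and reading some log files may need elevated permissions:

```python
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: do not follow symlinks, so linked files are not counted twice
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is not accessible
    sizes.sort(reverse=True)
    return sizes[:top_n]


if __name__ == "__main__":
    for size, path in largest_files("/var/log"):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

The same function can be pointed at the BOINC data directory to see which project folders are the heavy ones.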
____________

Richard Haselgrove
Message 59709 - Posted: 10 Jan 2023 | 15:57:46 UTC - in response to Message 59707.
Last modified: 10 Jan 2023 | 16:20:03 UTC

The machine is primarily a BOINC cruncher, so yes - BOINC is allowed to use what it wants. I'm suspicious about those 'other programs', too - especially as my other Linux machine shows a much lower figure. The main difference between them is that I did an in-situ upgrade from Mint 20.3 to 21 not long ago, and the other machine is still at 20.3 - I suspect there may be a lot of rollback files kept 'just in case'.

And yes, I'm suspicious of the logs too - especially BOINC writing to the systemd journal, and that upgrade. Next venue for an exploration.

I've been watching the disk tab in my peripheral vision, as the test task started. 'Free space for BOINC' fell in steps through 26, 24, 22, 21, 20 as it started, and has stayed there. Now at around 10% progress / 1 hour elapsed.

Should have mentioned - machine has 64 GB of physical RAM, in anticipation of some humongous multi-threaded tasks to come.

Edit - new upload server won't be certified as 'fit for use' until tomorrow, so I've started Einstein again.

kotenok2000
Message 59711 - Posted: 10 Jan 2023 | 22:37:51 UTC - in response to Message 59709.

What other project?

Richard Haselgrove
Message 59712 - Posted: 10 Jan 2023 | 22:50:16 UTC - in response to Message 59711.

What other project?

Name redacted to save the blushes of the guilty!

Richard Haselgrove
Message 59713 - Posted: 11 Jan 2023 | 8:04:22 UTC

Looks like this was a false alarm - the probe task finished successfully, and I've started another. Must have been timeshift all along.

The nameless project is still hors de combat. The new server is alive and ready, but can't be accessed by BOINC.

kotenok2000
Message 59714 - Posted: 11 Jan 2023 | 8:52:11 UTC - in response to Message 59713.

You mean ithena?

Ian&Steve C.
Message 59737 - Posted: 18 Jan 2023 | 16:57:24 UTC - in response to Message 59703.

ok, in that case I will start by removing wandb in the next batch of tasks. Let’s see if that improves performance. I will make a post to inform about the submission once it is done, will probably still take a few days since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags. To see what happens.



import os

NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


Unfortunately the error logs I get do not say much… at least I don’t see any soft errors. Is there any information which can be printed from the run.py script that would help?

Regarding full access to the job, the python package we use to train the AI agents is public and mostly based in pytorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).


i'm sure if you set those same env flags, you'll get the same result I have. less CPU use and threads used for python per task based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either. will need to run it a while longer to be sure.

let me get back to you if you could print some errors from within the run.py script.

and yeah, no worries about waiting for the batch to finish up. still over 9000 tasks to go.


4 seems to be working fine.

abouh, if removing wandb doesn't fix the problem, then adding the env variables listed above with num_threads = 4 will probably be a suitable workaround for everyone. probably not many hosts with fewer than 4 threads these days.
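
A minimal sketch of that workaround as a helper, reusing the variable names from the snippet quoted above. The helper name is made up, and the ordering note is an assumption: such limits are typically read when a numeric library initialises, so they would need to be set before numpy, torch and the like are imported.

```python
import os

# Variable names copied from the snippet quoted earlier in this post.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "CNN_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def cap_library_threads(num_threads=4):
    """Set every thread-limit variable; environment values must be strings."""
    value = str(int(num_threads))
    for name in THREAD_ENV_VARS:
        os.environ[name] = value
    return value


cap_library_threads(4)
# Assumption: the heavy imports (numpy, torch, ...) happen only after this point.
```

This only limits the threads the math libraries create inside each process; it does not change how many python processes the task spawns.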
____________

Ryan Munro
Message 59757 - Posted: 18 Jan 2023 | 23:18:39 UTC - in response to Message 59737.

Excuse the dumb question but would that then mean the app would only spin up 4 threads?
On Windows, I have manually capped the app to 24 threads and it uses all of them, my Linux box capped at 6 threads has half the threads idling.
Both seem to take about the same time though, what is the Windows app doing with all the threads that the Linux app does not need?

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59761 - Posted: 19 Jan 2023 | 8:54:22 UTC - in response to Message 59737.
Last modified: 19 Jan 2023 | 9:13:41 UTC

I have been testing the new script without wandb and with the proposed environ configuration, and it works fine. On my machine performance is similar, but I am looking forward to receiving feedback from other users.

I also need to update the PyTorchRL library (our main dependency), so my idea is to follow these steps:

1. Wait for the current batch to finish (currently 3,726 tasks)
2. Then I will update PyTorchRL library.
3. Following I will send a small batch (20-50) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but it is always worth sending a test batch to Beta in my opinion)
4. Send again a big batch with the new code to PythonGPU.

The app will be short of tasks for a brief period of time but even though the new version of PyTorchRL does not have huge changes I don't want to risk updating it now while 3000+ tasks are still on the queue.

I will make a post once I submit the Beta tasks.
____________

kotenok2000
Message 59762 - Posted: 19 Jan 2023 | 12:57:01 UTC - in response to Message 59761.

Can http://www.gpugrid.net/apps.php link be put next to Server status link?

Ian&Steve C.
Message 59764 - Posted: 19 Jan 2023 | 13:48:56 UTC - in response to Message 59761.

I have been testing the new script without wandb and with the proposed environ configuration, and it works fine. On my machine performance is similar, but I am looking forward to receiving feedback from other users.

I also need to update the PyTorchRL library (our main dependency), so my idea is to follow these steps:

1. Wait for the current batch to finish (currently 3,726 tasks)
2. Then I will update PyTorchRL library.
3. Following I will send a small batch (20-50) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but it is always worth sending a test batch to Beta in my opinion)
4. Send again a big batch with the new code to PythonGPU.

The app will be short of tasks for a brief period of time but even though the new version of PyTorchRL does not have huge changes I don't want to risk updating it now while 3000+ tasks are still on the queue.

I will make a post once I submit the Beta tasks.


thanks abouh! looking forward to testing out the new batch.


____________

Keith Myers
Message 59775 - Posted: 19 Jan 2023 | 21:38:55 UTC - in response to Message 59762.

Can http://www.gpugrid.net/apps.php link be put next to Server status link?


I'd like to see this change in the website design also.

Would be much easier for access than having to manually edit the URL or find the one apps link in the main project JoinUs page.

Pop Piasa
Message 59780 - Posted: 22 Jan 2023 | 21:12:44 UTC - in response to Message 59762.
Last modified: 22 Jan 2023 | 21:49:18 UTC

Can http://www.gpugrid.net/apps.php link be put next to Server status link?


You might want to repost that on the wish list thread so it's there when the webmaster gets around to updating the site.

I fear they may be too busy at this time. I went ahead and put a link in my browser until then.

Thanks for posting that page link.

Ian&Steve C.
Message 59782 - Posted: 23 Jan 2023 | 19:09:05 UTC - in response to Message 59676.

Right now: ~ 14.200 "unsent" Python tasks in the queue.

I guess it will take a while until they all are processed.


now down to less than 500. these went much quicker than I anticipated. only about 3 weeks.
____________

Keith Myers
Message 59783 - Posted: 23 Jan 2023 | 20:30:42 UTC
Last modified: 23 Jan 2023 | 20:32:19 UTC

So what again is going to be the status of the expected new application?

Beta to start with?

Removal of wandb?

New nthreads value?

New job_xxx.xml file?

New compilation for Ada devices?

Ryan Munro
Message 59784 - Posted: 23 Jan 2023 | 20:43:34 UTC

Will the new app be fine on 1 CPU core or will it still require many? on my Windows box atm I have to manually allocate 24 cores to the WU so it does not get starved with other projects running at the same time.

Keith Myers
Message 59787 - Posted: 23 Jan 2023 | 22:04:11 UTC - in response to Message 59784.
Last modified: 23 Jan 2023 | 22:05:21 UTC

Pretty sure you are confusing cores with processes. The app will still spin out 32 python processes. Processes are not cores.

But from testing of the modified job.xml file, the new app will probably need as few as 4 cores/threads to run.

Ian&Steve C.
Message 59788 - Posted: 24 Jan 2023 | 4:04:42 UTC - in response to Message 59787.

There are two separate mechanisms with this app spinning up multiple processes/threads. The fix will only reduce one of them. Since each task is training 32x agents at once, those 32 processes still spin up. The fix I helped uncover only addresses the unnecessary extra CPU usage from the n-cores extra processes spinning up. I’ve been running with those capped at 4. And it seems fine.

About Ada support, since this app is not really an “app” as it’s not a compiled binary, but just a script, it works fine with Ada already according to some other users running it on their 40-series cards. It’s the Acemd3 app that needs to be recompiled for Ada.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59789 - Posted: 24 Jan 2023 | 7:51:35 UTC - in response to Message 59783.
Last modified: 24 Jan 2023 | 7:52:16 UTC

The job_xxx.xml will also remain the same, since the instructions are as simple as:

- 1. unpack the conda python environment with all required dependencies.
- 2. run the provided python script.
- 3. return result files.

So I am only changing the provided python script.

As Ian mentioned, it is not a compiled app. The only difference is that the packed conda environment contains cuda10 (10.2.89) or cuda11 (11.3.1) depending on the host GPU.

Is that enough to support ADA GPUs?
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59790 - Posted: 24 Jan 2023 | 8:10:17 UTC

Only 75 jobs in the queue! Thank you all for your support :)

I imagine they will all be processed today. So, as I mentioned in an earlier post, the next steps will be the following:

1- I will release a new version of our Reinforcement Learning library (https://github.com/PyTorchRL/pytorchrl), used in the python scripts to instantiate and train the AI agents.

2- I will send a small batch of PythonGPUBeta jobs with the new python script and also using the new version of the library.

3- If everything goes well, start sending PythonGPU tasks again.

I am interested in your feedback on whether the new script configuration helps in terms of efficiency. On my machine it seems to work fine.
____________

Ryan Munro
Message 59791 - Posted: 24 Jan 2023 | 9:25:00 UTC - in response to Message 59787.
Last modified: 24 Jan 2023 | 9:27:24 UTC

Yea it spins up that many processes but if I leave the app at default it will get choked because Boinc will only allocate 1 thread to it and the other projects running will take up the other 31 threads.
I manually allocate it 24 threads as this is about what I observed it running when I only ran that task and nothing else, this stops it from getting choked when running multiple projects.

What I would like to see is the app download and allocate however many threads it needs to complete the task automatically without needing a custom app_config file.

KAMasud
Message 59793 - Posted: 25 Jan 2023 | 5:13:08 UTC - in response to Message 59791.

Yea it spins up that many processes but if I leave the app at default it will get choked because Boinc will only allocate 1 thread to it and the other projects running will take up the other 31 threads.
I manually allocate it 24 threads as this is about what I observed it running when I only ran that task and nothing else, this stops it from getting choked when running multiple projects.

What I would like to see is the app download and allocate however many threads it needs to complete the task automatically without needing a custom app_config file.


I second that.

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59794 - Posted: 25 Jan 2023 | 11:12:06 UTC
Last modified: 25 Jan 2023 | 11:12:28 UTC

I just released the new version of the python library and sent the beta tasks.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59795 - Posted: 25 Jan 2023 | 11:36:43 UTC - in response to Message 59793.
Last modified: 25 Jan 2023 | 11:43:57 UTC

Is there any BOINC-specifiable WU parameter for that? I could not find one, but I would also like to avoid hosts having to change their configuration manually, if possible.
____________

kotenok2000
Message 59796 - Posted: 25 Jan 2023 | 13:43:09 UTC - in response to Message 59795.

Use this
<app_config>
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1131</plan_class>
    <gpu_versions>
      <cpu_usage>8</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
</app_config>
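
For hosts that want to try different cpu_usage values, a file like the one above could be generated rather than hand-edited. A minimal sketch with the standard library (the function name is illustrative; ET.indent needs Python 3.9 or newer). It leaves out the <plan_class> line, on the assumption, worth checking against the BOINC client configuration documentation, that plan classes are specified in an <app_version> section rather than inside <app>:

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name="PythonGPU", cpu_usage=8, gpu_usage=1, max_concurrent=1):
    """Build app_config.xml text with the elements used in the post above."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.SubElement(app, "fraction_done_exact")  # empty flag element
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


print(build_app_config(cpu_usage=4))
```

The output would be saved as app_config.xml in the GPUGRID project folder; as noted later in the thread, cpu_usage only tells the client how much CPU to budget for the task and does not control the application itself.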

Ryan Munro
Message 59797 - Posted: 25 Jan 2023 | 14:14:12 UTC
Last modified: 25 Jan 2023 | 14:15:17 UTC

Just grabbed one of the beta units and it still says Running (0.999 CPUs and 1 GPU) but it seems to be fluctuating between 50% and 100% load on my 32-thread CPU.
If the app is spinning up a ton of processes that need their own threads can the app reflect that and allocate however many threads are needed, please? so for example it should say "Running (32 CPUs and 1 GPU)" or however many it needs.

Would simplify things and I assume cut down on failed units from users who do not know the app spins up more than one process and run it on a single thread with other apps taking up the remainder.

Thanks

Edit after an initial 100% utilisation spike it's now settled down at around 30% - 40% CPU utilisation.

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59799 - Posted: 25 Jan 2023 | 14:23:04 UTC - in response to Message 59796.
Last modified: 25 Jan 2023 | 14:52:11 UTC

But this is on the client side.

On the server side I see I can adjust these parameters for a given app: https://boinc.berkeley.edu/trac/wiki/JobIn

I am open to implement both solutions:

1- Force from the server side that hosts have more than 1 CPU, 4-8 for example (the tasks spawn 32 python threads, but 32 CPUs are not required to run them successfully). That is, if it is possible; on the server I could not find any option to specify that so far.

2- Specify that 32 processes are being created. I can add it to the logs, but where else can I mention it so users are aware?
____________

Keith Myers
Message 59800 - Posted: 25 Jan 2023 | 19:13:40 UTC

I don't see any parameter in the jobin page that allocates the number of cpus the task will tie up.

I don't know how the cpu resource is calculated. Must be internal in BOINC.

Richard Haselgrove probably knows the answer.

It varies among projects I've noticed. I think it is calculated internally in BOINC based on client benchmarks rating and the rsc_fpops_est value the work generator assigns tasks.

The user has been able to override the project default values with their own values via the app_config mechanism.

But these values don't actually control how an app runs. Only the science app determines how much resources the task takes.

The cpu_usage value is only a way to help the client determine how many tasks can be run for scheduling purposes and how much work should be downloaded.

I'm currently running one of the beta tasks and it either runs faster or the workunit is smaller than normal. Probably the latter being beta.

I notice 3 processes running run.py on the task along with the 32 spawned processes. I don't remember the previous app spinning up more than the one run.py process.

I wonder if the 3 run.py processes are tied into my <cpu_usage>3.0</cpu_usage> setting in my app_config.xml file.

Ian&Steve C.
Message 59801 - Posted: 25 Jan 2023 | 19:34:05 UTC - in response to Message 59800.
Last modified: 25 Jan 2023 | 19:34:47 UTC



I notice 3 processes running run.py on the task along with the 32 spawned processes. I don't remember the previous app spinning up more than the one run.py process.

I wonder if the 3 run.py processes are tied into my <cpu_usage>3.0</cpu_usage> setting in my app_config.xml file.


as you said earlier in your comment, the cpu_use only tells BOINC how much is being used. it does not exert any kind of "control" over the application directly.

the previous tasks spun up a run.py child process for every core. these would be linked to the parent process. you can see them in htop.
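
For anyone without htop, a rough Linux-only sketch that finds processes whose command line mentions run.py by reading /proc. The function name and the proc_root parameter are illustrative, and the listing covers processes only, not the threads inside them:

```python
from pathlib import Path


def find_processes(needle, proc_root="/proc"):
    """Return sorted (pid, cmdline) pairs for processes whose command line contains needle."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # not a Linux-style /proc
    matches = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            matches.append((int(entry.name), cmdline))
    return sorted(matches)


if __name__ == "__main__":
    for pid, cmdline in find_processes("run.py"):
        print(pid, cmdline)
```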

I have not been able to get any of these beta tasks myself (i got some very early morning before I got up, but they errored because of my custom edits) to see what might be going on. but there might be a problem with them still, some other users that got them seem to have errored.
____________

Keith Myers
Message 59802 - Posted: 25 Jan 2023 | 21:18:12 UTC

I reset the project on all hosts prior to the release of the beta tasks to start with a clean slate.

I have one of the beta tasks running well so far. 6.5 hrs in so far at 75% completion.

GPUGRID 1.12 Python apps for GPU hosts beta (cuda1131) e00001a00027-ABOU_rnd_ppod_expand_demos29_betatest-0-1-RND7327_1 06:22:55 (15:21:33) 240.67 79.210 78d,21:06:03 1/30/2023 3:14:52 AM 0.998C + 1NV (d0) Running High P. Darksider

I looked at this task in htop and it is different than before. I am not talking about the 32 spawned python processes. I was referring to 3 separate run.py process PIDs that are using about 20% cpu each besides the main one.



I hadn't configured my app_config.xml for the PythonGPUbeta before I picked up the task so I ended up with the default 0.998C core usage value rather than my normal 3.0 cpu value I have for the regular Python on GPU tasks.

Ian&Steve C.
Message 59803 - Posted: 25 Jan 2023 | 22:04:06 UTC - in response to Message 59802.
Last modified: 25 Jan 2023 | 22:07:33 UTC

what you're showing in your screenshot is exactly what I saw before. the "green" processes are representative of the child processes. before, you would have a number of child threads in the same amount as the number of cores. on my 16-core system there would be 16 children, on the 24-core system there was 24 children, on the 64 core system there was 64 children. and so on, for each running task.

if you move the selected line by pushing the down arrow, or select one of the child processes with the cursor, you should see the top line as white text, which is the parent main process. this is all normal.

check my screenshots from this message: https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59239
____________

Richard Haselgrove
Message 59804 - Posted: 25 Jan 2023 | 22:24:59 UTC - in response to Message 59800.

I don't see any parameter in the jobin page that allocates the number of cpus the task will tie up.

I don't know how the cpu resource is calculated. Must be internal in BOINC.

Richard Haselgrove probably knows the answer.

You're right - it doesn't belong there. It will be set in the <app_version>, through the plan class - see https://boinc.berkeley.edu/trac/wiki/AppPlanSpec.

And to amplify Ian's point - not only does BOINC not control the application, it merely allows the specified amount of free resource in which to run.

Keith Myers
Message 59805 - Posted: 25 Jan 2023 | 23:12:32 UTC - in response to Message 59804.

Since these apps aren't proper BOINC MT or multi-threaded apps using a MT plan class, you wouldn't be using the <max_threads>N [M]</max_threads> parameter.

Seems like the proper parameter to use would be the <cpu_frac>x</cpu_frac> one.

Do you concur?

Keith Myers
Message 59806 - Posted: 25 Jan 2023 | 23:18:39 UTC

Bunch of the standard Python 4.03 versioned tasks have been going out and erroring out. I've had five so far today.

Problems in the main learner step with the pytorchrl packages.

https://www.gpugrid.net/result.php?resultid=33268830

Pop Piasa
Message 59807 - Posted: 26 Jan 2023 | 2:52:42 UTC - in response to Message 59806.

Maybe this might help Abou with the scripting, I'm too green at Python to know.

https://stackoverflow.com/questions/58666537/error-while-feeding-tf-dataset-to-fit-keyerror-embedding-input

KAMasud
Message 59808 - Posted: 26 Jan 2023 | 3:06:26 UTC

I was allocated two tasks of "ATM: Free energy calculations of protein-ligand binding v1.11 (cuda1121)" and both of them were cancelled by the server in transmission. What are these tasks about and why were they cancelled?

Keith Myers
Message 59809 - Posted: 26 Jan 2023 | 4:43:05 UTC - in response to Message 59808.

The researcher cancelled them because they recognized a problem with how the package was put together and the tasks would fail.

So better to cancel them in the pipeline rather than waste download bandwidth and the cruncher's resources.

You can thank them for being responsible and diligent.

Keith Myers
Message 59810 - Posted: 26 Jan 2023 | 4:49:01 UTC

I successfully ran one of the beta Python tasks after the first cruncher errored out the task.

https://www.gpugrid.net/result.php?resultid=33268305

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59811 - Posted: 26 Jan 2023 | 7:48:45 UTC - in response to Message 59800.
Last modified: 26 Jan 2023 | 8:06:21 UTC

The beta tasks were of the same size as the normal ones. So if they run faster hopefully the future PythonGPU tasks will too.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59812 - Posted: 26 Jan 2023 | 7:52:53 UTC - in response to Message 59806.

Thank you very much for pointing it out. Will look at the error this morning!
____________

Ian&Steve C.
Message 59813 - Posted: 26 Jan 2023 | 13:26:50 UTC
Last modified: 26 Jan 2023 | 14:16:29 UTC

finally got some more beta tasks and they seem to be running fine. and now limited to only 4 threads on the main run.py process.

but i did notice that VRAM use has increased by about 30%. tasks are now using more than 4GB on the GPU, before it was about 3.2GB. was that intentional?

are these beta tasks going to be the same as the new batch? beta is running fine but the small amount of new ones going out non-beta seem to be failing.
____________

KAMasud
Message 59814 - Posted: 26 Jan 2023 | 13:50:56 UTC - in response to Message 59809.

I never blamed anyone. Just asked a question for my own knowledge. Anyway, Thank you. Now I wish I could get a task.

Erich56
Message 59815 - Posted: 26 Jan 2023 | 15:26:43 UTC - in response to Message 59813.

...
but i did notice that VRAM use has increased by about 30%. tasks are now using more than 4GB on the GPU, before it was about 3.2GB. was that intentional?
...

this is definitely bad news for GPUs with 8GB VRAM, like the two RTX3070 in my case. Before, I could run 2 tasks each GPU. It became quite tight, but it worked (with some 70-100MB left on the GPU the monitor is connected to).
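
The squeeze can be put in numbers with a small sketch. The per-task figures are the approximate ones quoted above (about 3.2 GB before, a bit over 4 GB now); the function name and the reserved_mib parameter are illustrative, and real headroom also depends on what the desktop and other programs keep on the GPU:

```python
def tasks_that_fit(vram_mib, per_task_mib, reserved_mib=0):
    """Whole number of tasks that fit in VRAM after setting aside reserved_mib."""
    usable = vram_mib - reserved_mib
    if usable <= 0 or per_task_mib <= 0:
        return 0
    return usable // per_task_mib


# 8 GB card: older tasks at roughly 3.2 GB each, newer ones at a bit over 4 GB each.
print("older tasks per GPU:", tasks_that_fit(8192, 3200))
print("newer tasks per GPU:", tasks_that_fit(8192, 4200))
```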

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59816 - Posted: 26 Jan 2023 | 15:37:07 UTC
Last modified: 26 Jan 2023 | 15:40:53 UTC

Yes, these latest beta tasks use a little bit more GPU memory. The AI agent has a bigger neural network. Hope it is not too big and most machines can still handle it.

What about number of threads? Is it any better?

I also fixed the problems with the non-beta (queue was empty but I guess some failed jobs were added again to it after the new software version was released). Let me know if more errors occur please.
____________

Ian&Steve C.
Message 59817 - Posted: 26 Jan 2023 | 16:09:33 UTC - in response to Message 59816.
Last modified: 26 Jan 2023 | 16:38:39 UTC

i have 4 of the beta tasks running. the number of threads looks good. using 4 threads per task as specified in the run.py script.

i just got an older non-beta task resend, and it's working fine so far (after I manually edited the threads again).

but the setup with beta tasks seems viable to push out to non-beta now.

about VRAM use. so far, it seems when they first get going, they are using about 3800MB, but rises over time. at about 50% run completion the tasks are up to ~4600MB each. not sure how high they will go. the old tasks did not have this increasing VRAM as the task progressed behavior. is it necessary? or is it leaking and not cleaning up?
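
To track this kind of growth over a run, per-process VRAM can be sampled with nvidia-smi. A minimal sketch, assuming nvidia-smi is on the PATH and that its --query-compute-apps=pid,used_memory query with --format=csv,noheader,nounits prints one "pid, MiB" line per compute process; the function names are illustrative:

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' lines into {pid: MiB}; lines that do not parse are skipped."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            usage[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # for example a field reported as not available
    return usage


def sample_vram():
    """Run nvidia-smi once and return the current {pid: MiB} mapping."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling sample_vram() every few minutes and logging the result alongside the task's progress would show whether usage climbs steadily or levels off.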
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59819 - Posted: 27 Jan 2023 | 8:47:47 UTC - in response to Message 59817.
Last modified: 27 Jan 2023 | 9:03:35 UTC

Great! very helpful feedback Ian thanks.

Since the scripts seem to run correctly I will start sending tasks to PythonGPU app with the current python script version.

In parallel, I will look into the VRAM increase running a few more tests in PythonGPUbeta. I don't think it is a memory leak but maybe there is a way to use memory more efficiently in the code. I will dig a bit into it. Will post updates on that.
____________

Ryan Munro
Message 59820 - Posted: 27 Jan 2023 | 10:11:43 UTC

Got one of the new Betas, it's using about 28% average of my 16core 5950x in Windows 11 so roughly 9 threads?
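
The estimate checks out if utilisation is averaged over logical CPUs. A small sketch of the arithmetic, assuming two hardware threads per core (the 5950X has 16 cores and 32 threads); the function name is illustrative:

```python
def busy_threads(cpu_utilisation_percent, physical_cores, threads_per_core=2):
    """Convert an average utilisation percentage into an equivalent number of busy logical CPUs."""
    logical_cpus = physical_cores * threads_per_core
    return cpu_utilisation_percent / 100.0 * logical_cpus


# 28% of 32 logical CPUs is 8.96, i.e. roughly 9 threads' worth of work.
print(round(busy_threads(28, 16)))
```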

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59821 - Posted: 27 Jan 2023 | 10:23:36 UTC - in response to Message 59820.

The scripts still spawn 32 python threads. But I think before with wandb and maybe without fixing some environ variables even more were spawned.

However, note that 32 cores are not necessary to run the scripts. I am not sure what the optimal number is, but it is much lower than 32.
____________

Ryan Munro
Message 59822 - Posted: 27 Jan 2023 | 11:50:54 UTC

Yea it definitely uses less overall CPU time than before, capped the apps at 10 cores now which seems like the sweet spot to allow me to also run other apps.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Scientific publications
watwat
Message 59823 - Posted: 27 Jan 2023 | 15:41:59 UTC
Last modified: 27 Jan 2023 | 15:44:27 UTC

task 33269102

Eight times and it was a success. Does someone want to Post Mortem it?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Scientific publications
wat
Message 59824 - Posted: 27 Jan 2023 | 20:54:29 UTC - in response to Message 59819.

Great! Very helpful feedback, Ian, thanks.

Since the scripts seem to run correctly, I will start sending tasks to the PythonGPU app with the current python script version.

In parallel, I will look into the VRAM increase by running a few more tests in PythonGPUbeta. I don't think it is a memory leak, but maybe there is a way to use memory more efficiently in the code. I will dig a bit into it and post updates on that.


seeing up to 5.6GB VRAM use per task. but it doesn't seem consistent. some tasks will go up to ~4.8, others 4.5, etc. there doesn't seem to be a clear pattern to it.

the previous tasks were very consistent and always used exactly the same amount of VRAM.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59825 - Posted: 28 Jan 2023 | 13:19:44 UTC

yesterday I downloaded and started 2 Pythons on my box with the Intel Xeon E5 2667v4 (2 CPUs) and the Quadro P5000 inside.

What I realized after some time was that the progress bars in the BOINC manager diverged more and more.
Finally, one task got finished after 24 hrs + a few minutes (how nice, thus missing the <24 hrs bonus), the other task now is at 29.920%.
What I notice now, with only this one task running, is: no GPU utilization, just CPU.
Any idea how come?
I guess this task is invalid and I should abort it, right?

BTW: with the other task which worked fine I could not see any increasing VRAM usage. It stayed at about 3.5GB all time long.

Ian&Steve C.
Message 59826 - Posted: 28 Jan 2023 | 15:00:20 UTC - in response to Message 59825.
Last modified: 28 Jan 2023 | 15:03:47 UTC

Old low(er) VRAM use tasks are still going out.

The old tasks have “test” in the WU name, and have the same VRAM use, and high CPU use as before.

The new tasks have “exp” in the name, have less CPU used, but more VRAM.

And the new windows app could be acting differently than the Linux version.
____________

Erich56
Message 59827 - Posted: 28 Jan 2023 | 15:32:20 UTC

thanks for the hint regarding "old" and "new" tasks.

The 2 which I downloaded yesterday were "new" ones with "exp" in the name.

Right now, I have 4 new ones running in parallel (I was surprised that they were downloaded while the server status page has been showing "0 unsent" for quite a while).
According to the Windows task manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bar after some more time (while at least one looks suspicious at this time).

VRAM use at this point is 9,180 MB (including whatever the monitor uses).

Erich56
Message 59828 - Posted: 28 Jan 2023 | 15:54:00 UTC - in response to Message 59827.

According to the Windows task manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bar after some more time (while at least one looks suspicious at this time).

is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?

Ian&Steve C.
Message 59829 - Posted: 28 Jan 2023 | 16:12:24 UTC

as a reference, this is what it's looking like running 3 tasks on 4x A4000s. a good amount of variance in VRAM use. not consistent and I'm not sure if it increases over time, or some tasks just require more than others. but definitely more than before and different behavior than before.



____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59830 - Posted: 28 Jan 2023 | 17:42:19 UTC - in response to Message 59828.


is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?

Yes, use nvidia-smi which is installed by the Nvidia drivers.

It is located here in Windows.

C:\Program Files\NVIDIA Corporation\NVSMI

Just open a command window and navigate there and execute:

nvidia-smi
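
For a scripted check, nvidia-smi can also print machine-readable CSV through its --query-gpu and --format options. The following is only a minimal Python sketch, assuming nvidia-smi is on the PATH (otherwise the full path to the executable can be passed in); the function names parse_gpu_csv and query_gpus are made up for illustration.

```python
import subprocess

FIELDS = "index,name,utilization.gpu,memory.used,memory.total"


def _to_int(value):
    """Convert a numeric field; nvidia-smi prints e.g. [N/A] when unsupported."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, name, util, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "util_percent": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def query_gpus(exe="nvidia-smi"):
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(
        [exe, f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling query_gpus() a few times while the tasks run should show whether GPU utilization and memory use are above idle levels. These readings are per GPU, not per task, so they are most informative with one task running at a time.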

Ian&Steve C.
Message 59831 - Posted: 28 Jan 2023 | 18:43:54 UTC - in response to Message 59830.
Last modified: 28 Jan 2023 | 18:46:38 UTC


is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?

Yes, use nvidia-smi which is installed by the Nvidia drivers.

It is located here in Windows.

C:\Program Files\NVIDIA Corporation\NVSMI

Just open a command window and navigate there and execute:

nvidia-smi


he might look here too. your location is reported to be on older installs.

C:\Windows\System32\DriverStore\FileRepository\nvdm*\nvidia-smi.exe

i think he needs to include the extension. but yes.

nvidia-smi.exe
____________

Erich56
Message 59832 - Posted: 28 Jan 2023 | 20:04:37 UTC

thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI

However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.

But anyway, having been able to watch the progress bar in the BOINC Manager, by now I can clearly tell the following:

just to explain how I started out with the Pythons last year when they were introduced:
I spoofed the GPU which gave me the ability to run 4 Pythons simultaneously.
With the hardware:
Intel Xeon E5 2667v4 (2 CPUs) and the Quadro P5000 (16GB VRAM) and 256GB system RAM
this was performing fine, over all the months.

Now, when running 4 tasks simultaneously, I notice that the 2 tasks running on "device 0" are about 3 times faster than the 2 tasks running on "device 1".
Which seems to indicate very clearly that the 2 tasks on "device 1" are NOT utilizing the GPU.

Since I made no changes to the hardware, the software, or any relevant settings compared with before, the reason for this behaviour must be related to the code of the new Pythons :-( All 4 tasks are "new" ones.
Or does anyone have any other ideas?

BTW: on the other host with the 2 RTX3070 inside, so far I got downloaded and started 3 Pythons, however they are from the "old" series. And all three are running with the same speed, i.e. utilizing the GPUs.

Erich56
Message 59833 - Posted: 28 Jan 2023 | 20:17:30 UTC

further on my posting above:

I just want to point out that the same problem exists even if only 2 of the new Pythons are crunched simultaneously (one on "device 0", the other on "device 1") - see my posting here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59825

Keith Myers
Message 59834 - Posted: 28 Jan 2023 | 21:07:22 UTC - in response to Message 59832.

thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI

However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.


The access denied is obviously a permission issue. I don't know how to view the properties of a file in Windows. Maybe right-click? Does that show you who "owns" the file?

Windows probably has the same ownership options or close enough to Linux where a file has permissions at the system level, the group level and the user level.

Maybe the Windows version of nvidia-smi.exe belongs to a Nvidia group of which the local user is not a member. Maybe investigate adding the user to the Nvidia group to see if that changes whether the file can be executed.

Erich56
Message 59835 - Posted: 28 Jan 2023 | 21:20:00 UTC

thank you, Keith, for your reply re the Nvidia-SMI. I will investigate further tomorrow.

However, by now, looking at the progress bars, it seems evident enough that the new Python version has a problem with spoofed GPUs. Either by design or by accident. Maybe abouh can tell more.

Erich56
Message 59836 - Posted: 29 Jan 2023 | 6:48:49 UTC

for the time being, I excluded "device 1" from GPUGRID via setting in the cc_config.xml
So, when downloading Python tasks next time, only 2 should come in and be processed by "device 0" (with app_config.xml setting "0.5 GPU usage").

Further, I guess I could not process 4 tasks (of the new type) simultaneously anyway, as I can see from the currently running 2 tasks that they are using 12,367 MB VRAM. So not even 1 additional task would work, with the GPU having 16 GB VRAM.

On the other host with the 2 RTX3070 (8 GB VRAM ea.) on which I ran 4 tasks in parallel before, I will now have the problem that only 1 task per GPU can be processed, due to the higher VRAM need. Which is a pity :-(
And I guess even GPUs with 12 GB VRAM may NOT be able to process 2 new Pythons simultaneously.

Erich56
Message 59837 - Posted: 29 Jan 2023 | 7:56:30 UTC

this is one of the Pythons which had only CPU utilization, but NOT GPU utilization. So I aborted it after several hours.

https://www.gpugrid.net/result.php?resultid=33271058

Does the stderr show by any chance why the GPU was not utilized?

abouh
Message 59838 - Posted: 30 Jan 2023 | 14:05:14 UTC - in response to Message 59835.

Hello Erich,

By design, an environment variable defines which GPU the task is supposed to use (in run.py line 429):

os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["GPU_DEVICE_NUM"]


Then, the PyTorchRL package tries to detect that specified GPU, and otherwise uses the CPU. So if no GPU is detected, what you describe can happen: the CPU is used instead and task progress becomes much slower.

What I can do is add an additional logging message in the run.py scripts that will display whether or not the GPU device was detected. So we will know for sure.
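
The selection and fallback logic described above can be sketched roughly as follows. This is not the actual run.py or PyTorchRL code, only an illustration under stated assumptions: select_device is a hypothetical helper, and the availability check is passed in as a callable (in practice something like torch.cuda.is_available) so the sketch runs without PyTorch installed.

```python
import os


def select_device(cuda_available, env=os.environ):
    """Pin the task to the GPU named by GPU_DEVICE_NUM, else fall back to CPU.

    `cuda_available` is a zero-argument callable returning True or False.
    `env` defaults to the real process environment; a dict works for testing.
    """
    gpu_num = env.get("GPU_DEVICE_NUM")
    if gpu_num is not None:
        # Must happen before CUDA is initialised, or it has no effect.
        env["CUDA_VISIBLE_DEVICES"] = gpu_num
    # With a single visible device, CUDA numbers it 0 inside the process.
    return "cuda:0" if cuda_available() else "cpu"


# Hypothetical example: the requested device cannot be found by CUDA.
print(select_device(lambda: False, {"GPU_DEVICE_NUM": "1"}))  # cpu
```

One possible reading of the reports in this thread, offered as an assumption only: if the variable points at a device index that CUDA cannot see, the visible device list is empty and the task silently ends up on the CPU.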

Furthermore, I have found a way to reduce the GPU memory requirements at least a bit. I will start using it in the newly submitted tasks.
____________

Ian&Steve C.
Message 59839 - Posted: 30 Jan 2023 | 14:15:33 UTC - in response to Message 59838.

Thanks abouh!
____________

Erich56
Message 59840 - Posted: 30 Jan 2023 | 14:21:52 UTC - in response to Message 59838.

hello abouh,

thanks for your quick reply.

So, as it seems, the situation is such that the tasks from the new Python version detect the "real" GPU ("device 0") but do NOT detect the spoofed GPU ("device 1"), for whatever reason.

In the former Python version, both GPUs were detected without any problem.

However, I now found a workaround which also works well:

I excluded "device 1" in the cc_config.xml of BOINC, and I set the GPU usage to "0.3" in the app_config.xml of GPUGRID.
This makes it possible to run 3 Pythons simultaneously. In theory, I could run even 4 tasks by setting the GPU usage to "0.25", but from what I can see now, with 3 tasks running, the VRAM is filled with 16,307 MB out of a VRAM size of 16,384 MB.
The progress of the 3 tasks at this moment is 38%, 24% and 22% (they were downloaded at different times). So I can only hope that VRAM utilization will not increase any more in the course of task processing.
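
The arithmetic behind this can be written down as a small Python sketch. It assumes that BOINC co-schedules tasks on one GPU as long as their GPU usage shares add up to at most 1, and that VRAM use is split roughly evenly between tasks; the function names are made up for illustration.

```python
def max_tasks_by_gpu_usage(gpu_usage):
    """Tasks that fit on one GPU if their usage shares may sum to at most 1."""
    # Small epsilon guards against floating-point results like 3.9999999.
    return int(1.0 / gpu_usage + 1e-9)


def max_tasks_by_vram(vram_total_mb, vram_per_task_mb):
    """Tasks that fit in VRAM if each needs about `vram_per_task_mb`."""
    return vram_total_mb // vram_per_task_mb


# Figures from this post: 3 tasks occupy 16,307 MB of 16,384 MB.
per_task_mb = 16307 // 3                       # about 5,435 MB per task
print(max_tasks_by_gpu_usage(0.3))             # 3
print(max_tasks_by_gpu_usage(0.25))            # 4
print(max_tasks_by_vram(16384, per_task_mb))   # 3
```

So a usage value of 0.25 would let BOINC start a fourth task, but at the VRAM level observed here only three would fit on a 16 GB card.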

On another host, I have two Pythons running in parallel, with total VRAM use 6,125 MB out of 6,144 MB available :-)

So, if you mention that you found a way to reduce VRAM requirements a little bit, this will definitely help :-)

Keith Myers
Message 59841 - Posted: 30 Jan 2023 | 21:20:31 UTC

Abouh,

This task series and all its brethren have a configuration error and they all are failing very fast.

https://www.gpugrid.net/result.php?resultid=33273094

I've chalked up over 40 errors today and all the wingmen are failing the series in the same way.

File "run.py", line 97, in main
demo_dtypes={prl.OBS: np.uint8, prl.ACT: np.int8, prl.REW: np.float16, "StateEmbeddings": np.int16},
TypeError: create_factory() got an unexpected keyword argument 'state_embeds_to_avoid'

Ian&Steve C.
Message 59842 - Posted: 30 Jan 2023 | 21:56:05 UTC - in response to Message 59841.
Last modified: 30 Jan 2023 | 21:58:25 UTC

Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.

you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.

i have many of the new tasks running just fine.

and memory use is improved. thanks abouh.
____________

Keith Myers
Message 59843 - Posted: 31 Jan 2023 | 2:05:13 UTC - in response to Message 59842.

Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.

you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.

i have many of the new tasks running just fine.

and memory use is improved. thanks abouh.

Nope. Absolutely NOT the case. The run.py is the one provided by the project.

Look at the link I provided, every other wingman is failing the task also. Along with all the other failed tasks.

I'm damn sure I reset the project. Resetting again.

Ian&Steve C.
Message 59844 - Posted: 31 Jan 2023 | 3:35:41 UTC - in response to Message 59843.
Last modified: 31 Jan 2023 | 3:40:08 UTC

Your stderr output from your failed task in your link clearly indicated that it copied the run.py file. Or was still trying to.

13:00:27 (3925992): wrapper: running /bin/cp (/home/keith/Desktop/BOINC/projects/www.gpugrid.net/run.py run.py)
13:00:28 (3925992): /bin/cp exited; CPU time 0.000962

The only way it would be doing that is because you're still running my edited file; a project reset would have erased that and replaced it with the standard version.

The other hosts that failed, failed for different reasons. They got unlucky and hit hosts with incompatible GPUs.
____________

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59845 - Posted: 31 Jan 2023 | 4:13:08 UTC

Is there any way to reduce the estimated remaining time showing in the manager on these?

I'm seeing 20+ days left when there is really 10 hours and I can't download the next task until the previous is well past 90%. That's around an hour or so in advance as they are finishing in 9-12 hrs. on my hosts.

Setting my task queue longer only gets me the server message:

Tasks won't finish in time: BOINC runs 100.0% of the time; computation is enabled 99.9% of that


It appears that the server sets completion times based on the average among completed WU run times. Seeing that Pythons misreport the run time (which must be equal to or greater than the CPU time) it is logical that the estimated future completion times would reflect the inflated CPU time figures.

Is there a local manager config fix for that, anyone? Multi gratis

abouh
Message 59846 - Posted: 31 Jan 2023 | 8:47:42 UTC - in response to Message 59843.
Last modified: 31 Jan 2023 | 8:51:36 UTC

Hello Keith,

I also think that for some reason your machine ran the old run.py file. Maybe the error was on the server side and the old script was provided for some reason, but I went and checked the files in task e00004a01419-ABOU_rnd_ppod_expand_demos29_exp1-0-1-RND6419_2 and the error should not be present.

Also, as I mentioned in a previous post, I added an extra log message to check whether a GPU is detected:

sys.stderr.write(f"Detected GPUs: {gpus_detected}\n")


Which is only printed with the new runs but not in the one you shared. For example, in this one:

https://www.gpugrid.net/result.php?resultid=33273691
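
Both clues discussed in this thread (the "Detected GPUs:" log line and the wrapper's "/bin/cp ... run.py" line quoted from the failed task) can be looked for mechanically in a saved stderr text. A minimal Python sketch, assuming the two lines look exactly as quoted in this thread; inspect_stderr is a hypothetical helper, not part of any project tooling.

```python
import re

# Line written by the newer run.py, as quoted in this thread.
GPU_LINE = re.compile(r"^Detected GPUs:\s*(.+)$", re.MULTILINE)
# Wrapper line showing that a local run.py was copied over the shipped one.
CP_LINE = re.compile(r"wrapper: running /bin/cp \((\S+) run\.py\)")


def inspect_stderr(text):
    """Return (gpus_detected, copied_run_py_path); either may be None."""
    gpu = GPU_LINE.search(text)
    cp = CP_LINE.search(text)
    return (gpu.group(1).strip() if gpu else None,
            cp.group(1) if cp else None)
```

A result without the first value would suggest the task ran an older script, and a non-empty second value would show where a replacement run.py was copied from.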
____________

KAMasud
Message 59847 - Posted: 31 Jan 2023 | 13:05:00 UTC - in response to Message 59845.

Is there any way to reduce the estimated remaining time showing in the manager on these?

I'm seeing 20+ days left when there is really 10 hours and I can't download the next task until the previous is well past 90%. That's around an hour or so in advance as they are finishing in 9-12 hrs. on my hosts.

Setting my task queue longer only gets me the server message:
Tasks won't finish in time: BOINC runs 100.0% of the time; computation is enabled 99.9% of that


It appears that the server sets completion times based on the average among completed WU run times. Seeing that Pythons misreport the run time (which must be equal to or greater than the CPU time) it is logical that the estimated future completion times would reflect the inflated CPU time figures.

Is there a local manager config fix for that, anyone? Multi gratis

___________________

I will agree with Pop. The same thing is going on, on my machine.

Ian&Steve C.
Message 59848 - Posted: 31 Jan 2023 | 15:03:25 UTC

abouh, are you planning to release another large batch of the new tasks?
____________

abouh
Message 59849 - Posted: 31 Jan 2023 | 16:51:19 UTC - in response to Message 59848.

Yes! The experiment I am currently running has a population of 1000 (so it keeps the number of submitted tasks at 1000 by sending a new task every time one ends, until a certain global goal is reached).
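
As a rough illustration of that replenishment scheme (not the project's actual server code), the loop could look like the Python sketch below. The callbacks submit, wait_for_one and goal_reached are hypothetical placeholders for whatever the real infrastructure provides.

```python
def run_population(population, goal_reached, submit, wait_for_one):
    """Keep `population` tasks in flight until `goal_reached(results)` is true.

    submit() returns a task handle; wait_for_one(in_flight) blocks until one
    task ends and returns (handle, result). Once the goal is reached, no new
    tasks are sent and the remaining ones are simply allowed to finish.
    """
    in_flight = {submit() for _ in range(population)}
    results = []
    while in_flight:
        handle, result = wait_for_one(in_flight)
        in_flight.discard(handle)
        results.append(result)
        if not goal_reached(results):
            in_flight.add(submit())  # replace the task that just ended
    return results
```

With a population of 1000, the number of unsent plus running tasks would stay at 1000 until the goal is met, then drain to zero.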

I am planning to start another 1000 agent experiment, probably tomorrow.


____________

abouh
Message 59850 - Posted: 31 Jan 2023 | 16:54:30 UTC - in response to Message 59847.

Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate at what time a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.
____________

Pop Piasa
Message 59851 - Posted: 31 Jan 2023 | 20:21:26 UTC - in response to Message 59849.
Last modified: 31 Jan 2023 | 20:21:52 UTC

Thanks abouh,

When there is a steady flow of tasks from Grosso the window of time a host needs to secure a new WU and maintain constant production shrinks drastically.

There is no need then to try to download the next task until the present task is almost finished, so the estimate of time remaining becomes of little concern to me.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59852 - Posted: 31 Jan 2023 | 22:21:08 UTC - in response to Message 59850.

Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate at what time a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.

And that can be easily utilised by setting

fraction_done_exact
if set, base estimates of remaining time solely on the fraction done reported by the app.

in app_config.xml. It wobbles a bit at the beginning, but soon settles down.
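
For anyone who prefers to generate the file rather than type it, here is a minimal Python sketch that builds such an app_config.xml. The app name "PythonGPU" is an assumption taken from how the app is referred to in this thread; the exact short name should be checked on the host (for example in client_state.xml) before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name):
    """Return app_config.xml text enabling <fraction_done_exact/> for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "fraction_done_exact")  # empty flag element
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# "PythonGPU" is assumed here, not confirmed as the real short app name.
print(build_app_config("PythonGPU"))
```

The printed text would be saved as app_config.xml in the project folder and picked up after the client re-reads its config files.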

KAMasud
Message 59853 - Posted: 1 Feb 2023 | 1:22:03 UTC - in response to Message 59852.
Last modified: 1 Feb 2023 | 1:24:36 UTC

Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate when a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.

And that can be easily utilised by setting

fraction_done_exact
if set, base estimates of remaining time solely on the fraction done reported by the app.

in app_config.xml. It wobbles a bit at the beginning but soon settles down.

______________

Richard, could you point me in the direction of app_config.xml. Where is it? Second, are we not playing around a bit too much? No other project requires us to play. Unless we are up to mischief and try to run multiple WU's at the same time and when they start crashing, blame others.

Richard Haselgrove
Message 59854 - Posted: 1 Feb 2023 | 8:18:36 UTC - in response to Message 59853.

It's documented in the User manual, specifically at:

https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration

KAMasud
Message 59855 - Posted: 1 Feb 2023 | 11:45:10 UTC - in response to Message 59854.

It's documented in the User manual, specifically at:

https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration

_____________
That I can find but not on my computer unless it is hidden.

Erich56
Message 59856 - Posted: 1 Feb 2023 | 13:50:54 UTC - in response to Message 59855.
Last modified: 1 Feb 2023 | 14:20:11 UTC

_____________
That I can find but not on my computer unless it is hidden.

the app_config.xml is not there automatically, that's why you won't find it.
You need to write it yourself e.g. by the Editor or Notepad, then save it as "app_config.xml" in the GPUGRID project folder within the BOINC folder (contained in the ProgramData folder).

In order to put the app_config.xml into effect, after having done the above mentioned steps, you need to open the "Options" tab in the BOINC manager and click "read config files" once.

KAMasud
Message 59857 - Posted: 1 Feb 2023 | 17:01:53 UTC - in response to Message 59856.

_____________
That I can find but not on my computer unless it is hidden.

the app_config.xml is not there automatically, that's why you won't find it.
You need to write it yourself e.g. by the Editor or Notepad, then save it as "app_config.xml" in the GPUGRID project folder within the BOINC folder (contained in the ProgramData folder).

In order to put the app_config.xml into effect, after having done the above mentioned steps, you need to open the "Options" tab in the BOINC manager and click "read config files" once.

_______________________________
Is this Boinc in only one place?
OS(C)
Program files
Boinc
Locale
Skins
Boinc
boinc_logo_black
Boinccmd
Boincmgr
Boincscr
boincsvcctrl
boinctray
ca-bundle
COPYING
COPYRIGHT
liberationSans-Regular.
This is all I can find in the Boinc folder. no GPUGrid folder or any other project folder.
Unless, Boinc is in two places like in the old days.

Pop Piasa
Message 59858 - Posted: 1 Feb 2023 | 18:09:47 UTC - in response to Message 59854.

Thank you Richard Haselgrove,

It's documented in the User manual, specifically at:

https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration


I was unaware of that info site. Will cure my ignorance.
Many thanks.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

Erich56
Message 59859 - Posted: 1 Feb 2023 | 18:30:55 UTC - in response to Message 59857.


_______________________________
Is this Boinc in only one place?
OS(C)
Program files
Boinc
.,..
This is all I can find in the Boinc folder. no GPUGrid folder or any other project folder.
Unless, Boinc is in two places like in the old days.

You are in the wrong folder.
BOINC still is in two places.
You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID

Pop Piasa
Message 59860 - Posted: 2 Feb 2023 | 1:25:37 UTC - in response to Message 59857.
Last modified: 2 Feb 2023 | 1:43:45 UTC

Hi KAMasud,

The folder mentioned by Erich56 here...

You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID


...is a hidden folder in windows so you must choose to show hidden folders in the file preferences to access it. (just in case you might not know that and can't see it in the program manager)

[Edit] The target folder for the app_config.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net
on my hosts

Hope that helped.

Erich56
Message 59862 - Posted: 2 Feb 2023 | 4:43:19 UTC - in response to Message 59860.

[Edit] The target folder for the app_config.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net

yes, this is true. Sorry for the confusion.

However, on my system (Windows 10 Pro) this folder is NOT a hidden folder.

KAMasud
Message 59863 - Posted: 2 Feb 2023 | 6:09:36 UTC - in response to Message 59860.

Hi KAMasud,

The folder mentioned by Erich56 here...
You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID


...is a hidden folder in windows so you must choose to show hidden folders in the file preferences to access it. (just in case you might not know that and can't see it in the program manager)

[Edit] The target folder for the app_config.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net
on my hosts

Hope that helped.

____________________

Thank you, Pop. After this update from Microsoft, Windows has become ____. Needs Administrator privileges for everything. Even though it is my private computer. Yesterday, I marked show hidden folders it promptly hid them back. Today I unhid them and told it "stay", good doggy. I found the second folder in which I found the projects folders. Thank you everyone.
Fat32 was better in some ways. I have done what you all wanted me to do but years ago. Maybe two decades ago.
Erich, Richard, thank you.
Apple products are user repair unfriendly. Laptops are becoming repair unfriendly and now, Microsoft, is going the same way.

Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2363
Credit: 16,525,646,203
RAC: 3,203,477
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59900 - Posted: 11 Feb 2023 | 9:52:14 UTC

I've been away from GPUGrid for a while...
Is there a way to control the number of spawned threads?
I've tried to modify the line:

<setenv>NTHREADS=$NTHREADS</setenv>
in linux_job.###########.xml file to
<setenv>NTHREADS=8</setenv>
but it made no difference.
The task was started with the original NTHREADS setting.
Is that the reason for no change in the number of spawned threads, or should I modify something else?

Ian&Steve C.
Message 59901 - Posted: 11 Feb 2023 | 13:28:59 UTC - in response to Message 59900.

I've been away from GPUGrid for a while...
Is there a way to control the number of spawned threads?
I've tried to modify the line:
<setenv>NTHREADS=$NTHREADS</setenv>
in linux_job.###########.xml file to
<setenv>NTHREADS=8</setenv>
but it made no difference.
The task was started with the original NTHREADS setting.
Is that the reason for no change in the number of spawned threads, or should I modify something else?


there is no reason to do this anymore. they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use less agents, but i don't think he does at this time.

if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.
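
As a rough illustration of the advice above, the following Python sketch writes out an app_config.xml body that reserves extra CPU threads for each Python task. The app name PythonGPU and the cpu_usage value of 8 come from this thread; the gpu_usage value of 1 and the helper name build_app_config are assumptions made only for the example.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cpu_usage: float, gpu_usage: float) -> str:
    """Return an app_config.xml body reserving cpu_usage threads per task.

    build_app_config is a hypothetical helper for this example, not part of BOINC.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{gpu_usage:g}"
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:g}"
    # ET.indent only exists on Python 3.9+, so pretty-printing is optional.
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# cpu_usage=8 follows the "between 8-10" suggestion; gpu_usage=1 is assumed.
print(build_app_config("PythonGPU", cpu_usage=8, gpu_usage=1))
```

The printed text would go into app_config.xml in the GPUGrid project folder, after which BOINC has to re-read its config files for the change to take effect.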
____________

Pop Piasa
Message 59902 - Posted: 11 Feb 2023 | 16:25:16 UTC

Good to see Zoltan here again, welcome back!😀
~~~~~~~~~~~~

I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in windows I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble.

Pop Piasa
Message 59903 - Posted: 11 Feb 2023 | 21:10:57 UTC
Last modified: 11 Feb 2023 | 21:35:20 UTC

they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use less agents, but i don't think he does at this time.


I am enjoying watching abouh gain prowess at scripting with each run, using less and less resources as they evolve. Real progress. Godspeed to abouh and crew.

Profile Retvari Zoltan
Message 59904 - Posted: 11 Feb 2023 | 23:36:24 UTC - in response to Message 59901.

Is there a way to control the number of spawned threads?
there is no reason to do this anymore.
My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager" the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads).
Could the GPU usage be increased somehow?

it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents.
there is no way to reduce that ...
I confirm that. I looked into that script, though I'm not very familiar with python. I've even tried to modify the num_env_processes in conf.yaml, but this file gets overwritten every time I restart the task, even though I removed the rights of the boinc user and the boinc group to write that file. :)

if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.
That's clear; I did that.

KAMasud
Message 59905 - Posted: 12 Feb 2023 | 1:05:32 UTC

Good to see you Zoltan.

KAMasud
Message 59906 - Posted: 12 Feb 2023 | 1:10:21 UTC - in response to Message 59902.

Good to see Zoltan here again, welcome back!😀
~~~~~~~~~~~~

I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in windows, I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble.



Pop, there used to be two Program folders as I remember. Program and Program 32. Now there is a hidden Program System folder. Three in all.

Ian&Steve C.
Message 59908 - Posted: 12 Feb 2023 | 1:18:15 UTC - in response to Message 59904.
Last modified: 12 Feb 2023 | 1:20:40 UTC

Is there a way to control the number of spawned threads?
there is no reason to do this anymore.
My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager" the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads).
Could the GPU usage be increased somehow?


If you need the heat output of the GPU, then you need to run a different project. Or only run ACEMD3 tasks when they are available. You will not get it from the Python tasks in their current state.

You can increase the GPU use by adding more tasks concurrently. But not to the extent that you expect or need. I run 4x tasks on my A4000s but they still don’t even have full utilization. Usually only like 40% and ~100W avg power draw. Two tasks aren’t gonna cut it for increasing utilization by any substantial amount.
____________

Erich56
Message 59909 - Posted: 12 Feb 2023 | 6:36:52 UTC - in response to Message 59905.

Good to see you Zoltan.

+1

Pop Piasa
Message 59910 - Posted: 12 Feb 2023 | 22:48:44 UTC - in response to Message 59904.

...I need the full heat output of my GPUs to heat our apartment...


It's been a bit chilly in my basement "computer lab/mancave" running these this winter, but I'm saving power($) so I'm bearing it. I just hope they last into summer so I can stay cool here in the humid Mississippi river valley of Illinois.

I've had some success running Einstein GPU tasks concurrently with Pythons and saw full GPU usage, although there is of course a longer completion time for both tasks.

Profile Retvari Zoltan
Message 59912 - Posted: 13 Feb 2023 | 1:29:20 UTC - in response to Message 59908.
Last modified: 13 Feb 2023 | 1:32:48 UTC

If you need the heat output of the GPU, then you need to run a different project.
I came to that conclusion, again.

Or only run ACEMD3 tasks when they are available.
I caught 2 or 3, that's why I put 3 hosts back on GPUGrid.

You will not get it [the full GPU heat output] from the Python tasks in their current state.
That's regrettable, but it could be ok for me this spring.

My main issue with the python app is that I think there's no point in running that many spawned (training) threads, as their total (combined) memory access operations cause a massive amount of CPU L3 cache misses, hindering each other's performance.
Before I put my i9-12900F host back on GPUGrid, I ran 7 TN-Grid tasks + 1 FAH GPU task simultaneously on that host, and the average processing time was 4080-4200 sec for the TN-Grid tasks.
Now I run 1 GPUGrid task + 1 TN-Grid task simultaneously, and the processing time of the TN-Grid task went up to 4660-4770 sec. Compared to the 6 other TN-Grid tasks plus a FAH task, the GPUGrid python task causes a 14% performance loss.
You can see the change in processing times for yourself here.
If I run only 1 TN-Grid task (no GPU tasks) on that host, the processing time is 3800 seconds. Compared to that, running a GPUGrid python task causes a 22% performance loss.
Perhaps this app should do a short benchmark of the given CPU it's actually running on to establish the ideal number of training threads, or give some control of that number for the advanced users like me :) to do that benchmarking of their respective systems.
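
The percentages above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic on the timings quoted in this post; pairing the low end of one range with the low end of the other (and high with high) is an assumption, and slowdown_percent is a name invented for the example. Under that pairing the first comparison comes out near 14%, and the second lands between roughly 22.6% and 25.5%, so 22% is close to the lower bound.

```python
def slowdown_percent(baseline_s: float, observed_s: float) -> float:
    """Relative increase in run time, in percent of the baseline."""
    return (observed_s - baseline_s) / baseline_s * 100.0


# Timings quoted in the post, in seconds per TN-Grid task.
with_python_task = (4660, 4770)   # 1 GPUGrid Python task + 1 TN-Grid task
busy_baseline = (4080, 4200)      # 7 TN-Grid tasks + 1 FAH GPU task
idle_baseline = 3800              # 1 TN-Grid task alone, no GPU task

# Assumption: compare low end with low end and high end with high end.
for base, observed in zip(busy_baseline, with_python_task):
    print(f"vs busy host: {slowdown_percent(base, observed):.1f}%")

for observed in with_python_task:
    print(f"vs idle host: {slowdown_percent(idle_baseline, observed):.1f}%")
```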

Ian&Steve C.
Message 59913 - Posted: 13 Feb 2023 | 1:43:43 UTC - in response to Message 59912.

I don't think you understand what the intention of the researcher is here. he wants 32 agents and the whole experiment is designed around 32 agents. and agent training happens on the CPU, so each agent needs its own process. you can't just arbitrarily reduce this number without the researcher making the change for everyone. it would fundamentally change the research. you could only reduce the number of agents with a new/different experiment.

or make MASSIVE changes to the code to push it all into the GPU, but likely most GPUs wouldn't have enough VRAM to run it and everyone would be complaining about that instead.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59914 - Posted: 13 Feb 2023 | 7:27:30 UTC - in response to Message 59913.
Last modified: 13 Feb 2023 | 7:27:54 UTC

Hello everyone,

this is exactly correct: agents collect data from their interaction with the environment (running on CPU), and the data is subsequently used to update the neural network that controls action selection (on GPU).

Having multiple agents allows data to be collected in parallel, speeding up training.
____________
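
The collect-then-update loop described in the post above can be sketched in a few lines of Python. This is a toy under stated assumptions and not the project's run.py: the environment, the reward, the step count and the update rule (keep the best-rewarded 10% of actions and average them) are all invented for illustration, and threads stand in for the separate CPU processes the real agents use. Only the figure of 32 agents comes from this thread.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_AGENTS = 32          # number of agents mentioned in this thread
STEPS_PER_ROLLOUT = 50   # hypothetical
TARGET_ACTION = 3.0      # hypothetical toy environment: best action is 3.0


def collect_rollout(agent_id: int, policy_mean: float, iteration: int):
    """One agent interacts with its own toy environment (the CPU-side work)."""
    rng = random.Random(iteration * 1000 + agent_id)  # per-agent, reproducible
    samples = []
    for _ in range(STEPS_PER_ROLLOUT):
        action = rng.gauss(policy_mean, 1.0)
        reward = -(action - TARGET_ACTION) ** 2
        samples.append((action, reward))
    return samples


def update_policy(all_samples):
    """Stand-in for the GPU-side network update: average the best 10% of actions."""
    ranked = sorted(all_samples, key=lambda s: s[1], reverse=True)
    best = ranked[: len(ranked) // 10]
    return sum(action for action, _ in best) / len(best)


def train(iterations: int = 5) -> float:
    policy_mean = 0.0
    for iteration in range(iterations):
        # All agents collect data concurrently with the same current policy.
        with ThreadPoolExecutor(max_workers=NUM_AGENTS) as pool:
            rollouts = list(
                pool.map(
                    lambda agent_id: collect_rollout(agent_id, policy_mean, iteration),
                    range(NUM_AGENTS),
                )
            )
        # The pooled data then drives a single policy update.
        policy_mean = update_policy([s for rollout in rollouts for s in rollout])
    return policy_mean


print(train())
```

More agents mean more samples per update for the same wall-clock time, which is the speed-up referred to above, provided the host has enough CPU threads to run them.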

Ryan Munro
Message 59915 - Posted: 13 Feb 2023 | 15:04:18 UTC

I think I am going a bit mad. I set the app_config file to use 0.33 GPU to try to get more units running at the same time, then remembered that 2 is the max. However, this config seemed to go faster when running 2: units completed 25% in about 3 hours, and normally I think the units take a lot longer than this.
I will need to take a week or so to double-check this though.
What's the optimal config at the moment? this is my current one:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>8</cpu_usage>
      <gpu_usage>0.5</gpu_usage>
    </gpu_versions>
  </app>
</app_config>

Pop Piasa
Message 59916 - Posted: 13 Feb 2023 | 19:17:59 UTC - in response to Message 59915.
Last modified: 13 Feb 2023 | 19:40:30 UTC

Ryan, here's what works for me:

<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd3</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>2</project_max_concurrent>
  <report_results_immediately/>
</app_config>

You can change the numbers whenever ACEMDs are available and allow them to run concurrent with a Python.

You will need to adjust the CPU figures to match your present appconfig.
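
To see what a given pair of figures implies, here is a small Python sketch. It assumes the usual BOINC behaviour that tasks are started on a GPU as long as their gpu_usage values sum to at most 1, and that each running task reserves cpu_usage threads; the function names are invented for the example, and any project-side limit on tasks per host is modelled only through max_concurrent.

```python
import math


def tasks_per_gpu(gpu_usage: float) -> int:
    """How many tasks fit on one GPU if their gpu_usage must sum to <= 1."""
    # The small epsilon guards against floating-point error for values like 0.2.
    return math.floor(1.0 / gpu_usage + 1e-9)


def plan(gpu_usage: float, cpu_usage: float, num_gpus: int = 1, max_concurrent=None):
    """Estimate concurrent tasks and CPU threads reserved (hypothetical helper)."""
    tasks = tasks_per_gpu(gpu_usage) * num_gpus
    if max_concurrent is not None:
        tasks = min(tasks, max_concurrent)
    return {"concurrent_tasks": tasks, "cpu_threads_reserved": tasks * cpu_usage}


print(plan(gpu_usage=0.5, cpu_usage=8))                      # Ryan's config
print(plan(gpu_usage=0.33, cpu_usage=8, max_concurrent=2))   # 0.33 GPU, capped at 2
print(plan(gpu_usage=0.5, cpu_usage=1, max_concurrent=1))    # PythonGPU entry above
```

Note that cpu_usage only tells BOINC how many threads to set aside when scheduling other work; it does not change how many processes the Python task itself spawns.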

(Many thanks Richard Haselgrove, for helping me upthread)

Ryan Munro
Message 59917 - Posted: 14 Feb 2023 | 16:09:12 UTC

Thanks, is 1 CPU per python unit enough? what times are you getting per unit? when I run 8 threads per unit and other tasks on the spare threads my CPU is always running at 100%.

KAMasud
Message 59918 - Posted: 14 Feb 2023 | 17:10:42 UTC

It is not about how many threads your machine has, it is about how many tasks you can run alongside a Python. I have a six-core, twelve-thread CPU but can only run three Einstein WUs, and my CPU peaks at 82%. A fine balancing act is required, and sometimes a GPUGrid WU arrives and I have to suspend other work.
I have also reached the limit of my 16GB RAM, though only sometimes. These AI WUs seem to be outdoing us. Monitoring is also required. Pop will explain.

Keith Myers
Message 59920 - Posted: 14 Feb 2023 | 19:35:24 UTC
Last modified: 14 Feb 2023 | 19:36:00 UTC

Anybody else getting sent Python tasks for the old 1121 app? I have been using the newer 1131 app and it has worked fine on all tasks.

I don't even have the old 1121 app anymore since I did a project reset to use the new python job file for reduced cpu usage.

The 1121 app tasks are instant erroring out.

Erich56
Message 59921 - Posted: 14 Feb 2023 | 20:10:49 UTC - in response to Message 59920.

Anybody else getting sent Python tasks for the old 1121 app?

not so far

Keith Myers
Message 59922 - Posted: 14 Feb 2023 | 20:33:08 UTC - in response to Message 59921.

Based on the number of _x issues of these tasks, and everyone else erroring out, it must be a scheduler issue.

Ian&Steve C.
Message 59923 - Posted: 14 Feb 2023 | 23:20:18 UTC

I've received some of them so far. they fail within like 10 seconds.

looks like someone at the project put the old v4.01 linux app up. these seem not compatible with the new experiment. I'm guessing someone enabled that application by accident.

abouh, you probably need to pull this app version back down to prevent it from being sent out. and leave the working v4.03 up.
____________

Pop Piasa
Message 59924 - Posted: 15 Feb 2023 | 7:23:10 UTC - in response to Message 59917.

is 1 CPU per python unit enough?

Ryan, you have a professional market CPU so I can't tell you from experience. Also, I haven't experimented with the CPU figures much yet.

I run 1 Python at a time because my hosts are limited in comparison to yours.
Seeing your host it looks to me like you can run 2 Pythons simultaneously.
(Perhaps Erich56 might share how he manages his very capable i9 Windows host.)

what times are you getting per unit?

When left to run with no competition for CPU time, my hosts finish a Python task in somewhere between 9 and 12 hrs., depending on the host's CPU.
I've found that running either a CPU task or a second GPU task alongside a Python slows it down noticeably, adding an hour or two to the observed run time. This is quite acceptable in my opinion if running one of the ACEMD tasks concurrently, whenever they're available.

Profile Retvari Zoltan
Message 59925 - Posted: 15 Feb 2023 | 13:43:25 UTC - in response to Message 59920.
Last modified: 15 Feb 2023 | 13:44:53 UTC

Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out.
I had four. All have failed on my host, but one of them finished on the 7th resend.
Edit: because that was the 1131 app.

Ian&Steve C.
Message 59926 - Posted: 15 Feb 2023 | 13:46:16 UTC - in response to Message 59925.
Last modified: 15 Feb 2023 | 13:46:46 UTC

Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out.
I had four. All have failed on my host, but one of them finished on the 7th resend.


notice that the host that finished it was with the working v4.03 app. not the troublesome v4.01.

the problem is the app that gets assigned to the task, not the task itself.

the v4.01 linux app needs to be pulled from the apps list so the scheduler stops trying to use it.
____________

Ian&Steve C.
Message 59927 - Posted: 15 Feb 2023 | 21:38:17 UTC - in response to Message 59926.

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.

hopefully someone from the project notices these posts to take it down soon.
____________

kotenok2000
Message 59928 - Posted: 15 Feb 2023 | 22:59:17 UTC
Last modified: 15 Feb 2023 | 22:59:49 UTC

Does anyone have problems running GPUGrid with the latest Windows update?
[Version 10.0.22621.1265]

I had to revert it.

Pop Piasa
Message 59929 - Posted: 16 Feb 2023 | 0:28:29 UTC - in response to Message 59927.
Last modified: 16 Feb 2023 | 0:32:20 UTC

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.


Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24hrs before they start.

I am guessing that the error in the script doesn't corrupt the app in windows somehow. I wish I knew why.

Ian&Steve C.
Message 59930 - Posted: 16 Feb 2023 | 1:27:45 UTC - in response to Message 59929.

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.


Ian , I've noticed that you had sent back a couple of the tasks I finished. I Thought you were doing as I do and aborting those that won't finish in 24hrs before they start.

I am guessing that the error in the script doesn't corrupt the app in windows somehow. I wish I knew why.


the error is not with the script or task configuration at all.

the problem is the application version that the project is sending.

Windows only has one app version, v4.04. Windows hosts will not see a problem with this.

Linux used to have only one also, v4.03 which works fine. but something happened a few days ago where the project put up the old v4.01 app for linux from 2021. the scheduler will try to send this app randomly to compatible hosts (any app currently able to run cuda 1131 can also run 1121, so it will send one or the other by chance). this is the problem. it's randomly sending some tasks assigned with the v4.01 app which is not compatible with these newer tasks.

https://gpugrid.net/apps.php
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59931 - Posted: 16 Feb 2023 | 7:01:55 UTC
Last modified: 16 Feb 2023 | 7:18:14 UTC

I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following right?

application ./gpugridpy/bin/python missing


I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins, and they said the scheduler had not been changed.

I know there has been some development recently and a new app has been deployed (ATM) but I would not expect this to affect the PythonGPU app. I will do some digging today, hopefully I can find what happened.
____________

Richard Haselgrove
Message 59932 - Posted: 16 Feb 2023 | 9:41:30 UTC
Last modified: 16 Feb 2023 | 9:47:04 UTC

I've been away for a few days, concentrating on another project, and came back to this. I still have the v4.03 files (although I'd reset away the v4.01 files).

So, experimentally, I allowed new work, and suspended the single task issued before it had finished downloading. I got task 33308822 - a _6 resend issued with a new copy of the v4.01 files.

So, I stopped BOINC, and carefully edited client_state.xml: the version number to 403 in both <workunit> and <result>, and the plan_class to 1131 in <result> (three changes in all). It's running normally now: we'll see what happens when it reports in about 8 hours time.
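
For anyone wanting to script the same three changes, the following Python sketch shows the idea on a tiny made-up fragment rather than a real client_state.xml. The element layout, the workunit name example_wu and the plan class strings cuda1121 and cuda1131 are assumptions inferred from this thread, and retarget_task is a hypothetical helper. As with the manual edit, BOINC should be stopped and the file backed up before anything like this is tried on a live client.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """<client_state>
  <workunit>
    <name>example_wu</name>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
  </workunit>
  <result>
    <name>example_wu_6</name>
    <wu_name>example_wu</wu_name>
    <version_num>401</version_num>
    <plan_class>cuda1121</plan_class>
  </result>
</client_state>"""


def retarget_task(root, wu_name: str, version_num: int, plan_class: str) -> int:
    """Point one workunit and its result at another app version.

    Returns the number of fields changed (three for one workunit with one result).
    """
    changes = 0
    for workunit in root.iter("workunit"):
        if workunit.findtext("name") == wu_name:
            workunit.find("version_num").text = str(version_num)
            changes += 1
    for result in root.iter("result"):
        if result.findtext("wu_name") == wu_name:
            result.find("version_num").text = str(version_num)
            result.find("plan_class").text = plan_class
            changes += 2
    return changes


root = ET.fromstring(SAMPLE)
print(retarget_task(root, "example_wu", 403, "cuda1131"))
print(ET.tostring(root, encoding="unicode"))
```

This only works if the v4.03 files are still present on the host, since the edit changes which app version the client runs but does not download anything.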

Edit: the _5 replication (task 33308656) was issued as version 4.03, but failed because file pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2 couldn't be found. That needs to be checked on the server - are the app_version files still there?

Aurum
Message 59933 - Posted: 16 Feb 2023 | 11:18:14 UTC - in response to Message 59931.
Last modified: 16 Feb 2023 | 11:54:22 UTC

I it [sic] so weird that suddenly jobs are sent to [sic] the wrong app version
I haven't run python WUs in a while but when I started them today I first got a pair of 4.01s that both failed and had this message:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0

Please update conda by running

$ conda update -n base -c defaults conda

The next WUs that replaced them were 4.03s and are running fine. Not sure how to check if I now have 23.1.0 installed.

Ian&Steve C.
Message 59934 - Posted: 16 Feb 2023 | 12:14:58 UTC - in response to Message 59931.

I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following right?

application ./gpugridpy/bin/python missing


I did not change the run.py scripts code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins and said the scheduler had not been changed.

I know there has been some development recently and a new app has been deployed (ATM) but I would not expect this to affect the PythonGPU app. I will do some digging today, hopefully I can find what happened.


There’s nothing wrong with your scripts.

You need to remove the app version 4.01 from the server apps list. So it’s not an option to choose.
____________

Richard Haselgrove
Message 59935 - Posted: 16 Feb 2023 | 13:57:14 UTC

My second machine is coming free soon, so I've downloaded a task for that one, too.

That's arrived as v4.03, so no editing necessary. If the later app has now been given top priority (as it should have been all along), that's fine by me. I agree that v4.01 should be deprecated off the apps page, but it's a less urgent task - they may still need it as evidence for the post-mortem, while they're trying to work out what went wrong.

Richard Haselgrove
Message 59940 - Posted: 16 Feb 2023 | 17:51:26 UTC - in response to Message 59932.

task 33308822

has finished and has been deemed to be valid. So if it happens again, and you still have the v4.03 files, changing the version numbers is a valid option.

Richard Haselgrove
Message 59942 - Posted: 16 Feb 2023 | 18:40:17 UTC

Good thing I checked. Just got allocated two brand new tasks, created today, and they both came allocated to v4.01

I didn't manage to reach the first in time, and it errored (as expected). I did catch the second, modified it as before, and it's running under v4.03

The beginnings of a suspicion are forming in my mind, and I'll check it when the second machine is ready for another fetch.

Ian&Steve C.
Message 59943 - Posted: 16 Feb 2023 | 18:45:06 UTC - in response to Message 59942.
Last modified: 16 Feb 2023 | 18:45:51 UTC

probably would be more effective to just rename/replace the job setup files (jobs.xml, and zipped package). then set <dont_check_file_sizes>. this way it will call what it thinks is the 4.01 files, but it's really calling the 4.03 files. and you wont need to be constantly stopping BOINC to edit the client state each time.

but I'm just going to keep aborting stuff until the project figures out how to de-publish the bad app. I'm not sure what the hold up or confusion is there. they publish and remove apps all the time, and I've explained the issue several times. all they need to do is remove 4.01 from the apps list. they should know exactly how to do this.
____________

Richard Haselgrove
Message 59944 - Posted: 16 Feb 2023 | 18:59:08 UTC - in response to Message 59943.

It would be easier to simply delete the v4.01 <app_version> and clone the v4.03 section. Then it's just a couple of one-character changes to the version number and the plan class.

I'll try that when there's no GPUGrid task running, and I've got time to think.

Richard Haselgrove
Message 59945 - Posted: 17 Feb 2023 | 12:21:35 UTC

Well, no new Python tasks this morning, but I've got a couple of resends.

The first, on host 508381, came through as v4.03, and is running normally.

The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued.

There seems to be no rhyme nor reason to it. Take a look at the tasks for the most recent host that failed for the first resend: host 602633. That one's been sent v4.01 and v4.03 seemingly at random - which blows the theory I was trying to dream up out of the water. If there's no coherent pattern to what should be a deterministic process, I'm not surprised the project team are stumped. But the answer has to stay the same: KILL OFF v4.01 FOR GOOD.

Ian&Steve C.
Message 59946 - Posted: 17 Feb 2023 | 12:36:45 UTC - in response to Message 59945.
Last modified: 17 Feb 2023 | 12:42:21 UTC

The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued


that's exactly why I suggested to replace the archive and job.xml files with the ones from the 4.03 app (along with the dont_check_file_sizes flag), so you don't have to keep editing the client state file. with replacing the package files instead, it thinks it already has the 4.01 files and uses them unaware that they are really the 4.03 files.

but yes, what really needs to happen is the removal of 4.01 from the project side.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59947 - Posted: 17 Feb 2023 | 16:37:11 UTC
Last modified: 17 Feb 2023 | 16:39:49 UTC

I have asked the project admins to deprecate versions 4.01 and 4.02. Sorry for the delay, I could not do it myself.

I am not sure what caused the sudden change, but I hope it is now fixed. Please let me know if the problem continues and I will try to solve it.

Happy weekend to everyone!
____________

Ian&Steve C.
Message 59949 - Posted: 17 Feb 2023 | 16:54:38 UTC - in response to Message 59947.

Thanks abouh! I see that the v4.01 app is now gone from the applications page, so that should solve the issue for everyone :)

I see Python tasks are winding down. do you have another experiment lined up to last over the weekend?
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 59950 - Posted: 17 Feb 2023 | 17:22:15 UTC - in response to Message 59949.

Yes, over the weekend I will review the results of the 2 experiments that just finished and start new ones. The idea is to continue as before, with two populations of 1000 agents (tasks) each.
____________

Richard Haselgrove
Message 59951 - Posted: 17 Feb 2023 | 19:11:13 UTC

And thanks from me, too. That went very smoothly, and allocation of v4.03 hasn't been disturbed. Another resend has arrived for processing when this one finishes, without manual intervention.

Aurum
Message 59952 - Posted: 18 Feb 2023 | 15:26:40 UTC - in response to Message 59933.


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0

Please update conda by running

$ conda update -n base -c defaults conda

Does anyone know if I need to install Miniconda and/or Anaconda to satisfy this error message?
E.g.: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
My Linux Mint Synaptic Package Manager can't find any program containing "conda."
Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Message 59953 - Posted: 18 Feb 2023 | 15:46:54 UTC - in response to Message 59952.

Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers?

It's not an error, it's simply a warning - information, if you like.

The project supply the conda package (which is why Mint doesn't know about it), and they're obviously happy with the version they're using. You don't need to do anything.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 59954 - Posted: 18 Feb 2023 | 15:53:02 UTC - in response to Message 59952.
Last modified: 18 Feb 2023 | 15:57:14 UTC


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0

Please update conda by running

$ conda update -n base -c defaults conda

Does anyone know if I need to install Miniconda and/or Anaconda to satisfy this error message?
E.g.: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
My Linux Mint Synaptic Package Manager can't find any program containing "conda."
Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers?


Even if you installed it, it wouldn't change anything, and you'd get the same warning message. As Richard wrote, these tasks use their own environment. They do not use your locally installed conda at all, which is why they work on systems that have no conda installed. This is all by design, to avoid any version conflicts or dependencies on the local system. It has been this way from the beginning.

Additionally, this message was only present when trying to run the old, incompatible 4.01 app. You do not get that message from the correct 4.03 app. 4.01 was re-published by accident and is an app version about 1.5 years old. It is not compatible with the design, structure, and requirements of how these tasks function today. The project admins have removed this version, so you won't see this problem again.
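
To check this kind of thing on your own machine, here is a minimal Python sketch (not part of the GPUGrid app) that reports which interpreter is running and whether its environment lives under a given directory. The helper name `interpreter_is_inside` is made up for illustration. Run with the interpreter unpacked into a BOINC slot directory, it would be expected to point at that slot rather than at any system-wide conda.

```python
import sys
from pathlib import Path


def interpreter_is_inside(directory, prefix=None):
    """Return True if the Python installation at `prefix` lives under `directory`.

    `prefix` defaults to sys.prefix, the root of the running environment.
    """
    prefix_path = Path(prefix if prefix is not None else sys.prefix).resolve()
    directory_path = Path(directory).resolve()
    try:
        # relative_to raises ValueError when prefix_path is not below directory_path
        prefix_path.relative_to(directory_path)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("environment prefix:", sys.prefix)
    print("inside current directory:", interpreter_is_inside(Path.cwd()))
```

If the prefix printed is inside the task's own folder, a conda installed elsewhere on the system plays no part in the run.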
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59955 - Posted: 19 Feb 2023 | 6:59:35 UTC

What catches my eye:

The Python tasks I have downloaded within the past 2 days seem to use a lot less system memory than the earlier ones.
Has Abouh made any changes to this effect?

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 59956 - Posted: 19 Feb 2023 | 17:49:40 UTC

Name Abouh Meaning
Exceptional qualities that make this name special are negotiation skills and an amazing sense of tact. When developed these two assets will help you get what you want and achieve all goals.

Cooperation is a key aspect of your life!

Since you are far less successful in life if you do not find a level of unity with others. Most problems that people find too tricky to solve are often no match against your ingenuity.

Harmony in your surroundings is a key to happiness and feelings of relaxation. Having friends or family fight affects you greatly in a negative sense. That is why you have the reputation of being a peacemaker (only out of necessity).

At your best you become very kind hearted, charming and full of positive energy. An amazing person to spend time with.

feri
Joined: 31 Mar 20
Posts: 2
Credit: 139,952,008
RAC: 0
Level
Cys
Message 59957 - Posted: 19 Feb 2023 | 19:56:47 UTC

Hi all,
I have been contributing to GPUGrid since 2020 with one GTX 1080 Ti and a 4-core non-HT CPU.
Since the days of the mostly GPU-based acemd tasks, I see a lot has changed regarding the effective hardware requirements,
so I am wondering what a reasonably optimal hardware setup is for the current Python tasks, or what is potentially the biggest current bottleneck:
- CPU core/thread count?
- RAM size?
- RAM speed vs. latency preference?
- SSD speed?
I agree that ECC RAM is much needed for long runtimes and nonstop usage.
___________
Frank from Slovakia, EU

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 59958 - Posted: 19 Feb 2023 | 21:49:01 UTC - in response to Message 59957.

For Python, you don't need to worry about anything other than having 8-10 cpu cores to support the task.

Enough system memory, at least 16GB.

Enough virtual swap space, at least 50GB.

Enough VRAM on the gpu, at least 4-6GB.
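
The thresholds above can be turned into a quick self-check. This is only a sketch: the `Host` class, the `MINIMUMS` table and the `shortfalls` helper are hypothetical names, the numbers are simply the lower bounds quoted in this post (suggestions, not official project requirements), and the example host values are invented.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cpu_cores: int
    ram_gb: float
    swap_gb: float
    vram_gb: float


# Lower bounds quoted in this post (suggestions, not official requirements).
MINIMUMS = {"cpu_cores": 8, "ram_gb": 16, "swap_gb": 50, "vram_gb": 4}


def shortfalls(host):
    """Return {field: (actual, suggested_minimum)} for each value below the minimum."""
    return {
        name: (getattr(host, name), minimum)
        for name, minimum in MINIMUMS.items()
        if getattr(host, name) < minimum
    }


# Invented example: a 4-core host with 16 GB RAM, 50 GB swap and an 11 GB GPU.
example = Host(cpu_cores=4, ram_gb=16, swap_gb=50, vram_gb=11)
for name, (actual, minimum) in shortfalls(example).items():
    print(f"{name}: have {actual}, suggested at least {minimum}")
```

An empty result means the host meets every suggested minimum. A shortfall in CPU cores does not necessarily stop tasks from running, but it is likely to slow them down.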

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Message 59959 - Posted: 21 Feb 2023 | 1:25:10 UTC - in response to Message 59957.

feri, if I might add to Keith Myers' excellent synopsis: the speed at which these tasks run appears to depend more on CPU ability than on GPU ability. You might want to consider that if you are thinking about assembling a host dedicated to running Pythons and you have an old GTX 1060 6GB, or something else with sufficient VRAM (GTX 1650), lying around.

feri
Joined: 31 Mar 20
Posts: 2
Credit: 139,952,008
RAC: 0
Level
Cys
Message 59962 - Posted: 22 Feb 2023 | 22:12:09 UTC - in response to Message 59959.

A friend of mine actually has a GTX 1060 6GB lying around.
Thanks for the insights.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 59963 - Posted: 22 Feb 2023 | 23:35:29 UTC - in response to Message 59962.

Look at Richard Haselgrove's results with 6GB GTX 1060's

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Message 59966 - Posted: 23 Feb 2023 | 7:44:19 UTC - in response to Message 59963.

Look at Richard Haselgrove's results with 6GB GTX 1060's

They're GTX 1660s, but 6 GB is right. They run fine on a setting of 3 CPUs + 1 GPU - a bit over 8 hours for the current jobs.

Relles
Joined: 1 Nov 17
Posts: 2
Credit: 29,189,111
RAC: 58,395
Level
Val
Message 59982 - Posted: 25 Feb 2023 | 13:44:03 UTC
Last modified: 25 Feb 2023 | 13:45:07 UTC

I've noticed that on the same computer (with dual boot), tasks finish almost twice as fast on Ubuntu compared to Windows. I've tried running tasks on Linux only a few days ago and did so on Windows before.
Has there been any recent change or do tasks just compute more efficiently on Linux?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 59983 - Posted: 25 Feb 2023 | 15:25:47 UTC - in response to Message 59982.

They have always been faster on Linux
____________

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 59984 - Posted: 25 Feb 2023 | 15:29:24 UTC - in response to Message 59982.

... tasks finish almost twice as fast on Ubuntu compared to Windows. I've tried running tasks on Linux only a few days ago and did so on Windows before.


They have always been faster on Linux


That's correct. What surprises me, though, is that tasks finish almost twice as fast. I don't think this was true before, was it?

Relles
Joined: 1 Nov 17
Posts: 2
Credit: 29,189,111
RAC: 58,395
Level
Val
Message 59985 - Posted: 25 Feb 2023 | 18:02:38 UTC - in response to Message 59984.

Close to 10 hours are needed on Windows and almost six on Linux. I also find the difference striking; that's why I asked.
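
Putting those two figures into numbers, here is a small Python sketch. The variable names are only for illustration, and the inputs are the rounded run times quoted above rather than measured values.

```python
# Rounded run times quoted in this post.
windows_hours = 10.0  # "close to 10 hours"
linux_hours = 6.0     # "almost six"

# Speedup: how many times faster Linux finishes the same task.
speedup = windows_hours / linux_hours

# Fraction of wall time saved by running on Linux instead of Windows.
time_saved = 1 - linux_hours / windows_hours

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved:.0%}")
```

That works out to roughly 1.67x, or about 40% less wall time on Linux. It is a large gap, but somewhat short of twice as fast.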

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Message 59995 - Posted: 26 Feb 2023 | 16:40:50 UTC

Is anyone else having problems getting the ATM tasks to upload? I have 4 completed jobs on 3 machines trying to upload and have not been able to make contact for nearly a day now. Two of the tasks are on one machine, which leaves that device unable to get any more work.

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 25,454,208,321
RAC: 13,144,023
Level
Trp
Message 59996 - Posted: 27 Feb 2023 | 19:00:24 UTC - in response to Message 59995.

I have several ATM tasks stuck in upload, and there are now 2 days left until the deadline.

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Message 59997 - Posted: 28 Feb 2023 | 2:07:03 UTC - in response to Message 59995.

I've been watching the ATMs on the Linux hosts (since they won't run on my windoze machines) to find a stderr file of a finished WU, so that I can study Linux while I try to learn it.
I haven't found one, only 'in process' ones; most show previous failures, which vary from host to host. I'd be interested to see a completed one if anybody can post a link.
Thanks.

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 59998 - Posted: 28 Feb 2023 | 2:32:08 UTC - in response to Message 59997.

I had a couple of the ATMs finish successfully a week ago, but they have long since been cleared from the database, so nobody can look at them.

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 25,454,208,321
RAC: 13,144,023
Level
Trp
Message 59999 - Posted: 28 Feb 2023 | 17:42:01 UTC - in response to Message 59997.

Here is a completed one, Pop Piasa:
https://www.gpugrid.net/result.php?resultid=33327466

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Message 60000 - Posted: 28 Feb 2023 | 19:40:14 UTC - in response to Message 59999.

Thanks Greger, it's good to have a successful example to compare with when examining errors. I appreciate it.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 60001 - Posted: 3 Mar 2023 | 10:26:16 UTC
Last modified: 3 Mar 2023 | 10:42:25 UTC

Windows here. You know, sometimes these WUs go to sleep; then I click the mouse and they start running again. It does not happen with all WUs.

task 33333635

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Message 60004 - Posted: 3 Mar 2023 | 10:44:06 UTC - in response to Message 60001.

Maybe you can change system power settings?
Disable spinning down hard drive for example?

ngkachun1982
Joined: 22 Apr 19
Posts: 3
Credit: 20,051,459
RAC: 76,218
Level
Pro
Message 60112 - Posted: 19 Mar 2023 | 14:57:35 UTC

My recent results uploaded to GPUGRID often end with "Error while computing" and lose all credit. I don't know why. What should I do?

Task 33359888
Workunit 27429785
Computer ID 604308
Sent 17 Mar 2023 | 13:14:49 UTC
Reported 19 Mar 2023 | 14:23:21 UTC
Status Error while computing
Run time 50,964.34
CPU time 50,964.34
Credit ---
Application Python apps for GPU hosts v4.04 (cuda1131)


19/3/2023 17:37:41 | | Starting BOINC client version 7.20.2 for windows_x86_64
19/3/2023 17:37:41 | | log flags: file_xfer, sched_ops, task
19/3/2023 17:37:41 | | Libraries: libcurl/7.84.0-DEV Schannel zlib/1.2.12
19/3/2023 17:37:41 | | Data directory: C:\ProgramData\BOINC
19/3/2023 17:37:41 | |
19/3/2023 17:37:41 | | CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 3060 (driver version 531.18, CUDA version 12.1, compute capability 8.6, 12288MB, 12288MB available, 12738 GFLOPS peak)
19/3/2023 17:37:41 | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce RTX 3060 (driver version 531.18, device version OpenCL 3.0 CUDA, 12288MB, 12288MB available, 12738 GFLOPS peak)
19/3/2023 17:37:41 | | Windows processor group 0: 20 processors
19/3/2023 17:37:41 | | Host name: NGcomputer
19/3/2023 17:37:41 | | Processor: 20 GenuineIntel 12th Gen Intel(R) Core(TM) i7-12700F [Family 6 Model 151 Stepping 2]
19/3/2023 17:37:41 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 tm2 pbe fsgsbase bmi1 smep bmi2
19/3/2023 17:37:41 | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
19/3/2023 17:37:41 | | Memory: 15.76 GB physical, 63.76 GB virtual
19/3/2023 17:37:41 | | Disk: 952.93 GB total, 700.19 GB free
19/3/2023 17:37:41 | | Local time is UTC +8 hours
19/3/2023 17:37:41 | | No WSL found.
19/3/2023 17:37:41 | | VirtualBox version: 7.0.6



Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 60113 - Posted: 19 Mar 2023 | 17:21:18 UTC - in response to Message 60112.

You have to look at the errored task results on the website to find why you errored.

Two of the tasks errored out because you don't have enough virtual memory available for the expansion phase where the task sets up its libraries.

On Windows it is advised to set up your system page file for at least 50GB size.
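
As a rough check against that advice, the BOINC startup log prints a line such as "Memory: 15.76 GB physical, 63.76 GB virtual". The Python sketch below assumes that the "virtual" figure is physical RAM plus page file (an assumption, not something confirmed in this thread) and estimates the page file size from it. `estimated_page_file_gb` is a hypothetical helper name.

```python
import re

# Matches the memory line printed in the BOINC client startup log.
MEMORY_LINE = re.compile(r"Memory:\s*([\d.]+) GB physical,\s*([\d.]+) GB virtual")

SUGGESTED_PAGE_FILE_GB = 50.0  # figure suggested in this thread


def estimated_page_file_gb(log_line):
    """Estimate page file size as virtual minus physical (assumption, see lead-in)."""
    match = MEMORY_LINE.search(log_line)
    if match is None:
        raise ValueError("no BOINC memory line found")
    physical_gb, virtual_gb = (float(value) for value in match.groups())
    return virtual_gb - physical_gb


# Memory line copied from the startup log posted earlier in the thread.
line = "19/3/2023 17:37:41 | | Memory: 15.76 GB physical, 63.76 GB virtual"
page_file_gb = estimated_page_file_gb(line)
print(f"estimated page file: {page_file_gb:.2f} GB")
print("meets suggestion:", page_file_gb >= SUGGESTED_PAGE_FILE_GB)
```

Under that assumption, the host in the log has a page file of about 48 GB, slightly below the suggested 50 GB.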

ngkachun1982
Joined: 22 Apr 19
Posts: 3
Credit: 20,051,459
RAC: 76,218
Level
Pro
Message 60118 - Posted: 20 Mar 2023 | 10:55:45 UTC - in response to Message 60113.

You have to look at the errored task results on the website to find why you errored.

Two of the tasks errored out because you don't have enough virtual memory available for the expansion phase where the task sets up its libraries.

On Windows it is advised to set up your system page file for at least 50GB size.


Thank you very much

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 60125 - Posted: 22 Mar 2023 | 5:07:40 UTC

The server status shows WU's are available but my machines have received no task since yesterday.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Message 60127 - Posted: 22 Mar 2023 | 7:44:52 UTC - in response to Message 60125.
Last modified: 22 Mar 2023 | 7:45:19 UTC

Hello!

The previous population experiment ended and I needed to analyse the results.
But I am starting a new experiment today.
____________

ngkachun1982
Joined: 22 Apr 19
Posts: 3
Credit: 20,051,459
RAC: 76,218
Level
Pro
Message 60131 - Posted: 22 Mar 2023 | 14:42:28 UTC

I don't understand why my task failed. Why?

Name e00002a04604-ABOU_rnd_ppod_expand_demos29_2_exp5-0-1-RND9901_1
Workunit 27434170
Created 20 Mar 2023 | 22:42:54 UTC
Sent 21 Mar 2023 | 0:31:58 UTC
Received 21 Mar 2023 | 11:05:25 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 604308
Report deadline 26 Mar 2023 | 0:31:58 UTC
Run time 13,385.72
CPU time 13,385.72
Validate state Invalid
Credit 0.00
Application version Python apps for GPU hosts v4.04 (cuda1131)
Stderr output

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
08:32:00 (19880): wrapper (7.9.26016): starting
08:32:00 (19880): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688

Everything is Ok

Size: 6410311680
Compressed: 1976180228
08:34:12 (19880): .\7za.exe exited; CPU time 107.906250
08:34:12 (19880): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
08:34:13 (19880): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
08:34:13 (19880): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)

7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)

Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII

Everything is Ok

Files: 38141
Size: 6380353601
Compressed: 6410311680
08:35:04 (19880): .\7za.exe exited; CPU time 9.515625
08:35:04 (19880): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
08:35:05 (19880): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
08:35:05 (19880): wrapper: running python.exe (run.py)
Windows fix executed.
Detected GPUs: 1
Define environment factory
Define algorithm factory
Define storage factory
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {16368} normal block at 0x0000025533011E30, 8 bytes long.
Data: < 4U > 00 00 D1 34 55 02 00 00
..\lib\diagnostics_win.cpp(417) : {15114} normal block at 0x000002553306C260, 1080 bytes long.
Data: < > D8 1D 00 00 CD CD CD CD 8C 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {550} normal block at 0x0000025532FFBE70, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{536} normal block at 0x0000025532FFEC80, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{531} normal block at 0x0000025533009E00, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x000002553300A940, 44 bytes long.
Data: < a 3U > 01 00 00 00 00 00 CD CD 61 A9 00 33 55 02 00 00
{521} normal block at 0x000002553300A080, 44 bytes long.
Data: < 3U > 01 00 00 00 00 00 CD CD A1 A0 00 33 55 02 00 00
Object dump complete.
16:26:14 (3936): wrapper (7.9.26016): starting
16:26:14 (3936): wrapper: running python.exe (run.py)
Windows fix executed.
Detected GPUs: 1
Define environment factory
Define algorithm factory
Define storage factory
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 471, in <module>
main()
File "run.py", line 136, in main
learner.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 242, in collect_train_data
obs2, reward, done2, episode_infos = self.envs_train.step(clip_act)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\agent\env\vec_envs\vec_env_base.py", line 85, in step
return self.step_wait()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\agent\env\vec_envs\vector_wrappers.py", line 72, in step_wait
obs = torch.from_numpy(obs).float().to(self.device)
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
19:03:33 (3936): python.exe exited; CPU time 10494.140625
19:03:33 (3936): app exit status: 0x1
19:03:33 (3936): called boinc_finish(195)
0 bytes in 0 Free Blocks.
552 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 179414097 bytes.
Dumping objects ->
{16455} normal block at 0x000001D92CFBFBE0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{16414} normal block at 0x000001D92CFC08F0, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{16403} normal block at 0x000001D92CFBFF50, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{16392} normal block at 0x000001D92CFC0790, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{16381} normal block at 0x000001D92CFC0630, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{16370} normal block at 0x000001D92CFC0160, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{16289} normal block at 0x000001D92CF9C280, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {16286} normal block at 0x000001D92CFB20C0, 8 bytes long.
Data: < 8- > 00 00 38 2D D9 01 00 00
{15645} normal block at 0x000001D92CFAE470, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{15033} normal block at 0x000001D92CFB2840, 8 bytes long.
Data: <@ 7- > 40 18 37 2D D9 01 00 00
..\zip\boinc_zip.cpp(122) : {550} normal block at 0x000001D92CF9B820, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{537} normal block at 0x000001D92CFAA2D0, 32 bytes long.
Data: < , P , > B0 A9 FA 2C D9 01 00 00 50 AF FA 2C D9 01 00 00
{536} normal block at 0x000001D92CFC0580, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{531} normal block at 0x000001D92CFAA0F0, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x000001D92CFAAF50, 44 bytes long.
Data: < q , > 01 00 00 00 00 00 CD CD 71 AF FA 2C D9 01 00 00
{521} normal block at 0x000001D92CFAA9B0, 44 bytes long.
Data: < , > 01 00 00 00 00 00 CD CD D1 A9 FA 2C D9 01 00 00
{511} normal block at 0x000001D92CFBBDB0, 16 bytes long.
Data: < , > B0 AE FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{510} normal block at 0x000001D92CFAAEB0, 40 bytes long.
Data: < , input.zi> B0 BD FB 2C D9 01 00 00 69 6E 70 75 74 2E 7A 69
{503} normal block at 0x000001D92CFBCAA0, 16 bytes long.
Data: <h , > 68 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{502} normal block at 0x000001D92CFBCA10, 16 bytes long.
Data: <@ , > 40 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{501} normal block at 0x000001D92CFBCC50, 16 bytes long.
Data: < , > 18 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{500} normal block at 0x000001D92CFBB0C0, 16 bytes long.
Data: < , > F0 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{499} normal block at 0x000001D92CFBC980, 16 bytes long.
Data: < , > C8 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{498} normal block at 0x000001D92CFBAFA0, 16 bytes long.
Data: < , > A0 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x000001D92CFBBD20, 16 bytes long.
Data: <X , > 58 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x000001D92CFAAE10, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{494} normal block at 0x000001D92CFBCBC0, 16 bytes long.
Data: <0 , > 30 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x000001D92CF9C3B0, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{492} normal block at 0x000001D92CFBCE90, 16 bytes long.
Data: < , > 08 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{491} normal block at 0x000001D92CFAAFF0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{490} normal block at 0x000001D92CFBC350, 16 bytes long.
Data: < , > E0 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{489} normal block at 0x000001D92CFBC1A0, 16 bytes long.
Data: < , > B8 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{488} normal block at 0x000001D92CFBC8F0, 16 bytes long.
Data: < , > 90 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{487} normal block at 0x000001D92CFBB420, 16 bytes long.
Data: <h , > 68 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{486} normal block at 0x000001D92CFBBA50, 16 bytes long.
Data: <@ , > 40 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{485} normal block at 0x000001D92CFBC110, 16 bytes long.
Data: < , > 18 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{484} normal block at 0x000001D92CFAA730, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{483} normal block at 0x000001D92CFBC7D0, 16 bytes long.
Data: < , > F0 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x000001D92CFAA370, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{481} normal block at 0x000001D92CFBC6B0, 16 bytes long.
Data: < , > C8 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x000001D92CFAAC30, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{479} normal block at 0x000001D92CFBC620, 16 bytes long.
Data: < , > A0 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{478} normal block at 0x000001D92CFAE7A0, 480 bytes long.
Data: < , 0 , > 20 C6 FB 2C D9 01 00 00 30 AC FA 2C D9 01 00 00
{477} normal block at 0x000001D92CFBCE00, 16 bytes long.
Data: < , > 80 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{476} normal block at 0x000001D92CFBCD70, 16 bytes long.
Data: <X , > 58 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{475} normal block at 0x000001D92CFBB780, 16 bytes long.
Data: <0 , > 30 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{474} normal block at 0x000001D92CFAE6F0, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{473} normal block at 0x000001D92CFBC590, 16 bytes long.
Data: <x , > 78 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{472} normal block at 0x000001D92CFBB150, 16 bytes long.
Data: <P , > 50 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{471} normal block at 0x000001D92CFBC500, 16 bytes long.
Data: <( , > 28 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x000001D92CFBB300, 16 bytes long.
Data: < , > 00 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x000001D92CFBCCE0, 16 bytes long.
Data: < , > D8 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{468} normal block at 0x000001D92CFBCB30, 16 bytes long.
Data: < , > B0 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{467} normal block at 0x000001D92CFBC740, 16 bytes long.
Data: < , > 90 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x000001D92CFBBB70, 16 bytes long.
Data: <h , > 68 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x000001D92CFAA690, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{464} normal block at 0x000001D92CFBBAE0, 16 bytes long.
Data: <@ , > 40 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x000001D92CFA8510, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{462} normal block at 0x000001D92CFBC860, 16 bytes long.
Data: < , > 88 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x000001D92CFBB810, 16 bytes long.
Data: <` , > 60 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x000001D92CFBB030, 16 bytes long.
Data: <8 , > 38 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{459} normal block at 0x000001D92CFBC080, 16 bytes long.
Data: < , > 10 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{458} normal block at 0x000001D92CFBB9C0, 16 bytes long.
Data: < , > E8 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{457} normal block at 0x000001D92CFBE000, 16 bytes long.
Data: < , > C0 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{456} normal block at 0x000001D92CFBEB40, 16 bytes long.
Data: < , > A0 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x000001D92CFBDF70, 16 bytes long.
Data: <x , > 78 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x000001D92CFBDAF0, 16 bytes long.
Data: <P , > 50 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x000001D92CFA8460, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{452} normal block at 0x000001D92CFBDEE0, 16 bytes long.
Data: < , > 98 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x000001D92CFBD8B0, 16 bytes long.
Data: <p , > 70 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x000001D92CFBD790, 16 bytes long.
Data: <H , > 48 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x000001D92CFBDE50, 16 bytes long.
Data: < , > 20 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x000001D92CFBEAB0, 16 bytes long.
Data: < , > F8 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{447} normal block at 0x000001D92CFBD9D0, 16 bytes long.
Data: < , > D0 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{446} normal block at 0x000001D92CFBD700, 16 bytes long.
Data: < , > B0 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x000001D92CFBEA20, 16 bytes long.
Data: < , > 88 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x000001D92CFAA4B0, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{443} normal block at 0x000001D92CFBD820, 16 bytes long.
Data: <` , > 60 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x000001D92CFA38F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{441} normal block at 0x000001D92CFBDA60, 16 bytes long.
Data: < , > A8 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x000001D92CFBE900, 16 bytes long.
Data: < , > 80 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x000001D92CFBE870, 16 bytes long.
Data: <X , > 58 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{438} normal block at 0x000001D92CFBEBD0, 16 bytes long.
Data: <0 , > 30 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{437} normal block at 0x000001D92CFBE6C0, 16 bytes long.
Data: < , > 08 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{436} normal block at 0x000001D92CFBE480, 16 bytes long.
Data: < , > E0 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{435} normal block at 0x000001D92CFBD4C0, 16 bytes long.
Data: < , > C0 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x000001D92CFBDC10, 16 bytes long.
Data: < , > 98 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x000001D92CFBD430, 16 bytes long.
Data: <p , > 70 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x000001D92CFBEF70, 2976 bytes long.
Data: <0 , .\7za.ex> 30 D4 FB 2C D9 01 00 00 2E 5C 37 7A 61 2E 65 78
{69} normal block at 0x000001D92CFACC20, 16 bytes long.
Data: < ;* > 80 EA 3B 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{68} normal block at 0x000001D92CFACA70, 16 bytes long.
Data: <@ ;* > 40 E9 3B 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{67} normal block at 0x000001D92CFAC0E0, 16 bytes long.
Data: < W8* > F8 57 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x000001D92CFAC050, 16 bytes long.
Data: < W8* > D8 57 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x000001D92CFAC9E0, 16 bytes long.
Data: <P 8* > 50 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x000001D92CFAC680, 16 bytes long.
Data: <0 8* > 30 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x000001D92CFACB00, 16 bytes long.
Data: < 8* > E0 02 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x000001D92CFAC950, 16 bytes long.
Data: < 8* > 10 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x000001D92CFAC8C0, 16 bytes long.
Data: <p 8* > 70 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x000001D92CFAC710, 16 bytes long.
Data: < 6* > 18 C0 36 2A F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 60133 - Posted: 22 Mar 2023 | 14:50:47 UTC - in response to Message 60131.
Last modified: 22 Mar 2023 | 14:53:09 UTC

it's right in your message:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes."

that's why.

this is a known problem with the windows app. you need to increase your virtual memory (page file) to around 50GB.

also it looks like your host only has 16GB system RAM. if you're running other things that use lots of memory (like rosetta or einstein GW CPU tasks) then you might be running out of system memory too. these python tasks need about 10GB of system memory for each one.
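
As a rough illustration of the arithmetic above, the sketch below estimates how many Python tasks fit in physical RAM, assuming the roughly 10 GB per task quoted in this post. The function name and the `other_usage_gb` parameter are hypothetical, and the 10 GB figure is an approximation from this thread, not a measured requirement.

```python
# Rough memory budget for concurrent Python tasks.
# Assumption: ~10 GB of system RAM per task, as quoted in the post above.
RAM_PER_TASK_GB = 10


def max_tasks_by_ram(total_ram_gb: float, other_usage_gb: float = 0.0) -> int:
    """Return how many tasks fit in physical RAM after other workloads.

    other_usage_gb is a hypothetical allowance for whatever else is
    running on the host (other BOINC projects, the OS, and so on).
    """
    free_gb = max(total_ram_gb - other_usage_gb, 0.0)
    return int(free_gb // RAM_PER_TASK_GB)


if __name__ == "__main__":
    # A 16 GB host, first idle, then with 8 GB taken by other work.
    print(max_tasks_by_ram(16))
    print(max_tasks_by_ram(16, other_usage_gb=8))
```

The page file recommendation is separate from this estimate: it concerns virtual memory on Windows, not physical RAM.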

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 60298 - Posted: 8 Apr 2023 | 13:48:47 UTC

I am experiencing a strange problem on my PC with two RTX 3070s, an Intel i9-10900KF CPU (10 cores/20 threads) and 128 GB RAM:
until about 2 weeks ago, I crunched 4 Python tasks concurrently (2 per GPU).
Then I processed ACEMD_3 and ATM tasks, the queues of which have now run dry.
So I changed back to Python - and surprise: after downloading 4 tasks, only 3 started; the fourth one stays in status "ready to start".

I had made no changes, neither in the hardware, nor in the software, nor in the settings.
Does anyone have any idea what I can do to get the fourth task to run?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 60299 - Posted: 8 Apr 2023 | 16:16:24 UTC - in response to Message 60298.

A consequence of running the acemd3 and ATM tasks is that they lowered the APR (average processing rate) on the host, and now the client thinks that you will not be able to finish the second Python task before the deadline.

You probably have the single Python task in EDF (earliest deadline first) mode now.

Try adding <fraction_done_exact/> into every app section in your app_config.xml.

That helps produce more realistic progress percentages and may persuade the client to let you run that second task on that GPU.

But you may just have to let the APR mechanism balance out again. One of the many flaws in BOINC.
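
For reference, an app_config.xml of this shape can be generated with a few lines of Python. This is only a sketch: the app name `PythonGPU` and the `gpu_usage`/`cpu_usage` values are assumptions for illustration (0.5 GPU per task corresponds to two tasks per GPU), so check the actual app name reported by your client before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Build an app_config.xml string with <fraction_done_exact/> in the app section."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    # Empty element: tells the client to trust the app's reported fraction done.
    ET.SubElement(app, "fraction_done_exact")
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "PythonGPU", 0.5 and 1.0 are illustrative assumptions, not project defaults.
    print(build_app_config("PythonGPU", gpu_usage=0.5, cpu_usage=1.0))
```

The client has to re-read its config files (or be restarted) before a changed app_config.xml takes effect.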

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 60300 - Posted: 8 Apr 2023 | 16:57:26 UTC

thank you, Keith, for the explanation :-)

<fraction_done_exact/> has been in the app_config from the beginning.

So I am afraid I just need to wait ...

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 60345 - Posted: 23 Apr 2023 | 13:00:19 UTC

Absolutely no GPU usage, only CPU.
Task 27464783

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 60346 - Posted: 23 Apr 2023 | 13:21:39 UTC - in response to Message 60345.

for the first 5 minutes or so, there will only be CPU use and no GPU use because the task is extracting the python environment to the designated slot. after this, the task will run and start using both GPU and CPU. GPU use will be low.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 60347 - Posted: 23 Apr 2023 | 16:19:05 UTC

No, it was not at 5% but 29% and stuck. I exited BOINC and restarted. The WU is now normal at 34%.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 60348 - Posted: 23 Apr 2023 | 16:23:51 UTC - in response to Message 60347.

i said 5 minutes not 5%.

but sounds like an issue with your system, not the tasks. my tasks have never gotten stuck like that.

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 150,408,397
RAC: 215,366
Level
Ile
Message 60349 - Posted: 24 Apr 2023 | 11:11:31 UTC - in response to Message 60348.

It is 20 minutes on my HDD.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 534,774,311
RAC: 236,996
Level
Lys
Message 60350 - Posted: 24 Apr 2023 | 12:02:49 UTC - in response to Message 60348.

i said 5 minutes not 5%.

but sounds like an issue with your system, not the tasks. my tasks have never gotten stuck like that.

____________--

Chill, bro. Completed and validated.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1083
Credit: 40,330,187,595
RAC: 4,846,684
Level
Trp
Message 60351 - Posted: 24 Apr 2023 | 12:14:05 UTC - in response to Message 60349.

It is 20 minutes on my hdd.


that makes sense for a slower device like a HDD.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 60709 - Posted: 1 Sep 2023 | 5:48:28 UTC

Is the Python project dead?

Keith Myers
Joined: 13 Dec 17
Posts: 1372
Credit: 7,990,574,778
RAC: 2,443,816
Level
Tyr
Message 60710 - Posted: 1 Sep 2023 | 19:05:28 UTC - in response to Message 60709.

Is the Python project dead?

Could be. I haven't seen the researcher behind those task types around for quite a while.
He may have moved on, or he may just be taking a summer sabbatical or something.

TofPete
Joined: 17 Mar 24
Posts: 14
Credit: 63,306,570
RAC: 201,919
Level
Thr
Message 61744 - Posted: 28 Aug 2024 | 7:53:46 UTC

Hi,

I'm receiving the following error message after about 700-800 seconds of running time:

09:33:09 (32292): Library/usr/bin/tar.exe exited; CPU time 0.000000
09:33:09 (32292): wrapper: running C:/Windows/system32/cmd.exe (/c call Scripts\activate.bat && Scripts\conda-unpack.exe && run.bat)
Could not find platform independent libraries <prefix>
Python path configuration:
PYTHONHOME = (not set)
PYTHONPATH = (not set)
program name = '\\?\D:\ProgramData\BOINC\slots\4\python.exe'
isolated = 0
environment = 1
user site = 1
safe_path = 0
import site = 1
is in build tree = 0
stdlib dir = 'D:\ProgramData\BOINC\slots\4\Lib'
sys._base_executable = '\\\\?\\D:\\ProgramData\\BOINC\\slots\\4\\python.exe'
sys.base_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.base_exec_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.platlibdir = 'DLLs'
sys.executable = '\\\\?\\D:\\ProgramData\\BOINC\\slots\\4\\python.exe'
sys.prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.exec_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.path = [
'D:\\ProgramData\\BOINC\\slots\\4\\python311.zip',
'D:\\ProgramData\\BOINC\\slots\\4\\DLLs',
'D:\\ProgramData\\BOINC\\slots\\4\\Lib',
'\\\\?\\D:\\ProgramData\\BOINC\\slots\\4',
]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'

Current thread 0x000058b0 (most recent call first):
<no Python frame>
09:33:10 (32292): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
09:33:10 (32292): app exit status: 0x1
09:33:10 (32292): called boinc_finish(195)


Any idea why this error has started happening recently?

Thanks,

Peter

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,491,761,501
RAC: 19,308,652
Level
Trp
Message 61745 - Posted: 28 Aug 2024 | 8:10:00 UTC - in response to Message 61744.
Last modified: 28 Aug 2024 | 8:10:41 UTC

Hi,

I'm receiving the following error message after about 700-800 seconds of running time:
...
Any idea why this error has started happening recently?

When did you receive the task that you think is a Python task? Pythons have not been around for quite a while - just take a look at the server status page.

TofPete
Joined: 17 Mar 24
Posts: 14
Credit: 63,306,570
RAC: 201,919
Level
Thr
Message 61746 - Posted: 28 Aug 2024 | 9:04:06 UTC
Last modified: 28 Aug 2024 | 9:04:22 UTC

I think it's a Python task because the error message refers to a Python problem:

Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'


I got these tasks today and in recent days:
Task received (UTC) | Computing status | Runtime (s) | Application name
28 Aug 2024 8:31:51 UTC | Error while computing | 672.11 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 8:03:54 UTC | Error while computing | 703.63 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:34:52 UTC | Error while computing | 708.96 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:20:30 UTC | Error while computing | 714.93 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 8:17:39 UTC | Error while computing | 709.18 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:49:20 UTC | Error while computing | 724.49 | ATMML: Free energy with neural networks v1.01 (cuda1121)
27 Aug 2024 9:35:49 UTC | Error while computing | 776.90 | ATMML: Free energy with neural networks v1.01 (cuda1121)
27 Aug 2024 1:24:00 UTC | Error while computing | 60.60 | ATMML: Free energy with neural networks v1.01 (cuda1121)
26 Aug 2024 9:41:56 UTC | Error while computing | 20.18 | ATMML: Free energy with neural networks v1.01 (cuda1121)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1629
Credit: 9,658,057,693
RAC: 7,041,273
Level
Tyr
Message 61748 - Posted: 28 Aug 2024 | 9:35:46 UTC - in response to Message 61746.

Those describe themselves as ATMML tasks - the clue is in the name.

There's been a major problem with ATMML tasks in the last 24 hours - all workunits created since around 13:00 UTC yesterday have a systemic failure which causes them to fail very early.

That's the project's problem, not your problem.
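
For anyone who wants to look at a slot directory themselves, the sketch below checks the two places from which the interpreter in the log quoted earlier in this thread could load the `encodings` package: `Lib\encodings` under the prefix, or the `python311.zip` archive listed in `sys.path`. The function name is hypothetical, and this is a diagnostic aid only; it does not establish why the files would be missing.

```python
from pathlib import Path


def find_stdlib_encodings(prefix: str, zip_name: str = "python311.zip") -> list[str]:
    """Return the candidate sources of the 'encodings' package that exist under prefix.

    An empty list means neither Lib/encodings/__init__.py nor the zipped
    standard library is present, which is consistent with
    "ModuleNotFoundError: No module named 'encodings'".
    Note: the zip file merely existing does not prove it contains 'encodings'.
    """
    root = Path(prefix)
    candidates = [
        root / "Lib" / "encodings" / "__init__.py",
        root / zip_name,
    ]
    return [str(path) for path in candidates if path.is_file()]


if __name__ == "__main__":
    # Slot path copied from the log in this thread; adjust to your own slot.
    print(find_stdlib_encodings(r"D:\ProgramData\BOINC\slots\4"))
```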

TofPete
Joined: 17 Mar 24
Posts: 14
Credit: 63,306,570
RAC: 201,919
Level
Thr
Message 61749 - Posted: 28 Aug 2024 | 11:50:06 UTC - in response to Message 61748.

Thank you

Those describe themselves as ATMML tasks - the clue is in the name.

There's been a major problem with ATMML tasks in the last 24 hours - all workunits created since around 13:00 UTC yesterday have a systemic failure which causes them to fail very early.

That's the project's problem, not your problem.
