
Message boards : News : Experimental Python tasks (beta) - task description

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 56977 - Posted: 17 Jun 2021 | 10:40:32 UTC

Hello everyone, I just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.

What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. This idea is inspired by the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Also, every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGrid machines and report their discoveries and findings. Information from multiple agents can then be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge this is the first time something of this sort has been attempted on a GPUGrid-like platform, and it has the potential to scale to solve problems unattainable in smaller-scale settings.
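As a rough illustration only (none of this is the project's actual code; the one-parameter "agent", the fitness function, and every number below are made up), the generational scheme described above could be sketched like this:

```python
import random

def evaluate(params, target=0.5):
    # Toy fitness: closer to target is better (stand-in for an RL return).
    return -abs(params - target)

def train_generation(shared_knowledge, pop_size=8, noise=0.2, seed=0):
    """One generation: each agent starts from the common knowledge,
    explores locally, and the best discovery is folded back into
    the shared corpus."""
    rng = random.Random(seed)
    population = [shared_knowledge + rng.uniform(-noise, noise)
                  for _ in range(pop_size)]
    best = max(population, key=evaluate)
    # Conservative update: move the corpus halfway toward the best discovery.
    return 0.5 * shared_knowledge + 0.5 * best

knowledge = 0.0
for gen in range(20):
    knowledge = train_generation(knowledge, seed=gen)
```

In this toy version the shared "knowledge" drifts toward the optimum over generations, which is the same pooling idea at miniature scale.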

Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems over the previous weeks. Now, almost all agents train successfully.

Why are GPUs being underutilized, and what are the CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on the CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid.

More information:
We mainly use PyTorch to train our neural networks. We use TensorBoard only because it is convenient for logging; we might remove that dependency in the future.
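For illustration, logging could be wrapped so the TensorBoard dependency stays optional. This `Logger` class is hypothetical (not the project's code), but `torch.utils.tensorboard.SummaryWriter` and its `add_scalar` call are the standard PyTorch API:

```python
# Sketch: making TensorBoard an optional dependency.
try:
    from torch.utils.tensorboard import SummaryWriter  # optional
except ImportError:
    SummaryWriter = None

class Logger:
    """Keeps metrics in memory; mirrors them to TensorBoard if available."""
    def __init__(self, logdir="runs"):
        self.writer = SummaryWriter(logdir) if SummaryWriter else None
        self.history = {}

    def log(self, tag, value, step):
        self.history.setdefault(tag, []).append((step, value))
        if self.writer:
            self.writer.add_scalar(tag, value, step)

log = Logger()
for step in range(3):
    log.log("train/reward", step * 1.5, step)
```

If `torch` is absent, the code degrades gracefully to in-memory logging, which is one way the dependency could be removed later.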
____________

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 111,212,181
RAC: 269,583
Message 56978 - Posted: 17 Jun 2021 | 11:46:18 UTC
Last modified: 17 Jun 2021 | 12:08:24 UTC

Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier of the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now.

A couple of questions though:

1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the training data set look like? What activation functions do you use? Any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing its accuracy? How did you determine the model's base parameters to get started? How can you be sure that the initial model setup is getting you anywhere/is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Similar runtimes as the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or rephrased what main use cases/fields of research are currently imagined for the final model?
What do you envision to be

"problems [so far] unattainable in smaller scale settings"
?
5. What is the ultimate goal of this ML project? Have only one latest-generation group of trained agents at the end, as the result of the continuous reinforcement learning iterations? Or have several and test/benchmark them against each other?

Thx! Keep up the great work!

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 56979 - Posted: 17 Jun 2021 | 13:26:58 UTC - in response to Message 56977.

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.
____________

phi1258
Joined: 30 Jul 16
Posts: 4
Credit: 1,555,158,536
RAC: 0
Message 56989 - Posted: 18 Jun 2021 | 11:21:31 UTC - in response to Message 56977.

This is a welcome advance. Looking forward to contributing.



ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Message 56990 - Posted: 18 Jun 2021 | 12:04:08 UTC - in response to Message 56977.

Thank you very much for this advance.
I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of them not being singular any more...
Best wishes.

_heinz
Joined: 20 Sep 13
Posts: 16
Credit: 3,433,447
RAC: 0
Message 56994 - Posted: 20 Jun 2021 | 5:39:42 UTC
Last modified: 20 Jun 2021 | 5:43:47 UTC

Wish you success.
regards _heinz
____________

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,596,105,840
RAC: 24,899,084
Message 56996 - Posted: 21 Jun 2021 | 11:28:16 UTC - in response to Message 56979.

Ian&Steve C. wrote on June 17th:

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.

I am curious what the answer will be.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 57000 - Posted: 22 Jun 2021 | 12:17:47 UTC

also, can the team comment on not just GPU "under"utilization? these tasks have NO GPU utilization.

when will you start releasing tasks that do more than just CPU calculation? are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks use the GPU, ever, even the tasks that take ~1hr to crunch. it all happens on the single CPU thread allocated for the WU: 0% GPU utilization and no gpugrid processes reported in nvidia-smi.
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 57009 - Posted: 23 Jun 2021 | 20:09:29 UTC

I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly, and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.

MrS
____________
Scanning for our furry friends since Jan 2002

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 111,212,181
RAC: 269,583
Message 57014 - Posted: 24 Jun 2021 | 10:32:37 UTC

This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. My understanding is that this is an ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. Would highly appreciate feedback on the many great questions in this thread so far.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Message 57020 - Posted: 26 Jun 2021 | 7:53:10 UTC

https://www.youtube.com/watch?v=yhJWAdZl-Ck

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Message 58044 - Posted: 10 Dec 2021 | 11:32:51 UTC

I noticed some python tasks in my task history. All failed for me, and so far they have failed for everyone else. Has anyone completed any?

Example:
https://www.gpugrid.net/workunit.php?wuid=27100605

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Message 58045 - Posted: 10 Dec 2021 | 11:56:26 UTC - in response to Message 58044.

Host 132158 is getting some. The first failed with:

File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/

Looks like a typo.

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58058 - Posted: 11 Dec 2021 | 0:23:09 UTC

Shame the tasks are misconfigured. I ran through a dozen of them on one host, all with errors. With the scarcity of work, every little bit is appreciated and can be used.

We just got put back in good graces with a whitelist at Gridcoin too.

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58061 - Posted: 11 Dec 2021 | 2:16:29 UTC

@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is not always on the PATH.
It probably needs to be exported into the userland environment.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58104 - Posted: 14 Dec 2021 | 16:55:30 UTC - in response to Message 58045.

Hello everyone, sorry for the late reply.

We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.

The error was related to atari_py, as some users reported; more specifically, to installing this Python package from GitHub (https://github.com/openai/atari-py), which allows using some Atari 2600 games as a test bench for reinforcement learning (RL) agents.

Sorry for the inconvenience. Even though the AI agent part of the code has been tested and works, every time we need to test our agents in a new environment we need to modify the environment-initialisation part of the code to include the new environment, in this case atari_py.

I just sent another batch of 5 test jobs; 3 have already finished, and the others seem to be working without problems but have not yet finished.

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762


____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Message 58112 - Posted: 15 Dec 2021 | 15:31:49 UTC - in response to Message 58104.

Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58114 - Posted: 15 Dec 2021 | 16:12:09 UTC - in response to Message 58112.

The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing rounds.

It is the one that talks about a dependency called "pinocchio" and detects conflicts with it.

It seems the conda misconfiguration persisted on some machines. To solve this error it should be enough to click "reset" to reset the app.



____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Message 58115 - Posted: 15 Dec 2021 | 16:56:36 UTC - in response to Message 58114.

OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Message 58116 - Posted: 15 Dec 2021 | 19:29:54 UTC
Last modified: 15 Dec 2021 | 19:48:28 UTC

Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too.

Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine.

This host also has 16 GB system RAM: GPU is GTX 1660 Ti.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58117 - Posted: 15 Dec 2021 | 19:40:45 UTC - in response to Message 58114.
Last modified: 15 Dec 2021 | 19:43:12 UTC

I reset the project on my host. still failed.

WU: http://gpugrid.net/workunit.php?wuid=27102456

I see that ServicEnginIC and I both had the same error. we also both only have 16GB system memory on our host.

Aurum previously reported very high system memory use, but didn't elaborate on whether it was real or virtual.

However, I can elaborate further to confirm that it's real.

https://i.imgur.com/XwAj4s3.png

a lot of it seems to stem from the ~4GB used by the python run.py process, plus ~184MB for each of the 32x multiproc spawns that appear to be running. not sure if these are intended to run, or if they are an artifact of setup that never got cleaned up?

I'm not certain, but it's possible that the task ultimately failed due to lack of resources, with both RAM and swap maxed out. maybe the next host that gets it, a 64GB TR system, will succeed?

abouh, is it intended to keep this much system memory in use during these tasks? or is this just something left over that was supposed to be cleaned up? It might be helpful to know the exact system requirements so people with unsupported hardware do not try to run these tasks. if these tasks are going to use so much memory and all of the CPU cores, we should be prepared for that ahead of time.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58118 - Posted: 15 Dec 2021 | 23:25:46 UTC - in response to Message 58117.

I couldn't get your imgur image to load, just a spinner.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58119 - Posted: 16 Dec 2021 | 0:13:31 UTC - in response to Message 58118.

Yeah I get a message that Imgur is over capacity (first time I’ve ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later.
____________

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Message 58120 - Posted: 16 Dec 2021 | 0:26:37 UTC

I've had two tasks complete on a host that was previously erroring out:

https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116

Between 12:45:58 UTC and 19:44:33 UTC a task failed and then completed w/o any changes, resets, anything from me.

Wildly different runtime/credit ratios, I would expect something in between.

Run time      Credit        Credit/sec
 3,389.26      264,786.85    78/s
49,311.35       34,722.22    0.70/s

CUDA:
26,635.40      420,000.00    15.77/s

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58123 - Posted: 16 Dec 2021 | 9:44:51 UTC - in response to Message 58117.

Hello everyone,

The reset was only to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, relating to a dependency called "pinocchio". I have checked the jobs reported to have errors after resetting; it seems this error is not present in those jobs.

Regarding the memory usage, it is real, as you report. The ~4GB are from the main script containing the AI agent and the training process. The 32x multiproc spawns are intended; each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on the GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running 1 job. This could probably explain some job failures. Running all these environments in parallel is also more CPU-intensive, as mentioned. The process to train the AI interleaves phases of data collection from interactions with the environment instances (CPU-intensive) with phases of learning (GPU-intensive).

I will test locally whether the AI agent still learns by interacting with fewer instances of the environment at the same time; that could help reduce the memory requirements of future jobs a bit. However, for now the most immediate jobs will have similar requirements.


____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58124 - Posted: 16 Dec 2021 | 10:15:12 UTC - in response to Message 58120.

Yes, I was progressively testing how many steps the agents could be trained for, and I forgot to increase the credits proportionally to the training steps. I will correct that in the next batch. Sorry, and thanks for bringing it to our attention.
____________

PDW
Joined: 7 Mar 14
Posts: 15
Credit: 5,379,024,525
RAC: 29,979,635
Message 58125 - Posted: 16 Dec 2021 | 10:23:45 UTC - in response to Message 58123.

On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000.
That I can manage.

However, the task also spawns a process for each of the machine's threads (x), and from 1 to x of these processes can be running at any one time. The value x is based on the machine's threads, not on what BOINC is configured for; in addition, BOINC has no idea they exist, so they are not taken into account for scheduling purposes. The result is that the machine can at times be loading the CPU up to twice as much as expected. This I can't manage unless I run only one of these tasks and the machine does nothing else, which isn't going to happen.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58127 - Posted: 16 Dec 2021 | 14:18:23 UTC - in response to Message 58123.

thanks for the clarification.

I agree with PDW that running work on all CPUs threads when BOINC expects at most that 1 CPU thread will be used will be problematic for most users who run CPU work from other projects.

in my case, I did notice that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. You could in theory tell BOINC how much CPU these tasks are using by setting a value over 1 in app_config for the Python tasks. For example, it looks like only ~10% of a thread was being used, so for my 32-thread CPU that would equate to about 4 threads' worth (rounding up from 3.2). So maybe something like:

<app>
    <name>PythonGPU</name>
    <gpu_versions>
        <cpu_usage>4</cpu_usage>
        <gpu_usage>1</gpu_usage>
    </gpu_versions>
</app>

you'd have to pick a cpu_usage value appropriate for your CPU use, and test to see if it works as desired.
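That rule of thumb (per-spawn thread fraction times spawn count, rounded up) is simple arithmetic; as a sketch, with an invented function name:

```python
import math

def estimated_cpu_usage(num_spawns, thread_fraction_per_spawn):
    """Rough <cpu_usage> value for app_config.xml: total
    thread-equivalents consumed by the spawns, rounded up."""
    return math.ceil(num_spawns * thread_fraction_per_spawn)

# ~10% of a thread per spawn, 32 spawns -> 3.2, rounded up to 4
estimated_cpu_usage(32, 0.10)
```

A heavier per-spawn load, say 15%, would give 4.8 and hence a <cpu_usage> of 5, so the value does need re-measuring per machine.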
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Message 58132 - Posted: 16 Dec 2021 | 16:56:20 UTC - in response to Message 58127.

I agree with PDW that running work on all CPUs threads when BOINC expects at most that 1 CPU thread will be used will be problematic for most users who run CPU work from other projects.

The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work.

Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.

MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58134 - Posted: 16 Dec 2021 | 18:20:00 UTC

given the current state of these beta tasks, I have done the following on my 7xGPU 48-thread systems: allowed only 3x Python beta tasks to run, since the systems only have 64GB RAM and each process is using ~20GB.

app_config.xml

<app_config>
    <app>
        <name>acemd3</name>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <cpu_usage>5.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>


will see how it works out when more python beta tasks flow. and adjust as the project adjusts settings.

abouh, before you start releasing more beta tasks, could you give us a heads up to what we should expect and/or what you changed about them?
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58135 - Posted: 16 Dec 2021 | 18:22:58 UTC

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58136 - Posted: 16 Dec 2021 | 18:52:22 UTC - in response to Message 58135.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58137 - Posted: 16 Dec 2021 | 19:14:26 UTC - in response to Message 58136.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?

GPU utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% CPU utilization.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58138 - Posted: 16 Dec 2021 | 19:18:43 UTC - in response to Message 58137.

good to know. so what I experienced was pretty similar.

I'm sure you also had some other CPU tasks running too. I wonder if CPU utilization of the spawns would be higher if no other CPU tasks were running.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58140 - Posted: 16 Dec 2021 | 21:00:08 UTC - in response to Message 58138.

Yes primarily Universe and a few TN-Grid tasks were running also.

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58141 - Posted: 17 Dec 2021 | 10:17:36 UTC - in response to Message 58134.

I will send some more tasks later today with similar requirements to the last ones, with 32 reinforcement learning environments running in parallel (as multiprocessing spawns) for the agent to interact with.

For one job, locally I get around 15GB of system memory used, and 13%-17% utilisation per CPU, as mentioned. For the GPU, the usage fluctuates between low use (5%-10%) during the phases in which the agent collects data from the environments, and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).

I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in the tasks.

____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58143 - Posted: 17 Dec 2021 | 16:48:28 UTC - in response to Message 58141.

I got 3 of them just now. all failed with tracebacks after several minutes of run time. seems like there are still some coding bugs in the application. all wingmen are failing similarly:

https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525


GPU (2080Ti) was loaded at ~10-13% utilization, but at base clocks of 1350MHz and only ~65W power draw. GPU memory load was 2-4GB. System memory reached ~25GB utilization while 2 tasks were running at the same time. CPU thread utilization was ~25-30% across all 48 threads (EPYC 7402P); it didn't cap at 32, about twice as much CPU utilization as expected, but maybe that's due to the relatively low clock speed @ 3.35GHz. (I paused other CPU processing during this time.)
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58144 - Posted: 17 Dec 2021 | 16:54:43 UTC - in response to Message 58143.
Last modified: 17 Dec 2021 | 16:58:05 UTC

the new one I just got seems to be doing better. less CPU use, and it looks like i'm seeing the mentioned 60-80% spikes on the GPU occasionally.

this one succeeded on the same host as the above three.

https://gpugrid.net/workunit.php?wuid=27102535
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58145 - Posted: 17 Dec 2021 | 17:21:35 UTC - in response to Message 58144.
Last modified: 17 Dec 2021 | 17:26:54 UTC

I normally test the jobs locally first, and then run a couple of small batches of tasks on GPUGrid in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58146 - Posted: 17 Dec 2021 | 18:11:26 UTC

I must be crunching one of the fixed second batch currently on this daily driver. Seems to be progressing nicely.

Using about 17GB of system memory, and the GPU utilization spikes up to 97% every once in a while, with periods mostly spent around 12-17% and some brief spikes around 42%.

I got one of the first batch on another host; it failed fast with similar errors, as did all the wingmen.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58147 - Posted: 17 Dec 2021 | 19:29:02 UTC

these new ones must be pretty long.

been running almost 2 hours now, and with a lot higher VRAM use: over 6GB per task. will GPUs with less than 6GB have issues?

but it also seems that some of the system memory used can be shared. running 1 task shows ~17GB system mem use, but running 5x tasks shows about 53GB system mem use. that's as far as I'll push it on my 64GB machines.
____________

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,668,256,804
RAC: 4,936,549
Message 58148 - Posted: 17 Dec 2021 | 21:08:46 UTC
Last modified: 17 Dec 2021 | 21:09:41 UTC

I got my first one of the Python WUs, and am a little concerned. After 3.25 hours it is only 10% complete. GPU usage seems to be about what you all are saying, and same with CPU. However, I only have 8 cores/16 threads, with 6 other CPU work units running (TN-Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16 GB RAM)

Keith Myers
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Message 58149 - Posted: 17 Dec 2021 | 23:27:43 UTC - in response to Message 58148.

I don't think BOINC knows how to handle interpreting the estimated run_times of these Python tasks. I wouldn't worry about it.

I am over 6 1/2 hours now on this daily driver with 10% still showing. I bet they never show anything BUT 10% done until they finish.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58150 - Posted: 18 Dec 2021 | 0:09:18 UTC - in response to Message 58149.

I had the same feeling, Keith
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Message 58151 - Posted: 18 Dec 2021 | 0:14:15 UTC
Last modified: 18 Dec 2021 | 0:15:02 UTC

also, those of us running these should probably prepare for VERY low credit reward.

This is something I have observed for a long time with beta tasks here. There seems to be some kind of anti-cheat mechanism (or bug) built into BOINC when using the default credit reward scheme (based on flops): if the calculated credit reward is over some value, the reward gets defaulted to a very low value. Since these are so long-running, and beta, I fully expect to see this happen. I've reported this behavior in the past.

It would be a nice surprise if not, but I have a strong feeling it'll happen.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58152 - Posted: 18 Dec 2021 | 1:14:41 UTC - in response to Message 58151.

I got one task early on that rewarded more than reasonable credit.
But the last one was way low, though I thought I read a post from @abouh saying he had made a mistake in the credit award algorithm and had corrected it.
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#58124

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58153 - Posted: 18 Dec 2021 | 2:36:47 UTC - in response to Message 58152.
Last modified: 18 Dec 2021 | 3:02:51 UTC

That task was short, though. The threshold is around 2 million credit if I remember correctly.

I posted about it in the team forum almost exactly a year ago. I don't want to post details publicly because it could encourage cheating. But for a long time, credit reward for the beta tasks has been inconsistent and not calculated fairly, IMO. Because the credit reward was so high, I noticed a trend: whenever the reward was supposed to be high enough (extrapolating the runtime against the expected reward), it triggered a very low value instead. This only happened on long-running (and hence potentially high-reward) tasks. Since these tasks are so long, I just think there's a possibility we'll see that again.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58154 - Posted: 18 Dec 2021 | 4:53:29 UTC - in response to Message 58151.
Last modified: 18 Dec 2021 | 5:23:09 UTC

confirmed.

Keith you just reported this one.

http://www.gpugrid.net/result.php?resultid=32731284

That value of 34,722.22 is the exact same "penalty value" I noticed a year ago, for 11hrs worth of work (clock time) and 28hrs of "CPU time". Interesting that the multithreaded nature of these tasks inflates the run time so much.

Extrapolating from your successful run that did not hit a penalty, I'd guess that any task longer than about 2.5hrs is going to hit the penalty value. They really should just use the same credit scheme as acemd3, or assign static credit scaled to the expected runtime, as long as all of the tasks are about the same size.

BOINC documentation confirms my suspicions about what's happening.

https://boinc.berkeley.edu/trac/wiki/CreditNew

Peak FLOP Count

This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.

When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.

When a client finishes a job and reports its elapsed time T, we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

The credit for a job J is typically proportional to PFC(J), but is limited and normalized in various ways.

Notes:

PFC(J) is not reliable; cheaters can falsify elapsed time or device attributes.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.


One-time cheats

For example, claiming a PFC of 1e304.

This is handled by the sanity check mechanism, which grants a default amount of credit and treats the host with suspicion for a while.
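A toy model of that sanity check, with the threshold and default values guessed from the observations in this thread (this is not actual BOINC server code; all names and figures are illustrative):

```python
# Toy model of the CreditNew behaviour quoted above. NOT actual BOINC
# server code; SANITY_THRESHOLD and DEFAULT_CREDIT are guesses based on
# what has been observed in this thread.

COBBLESTONES_PER_FLOP = 200 / 86400e9   # 200 credits per GFLOPS-day
SANITY_THRESHOLD = 2_000_000            # hypothetical cap on claimed credit
DEFAULT_CREDIT = 34_722.22              # the "penalty value" seen in this thread

def claimed_credit(elapsed_seconds: float, peak_flops: float) -> float:
    """Credit proportional to the peak FLOP count PFC(J) = T * peak_flops(J)."""
    pfc = elapsed_seconds * peak_flops
    return pfc * COBBLESTONES_PER_FLOP

def granted_credit(elapsed_seconds: float, peak_flops: float) -> float:
    """One-time-cheat sanity check: implausibly large claims get a default."""
    claim = claimed_credit(elapsed_seconds, peak_flops)
    return DEFAULT_CREDIT if claim > SANITY_THRESHOLD else claim

# e.g. 11 h on a host the scheduler rates at 100 TFLOPS across all
# allocated devices (an illustrative figure) claims ~9.2M credits,
# which trips the check and returns the default.
```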

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58155 - Posted: 18 Dec 2021 | 6:29:56 UTC

Yep, I saw that. Same credit as before and now I remember this bit of code being brought up before back in the old Seti days.

@Abouh needs to be made aware of this and assign fixed credit as what they do with acemd3.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 49,823
Level
Trp
Scientific publications
watwatwat
Message 58157 - Posted: 18 Dec 2021 | 16:30:01 UTC
Last modified: 18 Dec 2021 | 16:45:56 UTC

Awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN & TN-Grid WUs running with CPU use flat-lined at 100%. Suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20 to 40%. Then I tried a couple of scenarios.
1. Rig-44 has an i9-9980XE 18c36t 32 GB with 16 GB swap file, SSD, and 2 x 2080 Ti's. The GPU use is so low I switched GPU usage to 0.5 for both OPNG and PG and reread config files. OPNG WUs started running and have all been reported fine. PG WUs kept running. Then I started adding back in gene_pcim WUs. When I exceeded 4 gene_pcim WUs the CPU use bands changed shape in a similar way to Rig-24 with a tight band around 30% and a number of curves bouncing off 100%.

2. Rig-26 has an E5-2699 22c44t 32 GB with 16 GB swap (yet to be used), SSD, and a 2080 Ti. I've added back 24 gene_pcim WUs and the CPU use band has moved up to 40-80% with no peaks hitting 100%. Next I changed GPU usage to 0.5 for both OPNG and PG and reread config files. Both seem to be running fine.

3. Rig-24 has an i7-6980X 10c20t 32 GB with a 16 GB swap file, SSD, and a 2080 Ti. This one has been running for 17 hours so far with the last 2 hours having all other CPU work suspended. Its CPU usage graph looks different. There's a tight band oscillating about 20% with a single band oscillating from 60 to 90%. Since PG wants 32 CPUs and this CPU only has 20 there's a constant queue for hyperthreading to feed in. I'll let this one run by itself hoping it finishes soon.

Note: TN-Grid usually runs great in Resource Zero Mode where it rarely ever sends more than one extra WU. With PG running and app_config reducing the max running WUs TN-Grid just keeps sending more WUs. Up to 280 now.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58158 - Posted: 18 Dec 2021 | 17:03:32 UTC - in response to Message 58157.
Last modified: 18 Dec 2021 | 17:11:37 UTC

I did something similar with my two 7xGPU systems.

limited to 5 tasks concurrently.

and set the app_config files up so each GPU would run either 3x Einstein, OR 1x Einstein + 1x GPUGRID, since the resources used by the two are complementary.

set GPUGRID to 0.6 for GPU use (prevents two from running on the same GPU, 0.6+0.6 >1.0)
set Einstein to 0.33 for GPU use (allows three to run on a single GPU or one GPUGRID + one Einstein, 0.33+0.33+0.33<1.0, 0.6+0.33<1.0)
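For reference, the GPUGRID side of such a setup could look roughly like this in app_config.xml (the 0.6 fraction and the 5-task cap match the settings described above; the app name comes from the workunit descriptions, and the cpu_usage value is illustrative):

```xml
<app_config>
    <app>
        <name>PythonGPU</name>
        <max_concurrent>5</max_concurrent>
        <gpu_versions>
            <gpu_usage>0.6</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>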

But running 5 tasks on a system with 64GB of system memory was too ambitious: RAM use was initially OK, but grew to fill system RAM and swap (default 2GB).

if these tasks become more common and plentiful, I might consider upgrading these 7xGPU systems to 128GB RAM so that they can handle running on all GPUs at the same time, but not going to bother if the project decides to reduce the system requirements or these pop up very infrequently.

The low credit reward per unit time, caused by the BOINC credit fail-safe default value, should be fixed though. Not many people will have much incentive to test the beta tasks at 10-20x less credit per unit time.

Oh, and these don't checkpoint properly (they checkpoint once, very early on). If you pause a task that's been running for 20hrs, it restarts from that first checkpoint made 20hrs ago.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58161 - Posted: 20 Dec 2021 | 10:29:54 UTC
Last modified: 20 Dec 2021 | 13:55:24 UTC

Hello everyone,

The batch I sent on Friday was completed successfully, even though some jobs initially failed several times and got reassigned.

I went through all failed jobs. Here I summarise some errors I have seen:

1. Detected multiple CUDA out-of-memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
2. Conda environment conflicts with the package pinocchio. I talked about this one in a previous post; it requires resetting the app.
3. ´INTERNAL ERROR: cannot create temporary directory!´ - I understand this one could be due to a full disk.

Also, based on the feedback I will work on fixing the following things before the next batch:

1. Checkpoints will be created more often during training, so jobs can be restarted without going back to the beginning.
2. Credits assigned. The idea is to progressively increase the credits until the credit return becomes similar to that of the acemd jobs. However, devising a general formula to calculate them is more complex in this case. For now it is based on the total amount of data samples gathered from the environments and used to train the AI agent, which does not take into account the size of the agent's neural networks. We will keep credits fixed for now, but adjusting them might be necessary to solve other problems.

Finally, I think I was a bit too ambitious regarding the total amount of training per job. I will break jobs in two, so they don't take as long to complete.
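On point 1, a minimal sketch of periodic, resumable checkpointing (pure Python for illustration only; the real tasks would save network weights, e.g. with torch.save, rather than a JSON dict, and all names and the interval here are assumptions):

```python
import json
import os
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so a killed task never sees a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

CHECKPOINT_EVERY = 10  # steps between checkpoints; illustrative value

def train(total_steps: int, path: str = "checkpoint.json") -> dict:
    state = {"step": 0}
    if os.path.exists(path):          # resume instead of restarting from zero
        with open(path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        state["step"] += 1            # stands in for one training iteration
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state
```

A restarted task then loses at most CHECKPOINT_EVERY steps of work instead of everything since the first checkpoint.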
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58162 - Posted: 20 Dec 2021 | 14:55:18 UTC - in response to Message 58161.

thanks!

I did notice that all of mine failed with 'exceeded time limit'.

It might be a good idea to increase the estimated flops size of these tasks so BOINC knows they are large and will run for a long time.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58163 - Posted: 20 Dec 2021 | 16:44:12 UTC - in response to Message 58161.

1. Detected multiple CUDA out-of-memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.

I've tried setting preferences on all my hosts with GPUs of less than 6GB RAM so as not to receive the Python Runtime (GPU, beta) app:

Run only the selected applications
ACEMD3: yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): no

If no work for selected applications is available, accept work from other applications?: no

But I've still received one more Python GPU task on one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...

Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0

RuntimeError: CUDA out of memory.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,596,105,840
RAC: 24,899,084
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58164 - Posted: 20 Dec 2021 | 17:12:00 UTC - in response to Message 58163.

This makes me doubt whether GPUGRID preferences are currently working as intended...

My question is a different one: now that the GPUGRID team is concentrating on Python, will no more ACEMD tasks come?

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 5,379,024,525
RAC: 29,979,635
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58166 - Posted: 20 Dec 2021 | 18:21:34 UTC - in response to Message 58163.

But I've still received one more Python GPU task on one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...


I had the same problem; you need to set 'Run test applications' to No.
It looks like having that set to Yes will override any specific application setting you make.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58167 - Posted: 20 Dec 2021 | 19:26:34 UTC - in response to Message 58166.

Thanks, I'll try

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58168 - Posted: 20 Dec 2021 | 19:53:57 UTC - in response to Message 58164.

This makes me doubt whether GPUGRID preferences are currently working as intended...

My question is a different one: now that the GPUGRID team is concentrating on Python, will no more ACEMD tasks come?

Hard to say. Toni and Gianni both stated the work would be very limited and infrequent until they can fill the new PhD positions.

But there have been occasional "drive-by" drops of cryptic scout work I've noticed along with the occasional standard research acemd3 resend.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,596,105,840
RAC: 24,899,084
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58169 - Posted: 21 Dec 2021 | 5:52:18 UTC - in response to Message 58168.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Would be great if they work on Windows, too :-)

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58170 - Posted: 21 Dec 2021 | 9:56:28 UTC - in response to Message 58168.

Today I will send a couple of batches with short tasks for some final debugging of the scripts and then later I will send a big batch of debugged tasks.

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58171 - Posted: 21 Dec 2021 | 9:57:51 UTC - in response to Message 58169.

The idea is to make it work on Windows in the future as well, once it works smoothly on Linux.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58172 - Posted: 21 Dec 2021 | 15:44:20 UTC - in response to Message 58170.

Thanks! Looks like they are small enough to fit on a 16GB system now, using about 12GB.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58173 - Posted: 21 Dec 2021 | 16:47:02 UTC - in response to Message 58172.

Thanks! Looks like they are small enough to fit on a 16GB system now, using about 12GB.


not sure what happened to it. take a look.

https://gpugrid.net/result.php?resultid=32731651
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58174 - Posted: 21 Dec 2021 | 17:16:54 UTC - in response to Message 58173.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58175 - Posted: 21 Dec 2021 | 18:15:03 UTC - in response to Message 58174.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.


It's interesting, looking at the stderr output: it appears that this app is communicating over the internet, sending and receiving data outside of BOINC, to servers that don't belong to the project.

(I think the issue is that I was connected to my VPN checking something else, left the connection active, and it might have had trouble reaching the site it was trying to access.)

Not sure how kosher that is. I don't think the BOINC devs intend or desire this kind of behavior, and some people might have security concerns about the app doing these things outside of BOINC. It might be cleaner to do all communication only between the host and the project, and only via the BOINC framework; if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend.

just my .02
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 49,823
Level
Trp
Scientific publications
watwatwat
Message 58176 - Posted: 21 Dec 2021 | 18:44:13 UTC - in response to Message 58161.

1. Detected multiple CUDA out-of-memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,596,105,840
RAC: 24,899,084
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58177 - Posted: 21 Dec 2021 | 18:58:56 UTC - in response to Message 58171.

The idea is to make it work for Windows in the future as well, once it works smoothly on linux.

okay, sounds good; thanks for the information

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58178 - Posted: 21 Dec 2021 | 19:12:20 UTC

I'm running one of the new batch, and at first the task was only using 2.2GB of GPU memory, but now it has climbed back up to 6.6GB.

Much like the previous ones. I thought the memory requirements were going to be cut in half.

It's consuming the same amount of system memory as before... maybe a couple of GB more, in fact. Up to 20GB now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 49,823
Level
Trp
Scientific publications
watwatwat
Message 58179 - Posted: 21 Dec 2021 | 21:21:09 UTC

Just had one that's listed as "aborted by user." I didn't abort it.
https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?

What's this? Something I need to do or not?
"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58180 - Posted: 21 Dec 2021 | 23:12:16 UTC

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

That error on 4 tasks right around 55 minutes on 3080Ti

The same PC/GPU has completed Python tasks before: one earlier ran for 1900 seconds, and it is running one now at 9hr. Utilization is around 2-3% with 6.5GB memory in nvidia-smi, 6.1GB in BOINC.

The 3070Ti has been running for 7:45 with 8% utilization and the same memory usage.
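For what it's worth, the allocator knob named in that out-of-memory message can be set through the environment before starting the BOINC client; whether the wrapper passes it through to the task is untested here, and the value is only illustrative, not a recommendation:

```shell
# Ask PyTorch's caching allocator to split large blocks, which can reduce
# fragmentation (the max_split_size_mb knob named in the error above).
# 128 MiB is an illustrative value; tune or remove if it doesn't help.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```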

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58181 - Posted: 22 Dec 2021 | 1:34:01 UTC - in response to Message 58179.

The ray errors are normal and can be ignored.
I completed one of the new tasks successfully. The one I commented on before.
14 hours of compute time.

I had another one that completed successfully but the stderr.txt was truncated and does not show the normal summary and boinc finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58182 - Posted: 22 Dec 2021 | 1:40:18 UTC - in response to Message 58176.

1. Detected multiple CUDA out-of-memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

I'm not doing anything at all in mitigation for the Python on GPU tasks other than to only run one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58183 - Posted: 22 Dec 2021 | 9:29:54 UTC - in response to Message 58178.
Last modified: 22 Dec 2021 | 9:30:08 UTC

What was halved was the amount of agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58184 - Posted: 22 Dec 2021 | 9:37:48 UTC - in response to Message 58175.
Last modified: 22 Dec 2021 | 9:43:47 UTC

During the task, the performance of the agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It helps immensely in understanding the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way.

wandb provides a Python package extensively used in machine learning research, which we import in our scripts for this purpose.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58185 - Posted: 22 Dec 2021 | 9:43:04 UTC - in response to Message 58176.

Pinocchio probably only caused problems on a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58186 - Posted: 22 Dec 2021 | 10:07:45 UTC

My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660.

They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website: both the Run time and CPU time columns show the exact same value, and it's too large to be feasible. Task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return.

Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run. I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time.

I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58187 - Posted: 22 Dec 2021 | 11:25:13 UTC - in response to Message 58183.

What was halved was the amount of Agent training per task, and therefore the total amount of time required to completed it.

The GPU memory and system memory will remain the same in the next batches.


Halved? I've got one at nearly 21.5 hours on a 3080Ti and still going

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58188 - Posted: 22 Dec 2021 | 15:39:07 UTC

This shows the timing discrepancy, a few minutes before task 32731655 completed.



The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58189 - Posted: 22 Dec 2021 | 15:47:48 UTC - in response to Message 58188.

I still think the 5,000,000 GFLOPs count is far too low. These run for 12-24hrs depending on host (GPU speed does not seem to be a factor, since GPU utilization is so low; they are most likely CPU/memory bound), and there seems to be quite a discrepancy in run time per task: I had a task run for 9hrs on my 3080Ti, while another user reports 21+ hrs on his 3080Ti. I've also had several tasks get killed around 12hrs for 'exceeded time limit', while others ran longer. Lots of inconsistencies here.

The low flops count is causing a lot of tasks to be killed prematurely by BOINC for 'exceeded time limit' when they would eventually have completed. The fact that they don't proceed past 10% completion until the end probably doesn't help.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58190 - Posted: 22 Dec 2021 | 16:27:52 UTC - in response to Message 58189.

Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project.

The screen shot also shows how the 'remaining time' estimate gets screwed up when the running value reaches something like 10 hours at 10%. Roll on intermediate progress reports and checkpoints.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58191 - Posted: 22 Dec 2021 | 17:05:06 UTC
Last modified: 22 Dec 2021 | 17:05:49 UTC

My system that completed a few tasks had a DCF of 36+.

Checkpointing also still isn't working: I had some tasks running for ~3hrs, restarted BOINC, and they restarted at 5 minutes.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58192 - Posted: 22 Dec 2021 | 18:52:57 UTC - in response to Message 58191.

checkpointing also still isn't working.

See my screenshot.

"CPU time since checkpoint: 16:24:44"

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58193 - Posted: 22 Dec 2021 | 18:59:00 UTC

I've checked a sched_request when reporting.

<result>
<name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
<final_cpu_time>55983.300000</final_cpu_time>
<final_elapsed_time>36202.136027</final_elapsed_time>

That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single core GPU task, but not right for a task with multithreaded CPU elements.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58194 - Posted: 23 Dec 2021 | 10:07:59 UTC - in response to Message 58187.

As Ian&Steve C. mentioned, GPU speed only partially influences task completion time.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

In the last batch, I reduced the total amount of agent-environment interactions gathered and processed before ending the task with respect to the previous batch, which should have reduced the completion time.
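The alternating loop described above can be sketched as follows. This is a toy illustration, not the project's actual code: the real agent is a neural network updated on the GPU by a proper RL algorithm, whereas here a simple running average stands in for the "learn" phase.

```python
import random

def interact(env_steps: int) -> list:
    """CPU-bound phase: step the environments and collect rewards."""
    return [random.random() for _ in range(env_steps)]

def learn(batch: list, estimate: float, lr: float = 0.1) -> float:
    """GPU-bound phase in the real tasks: update the agent from the batch.
    Here it is just an exponential moving average of the rewards."""
    for reward in batch:
        estimate += lr * (reward - estimate)
    return estimate

def train(iterations: int, env_steps: int = 64) -> float:
    """Alternate collection and learning, as described above."""
    value_estimate = 0.0
    for _ in range(iterations):
        batch = interact(env_steps)                    # interact with envs
        value_estimate = learn(batch, value_estimate)  # process collected data
    return value_estimate
```

This structure also explains why GPU utilization looks low from the outside: the GPU sits mostly idle during each collection phase.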
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58195 - Posted: 23 Dec 2021 | 10:09:32 UTC
Last modified: 23 Dec 2021 | 10:19:03 UTC

I will look into the reported issues before sending the next batch, to see if I can find a solution both for the problem of jobs being killed due to "exceeded time limit" and for the progress and checkpointing problems.

From what Ian&Steve C. mentioned, I understand that increasing the estimated computation size, however BOINC calculates it, could solve the problem of jobs being killed?

Thank you very much for your feedback. Happy holidays to everyone!
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58196 - Posted: 23 Dec 2021 | 13:16:56 UTC - in response to Message 58195.

From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

The jobs reach us with a workunit description:

<workunit>
<name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
<app_name>PythonGPU</app_name>
<version_num>401</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
<open_name>run.py</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
<open_name>input.zip</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
<open_name>requirements.txt</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
<open_name>input</open_name>
<copy_file/>
</file_ref>
</workunit>

It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that.

The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running.
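The arithmetic can be sketched like this (illustrative only: the function names are mine, and the 7 TFLOPS device rating is an assumed example, not a value from the workunit):

```python
# Sketch of the duration arithmetic described above. The 7e12 FLOPS device
# rating is an assumed example value, not something from the workunit.
def estimated_seconds(rsc_fpops_est, device_flops):
    # BOINC's initial runtime estimate: total FLOPs / device speed
    return rsc_fpops_est / device_flops

def kill_seconds(rsc_fpops_bound, device_flops):
    # elapsed-time limit at which the task is killed
    return rsc_fpops_bound / device_flops

device_flops = 7e12  # assumed ~7 TFLOPS GPU rating
print(round(estimated_seconds(5e15, device_flops)))  # ~714 s initial estimate
print(round(kill_seconds(2.5e17, device_flops)))     # ~35714 s hard limit (50x)
```

If the dominant factor is really the CPU, the effective FLOPS figure is far lower than the GPU rating, so the real runtime overshoots the estimate and can hit the bound.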

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58197 - Posted: 23 Dec 2021 | 15:57:01 UTC - in response to Message 58196.
Last modified: 23 Dec 2021 | 21:34:36 UTC

I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server.

Also, I checked the progress and the checkpointing problems. They were caused by format errors.

The python scripts were logging the progress into a "progress.txt" file but apparently BOINC wants just a file "progress" without extension.

Similarly, checkpoints were being generated, but were not identified correctly since they were not called "restart.chk".

I will work on fixing these issues before the next batch of tasks.
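As a sketch of the fix (the two file names, "progress" and "restart.chk", are the ones described above; everything else here, including the checkpoint contents, is illustrative):

```python
# Illustrative sketch only: the file names "progress" (no extension) and
# "restart.chk" are what the wrapper expects per the text above; the rest
# (checkpoint fields, atomic-rename pattern) is my own.
import json
import os

def report_progress(fraction):
    # a file literally named "progress", holding a fraction in [0, 1]
    with open("progress.tmp", "w") as f:
        f.write(f"{fraction:.4f}")
    os.replace("progress.tmp", "progress")  # atomic: no half-written reads

def save_checkpoint(state):
    with open("restart.chk.tmp", "w") as f:
        json.dump(state, f)
    os.replace("restart.chk.tmp", "restart.chk")

report_progress(0.25)
save_checkpoint({"iter": 120})
print(open("progress").read())  # 0.2500
```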
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58198 - Posted: 23 Dec 2021 | 19:35:37 UTC - in response to Message 58197.

Thanks @abouh for working with us in debugging your application and work units.

Nice to have an attentive and easy-to-work-with researcher.

Looking forward to the next batch.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58200 - Posted: 23 Dec 2021 | 21:20:01 UTC - in response to Message 58194.

Thank you for your kind support.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

This behavior can be seen at some tests described at my Managing non-high-end hosts thread.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58201 - Posted: 24 Dec 2021 | 10:02:52 UTC

I just sent another batch of tasks.

I tested locally and the progress and the restart.chk files are correctly generated and updated.

rsc_fpops_est job parameter should be higher too now.

Please let us know if you think the success rate of tasks can be improved in any other way. Thanks a lot for your help.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58202 - Posted: 24 Dec 2021 | 10:35:31 UTC - in response to Message 58201.

I just sent another batch of tasks.

Thank you very much for this kind of Christmas present!

Merry Christmas to everyone crunchers worldwide 🎄✨

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58203 - Posted: 24 Dec 2021 | 11:38:42 UTC
Last modified: 24 Dec 2021 | 12:09:40 UTC

1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough!

I'll watch this one through, but after that I'll be away for a few days - happy holidays, and we'll pick up again on the other side.

Edit: Progress %age jumps to 10% after the initial unpacking phase, then increments in 0.9% steps. That'll do.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58204 - Posted: 24 Dec 2021 | 12:51:06 UTC - in response to Message 58201.

I tested locally and the progress and the restart.chk files are correctly generated and updated.
rsc_fpops_est job parameter should be higher too now.

In a preliminary look at one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has been raised to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove.
- Checkpointing also seems to be working, with a checkpoint stored about every two minutes.
- The learning cycle period has dropped to 11 seconds, from the 21 seconds observed on the previous task (per sudo nvidia-smi dmon).
- GPU dedicated RAM usage seems to have been reduced, but I don't know whether it is enough for running on 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This extrapolates to a total execution time of about 7 hours and 41 minutes on my Host #569442

Well done!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58208 - Posted: 24 Dec 2021 | 16:43:12 UTC

Same observed behavior. Gpu memory halved, progress indicator normal and GFLOPS in line with actual usage.

Well done.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58209 - Posted: 24 Dec 2021 | 17:38:21 UTC - in response to Message 58204.

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

I'm answering my own question: I enabled Python GPU task requests on my GTX 1650 SUPER 4 GB system, and I happened to catch this previously failed task e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1
This task has passed the initial processing steps and has reached the learning cycle phase.
At this point, memory usage is right at the limit of the 4 GB of available GPU RAM.
Waiting to see whether this task succeeds or not.
System RAM usage is still very high:
99% of the 16 GB available RAM on this system is currently in use.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58210 - Posted: 24 Dec 2021 | 22:56:33 UTC - in response to Message 58204.

- Currrent progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28,9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes at my Host #569442

That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with

<result>
<name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name>
<final_cpu_time>59637.190000</final_cpu_time>
<final_elapsed_time>39080.805144</final_elapsed_time>

That's very similar (and on the same machine) to the one I reported in message 58193. So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about).
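One way to reconcile the early ~7.7 h estimate with the ~10.9 h actual runtime (my arithmetic, using the figures quoted above; the 10% setup offset is an assumption based on the observed initial progress jump):

```python
# Two extrapolations from "28.9% after 2 h 13 min": naive linear scaling
# vs. treating the first 10% as a fast setup phase, so that only the
# 10%..100% span tracks compute time. Assumption: setup time is negligible.
def naive_total_h(elapsed_s, progress):
    return elapsed_s / progress / 3600

def offset_total_h(elapsed_s, progress, setup=0.10):
    return elapsed_s / ((progress - setup) / (1 - setup)) / 3600

elapsed = 2 * 3600 + 13 * 60  # 28.9% reached after 2 h 13 min
print(round(naive_total_h(elapsed, 0.289), 1))   # 7.7 h  (the early estimate)
print(round(offset_total_h(elapsed, 0.289), 1))  # 10.6 h (close to the 39081 s actual)
```

The offset model lands much nearer the reported final_elapsed_time, which fits the "progress isn't quite linear" reading.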

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58218 - Posted: 29 Dec 2021 | 8:31:14 UTC

Hello,

reviewing which jobs failed in the last batches I have seen several times this error:

21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[152341] INTERNAL ERROR: cannot create temporary directory!
[152345] INTERNAL ERROR: cannot create temporary directory!
21:28:08 (152316): /usr/bin/flock exited; CPU time 0.147100
21:28:08 (152316): app exit status: 0x1
21:28:08 (152316): called boinc_finish(195


I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58219 - Posted: 29 Dec 2021 | 9:15:02 UTC - in response to Message 58218.

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?

Right.
I gave a step-by-step solution based on Richard Haselgrove finding at my Message #55986
It worked fine for all my hosts.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58220 - Posted: 29 Dec 2021 | 9:26:29 UTC - in response to Message 58219.

Thank you!
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58221 - Posted: 29 Dec 2021 | 10:38:21 UTC

Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017

"During handling of the above exception, another exception occurred:"

"ValueError: probabilities are not non-negative"

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58222 - Posted: 29 Dec 2021 | 16:57:53 UTC

it seems checkpointing still isn't working correctly.

despite BOINC "claiming" that it checkpointed X seconds ago, stopping BOINC and restarting shows that it's not resuming from the checkpoint.

The task I currently have in progress was ~20% complete. I stopped BOINC and restarted; it retained the times (elapsed and CPU time), but progress reset to 10%.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58223 - Posted: 29 Dec 2021 | 17:40:37 UTC - in response to Message 58222.

I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58225 - Posted: 29 Dec 2021 | 23:05:12 UTC

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

Two of my hosts with 4 GB dedicated RAM GPUs have succeeded on their latest Python GPU tasks so far.
If GPU RAM requirements are planned to stay this way, it opens the app to a much greater number of hosts.

Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGrid tasks in BOINC Manager... Why?
This host system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
I roughly estimate that the environment for each Python task takes about 16 GB of system RAM.
I guess that an eventual third concurrent task might have crashed itself, or even crashed all three Python tasks, due to lack of system RAM.
I was watching the Psensor readings when the first of the two Python tasks finished, and the free system memory drastically increased again, from 1% to 38%.

I also took an nvidia-smi screenshot, where it can be seen that the two Python tasks were running on GPU 0 and GPU 1 respectively, while GPU 2 was processing a PrimeGrid CUDA GPU task.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58226 - Posted: 29 Dec 2021 | 23:24:23 UTC - in response to Message 58225.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58227 - Posted: 30 Dec 2021 | 14:40:09 UTC - in response to Message 58222.

Regarding the checkpointing problem, the approach I follow is to check the progress file (if it exists) at the beginning of the python script and then continue the job from there.


I have tested locally stopping the task and executing the python script again, and it continues from the same point where it stopped. So the script seems correct.


However, I think that right after setting up the conda environment, the progress is automatically set to 10% before my script runs, so I am guessing this is what is causing the problem. I have modified my code not to rely only on the progress file, since it might be overwritten to 10% after every conda setup.
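A hedged sketch of that change (field and function names here are my own, not the actual script's): trust the checkpoint file for the resume point instead of the progress file, which the wrapper can reset to 10% after the conda setup step.

```python
# Sketch: derive the resume point from restart.chk rather than the
# progress file. The "iter" field name is illustrative, not the real one.
import json
import os

def load_start_iter(path="restart.chk", default=0):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f).get("iter", default)
    return default

# first run, no checkpoint yet:
print(load_start_iter("no_such_file.chk"))  # 0
```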
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58228 - Posted: 30 Dec 2021 | 22:35:23 UTC - in response to Message 58226.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.


The last two tasks on my system with a 3080Ti ran concurrently and completed successfully.
https://www.gpugrid.net/results.php?hostid=477247

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58248 - Posted: 6 Jan 2022 | 9:01:57 UTC

Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today):

"wandb: Waiting for W&B process to finish, PID 334655... (failed 1). Press ctrl-c to abort syncing."

"ValueError: demo dir contains more than ´total_buffer_demo_capacity´"

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58249 - Posted: 6 Jan 2022 | 10:01:11 UTC
Last modified: 6 Jan 2022 | 10:20:07 UTC

One user mentioned that he could not solve the error

INTERNAL ERROR: cannot create temporary directory!


This is the configuration he is using:

### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

PrivateTmp=true

### Lines below this comment will be discarded

### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target


I was just wondering if there is any possible reason why it should not work
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58250 - Posted: 6 Jan 2022 | 12:01:13 UTC - in response to Message 58249.

I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17

The full, unmodified, contents of the file are

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is.

We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58251 - Posted: 6 Jan 2022 | 12:08:17 UTC - in response to Message 58249.

A simpler answer might be

### Lines below this comment will be discarded

so the file as posted won't do anything at all - in particular, it won't run BOINC!
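For what it's worth, a minimal override along these lines should work (hedged: the key point, as noted above, is that PrivateTmp=true must sit under a [Service] heading for systemd to apply it; create the file with "sudo systemctl edit boinc-client.service"):

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
[Service]
PrivateTmp=true
```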

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58253 - Posted: 7 Jan 2022 | 10:27:24 UTC - in response to Message 58248.

Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it.

I will do local tests and then send a small batch of short tasks to GPUGrid to test the fixed version of the scripts before sending the next big batch.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58254 - Posted: 7 Jan 2022 | 18:13:15 UTC

Everybody seems to be getting the same error in today's tasks:

"AttributeError: 'PPODBuffer' object has no attribute 'num_loaded_agent_demos'"

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58255 - Posted: 7 Jan 2022 | 19:48:11 UTC

I believe I got one of the fixed test tasks this morning, based on the short crunch time and valid report.

No sign of the previous error.

https://www.gpugrid.net/result.php?resultid=32732671

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58256 - Posted: 7 Jan 2022 | 19:56:15 UTC - in response to Message 58255.

Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58263 - Posted: 10 Jan 2022 | 10:26:02 UTC
Last modified: 10 Jan 2022 | 10:28:12 UTC

I just sent a batch that seems to fail with

File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients
if self.iter % self.save_demos_every == 0:
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'


For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience...

I will send also a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58264 - Posted: 10 Jan 2022 | 10:38:35 UTC - in response to Message 58263.
Last modified: 10 Jan 2022 | 10:58:56 UTC

Got one of those - failed as you describe.

Also has the error message "AttributeError: 'GWorker' object has no attribute 'batches'".

Edit - had a couple more of the broken ones, but one created at 10:40:34 UTC seems to be running OK. We'll know later!

FritzB
Send message
Joined: 7 Apr 15
Posts: 12
Credit: 2,779,641,100
RAC: 1,089,373
Level
Phe
Scientific publications
wat
Message 58265 - Posted: 10 Jan 2022 | 14:09:55 UTC - in response to Message 58264.

I got 20 bad WU's today on this host: https://www.gpugrid.net/results.php?hostid=520456


Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/45 [00:00<?, ?it/s]

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "concurrent/futures/process.py", line 368, in _queue_management_worker
File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "entry_point.py", line 69, in <module>
File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
File "concurrent/futures/_base.py", line 611, in result_iterator
File "concurrent/futures/_base.py", line 439, in result
File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)

</stderr_txt>
]]>

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58266 - Posted: 10 Jan 2022 | 16:33:22 UTC - in response to Message 58264.

I errored out 12 tasks created from 10:09:55 to 10:40:06.

Those all have the batch error.

But I have 3 tasks created from 10:41:01 to 11:01:56 still running normally.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58268 - Posted: 10 Jan 2022 | 19:39:01 UTC

And two of those were the batch-error resends that have now failed.

Only 1 is still processing, which I assume is of the fixed variety. 8 hours elapsed currently.

https://www.gpugrid.net/result.php?resultid=32732855

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58269 - Posted: 10 Jan 2022 | 21:31:54 UTC - in response to Message 58268.

You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs).

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58270 - Posted: 11 Jan 2022 | 8:11:13 UTC - in response to Message 58265.
Last modified: 11 Jan 2022 | 8:11:37 UTC

I have seen this error a few times.

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


Do you think it could be due to a lack of resources? I think Linux starts killing processes if you are over capacity.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58271 - Posted: 12 Jan 2022 | 1:15:57 UTC

Might be the OOM-Killer kicking in. You would need to

grep -i kill /var/log/messages*

to check if processes were killed by the OOM-Killer.

If that is the case you would have to configure /etc/sysctl.conf to let the system be less sensitive to brief out of memory conditions.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58272 - Posted: 12 Jan 2022 | 8:56:21 UTC

I Googled the error message, and came up with this stackoverflow thread.

The problem seems to be specific to Python, and arises when running concurrent modules. There's a quote from the Python manual:

"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."

Other search results may provide further clues.
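In practice that requirement looks like the sketch below (my own stand-in function names): pool creation must sit behind an `if __name__ == "__main__"` guard so worker processes can re-import the module without re-executing the pool setup.

```python
# Minimal sketch of the ProcessPoolExecutor main-guard pattern the Python
# manual describes. "rollout" is a stand-in for one environment step.
from concurrent.futures import ProcessPoolExecutor

def rollout(seed):
    # stand-in for one environment interaction
    return seed * 2

def main():
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(rollout, range(4)))

if __name__ == "__main__":
    print(main())  # [0, 2, 4, 6]
```

Without the guard (or when the entry point isn't importable), worker startup can fail and surface as exactly this kind of BrokenProcessPool error.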

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58273 - Posted: 12 Jan 2022 | 15:11:50 UTC - in response to Message 58272.
Last modified: 12 Jan 2022 | 15:24:12 UTC

Thanks! Out of the possible explanations listed in the thread, I suspect the OS is killing the threads due to a lack of resources. It could be not enough RAM, or maybe Python raises this error when the ratio of processes to cores is high? (I have seen some machines with 4 CPUs, and each task spawns 32 reinforcement learning environments.)

All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs, and these errors always occur on the same hosts. So it is something specific to those machines. I will check whether I can find a common pattern across the hosts that get this error.
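If the process-to-core ratio turns out to be the culprit, one mitigation would be to cap the worker count by the host's core count. A hedged sketch (parameter names are mine, not the actual task's):

```python
# Sketch: cap spawned environment workers by the host's core count so a
# 4-CPU machine never tries to feed 32 environments at once. The
# "envs_per_core" knob is an illustrative assumption.
import os

def pick_num_envs(requested=32, envs_per_core=2):
    cores = os.cpu_count() or 1
    return min(requested, cores * envs_per_core)

print(pick_num_envs())  # at most 32, fewer on small hosts
```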
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58274 - Posted: 12 Jan 2022 | 16:46:57 UTC
Last modified: 12 Jan 2022 | 16:55:04 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

What kernel and OS?

Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS

I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear with no memory pressures.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58275 - Posted: 12 Jan 2022 | 20:47:57 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

Same Python version as current mine.

In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts at Message #57833
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that this didn't work for him.
mmonnin kindly published an alternative way at his Message #57840

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58276 - Posted: 13 Jan 2022 | 2:31:57 UTC

I saw the prior post and was about to mention the same thing. Not sure which one works as the PC has been able to run tasks.

The recent tasks are taking a really long time
2d13h 62,2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system

2x concurrently on 3080Ti
2d12h 61.3%
2d14h 60.4%

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58277 - Posted: 13 Jan 2022 | 10:45:46 UTC - in response to Message 58274.

All jobs should use the same Python version (3.8.10); I define it in the requirements.txt file of the conda environment.
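For illustration, pinning the interpreter in a conda spec file could look like this (a minimal sketch; the exact file contents are an assumption, not the project's actual requirements.txt):

```
# Hedged sketch of a conda spec file; install with:
#   conda create -n gpugridpy --file requirements.txt
python=3.8.10
```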

Here are the specs from 3 hosts that failed with the BrokenProcessPool error:

OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

Memory:
32081.92 MB
32092.04 MB
9954.41 MB

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58278 - Posted: 13 Jan 2022 | 19:55:11 UTC

I have a failed task today involving pickle.

magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

When I was investigating the brokenprocesspool error I saw posts that involved the word pickle and the fixes for that error.

https://www.gpugrid.net/result.php?resultid=32733573

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 113,729,139
RAC: 1,556,285
Level
Cys
Scientific publications
wat
Message 58279 - Posted: 13 Jan 2022 | 21:18:41 UTC

The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, PyTorch was updated to require compute capability 3.7 within the past couple of years. The only Kepler card that has 3.7 is the Tesla K80.

From this task:


[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning:
Found GPU%d %s which is of cuda capability %d.%d.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is %d.%d.
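The version comparison behind that warning can be sketched in plain Python (the 3.7 minimum is taken from the post above; this is an illustration, not PyTorch's actual code, which queries the device via torch.cuda.get_device_capability()):

```python
# Minimum compute capability assumed from the post above.
MIN_CC = (3, 7)

def meets_min_cc(device_cc, min_cc=MIN_CC):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(device_cc) >= tuple(min_cc)

print(meets_min_cc((3, 5)))  # Tesla K20   -> False
print(meets_min_cc((3, 7)))  # Tesla K80   -> True
print(meets_min_cc((5, 0)))  # Quadro K620 -> True
```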


While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58280 - Posted: 13 Jan 2022 | 21:51:08 UTC - in response to Message 58279.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.
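That ranking behavior can be sketched roughly as follows (an illustration of the comparison described above, not BOINC's actual source; the unremembered third tie-break metric is approximated by peak FLOPS here):

```python
# Hedged sketch of BOINC-style "best GPU" selection within one vendor:
# compare compute capability first, then memory, then a final tie-breaker.
def best_gpu(gpus):
    """gpus: list of dicts with 'name', 'cc' (major, minor), 'mem_mb', optional 'flops'."""
    return max(gpus, key=lambda g: (g["cc"], g["mem_mb"], g.get("flops", 0)))

host = [
    {"name": "Tesla K20",   "cc": (3, 5), "mem_mb": 5120},
    {"name": "Quadro K620", "cc": (5, 0), "mem_mb": 2048},
]
print(best_gpu(host)["name"])  # Quadro K620: higher CC wins despite less memory
```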
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58281 - Posted: 13 Jan 2022 | 22:58:05 UTC - in response to Message 58280.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.


It's often said to be the "best" card, but it's just the 1st.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and a 1080 but just shows 2x 1070s, as the 1070 is in the 1st slot. Any check for a "best" card would come up with the 1080, or the 1070 Ti that used to be there with the 1070.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58282 - Posted: 13 Jan 2022 | 23:23:11 UTC - in response to Message 58281.
Last modified: 13 Jan 2022 | 23:23:48 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 113,729,139
RAC: 1,556,285
Level
Cys
Scientific publications
wat
Message 58283 - Posted: 14 Jan 2022 | 2:21:35 UTC - in response to Message 58280.

Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58284 - Posted: 14 Jan 2022 | 9:41:19 UTC - in response to Message 58278.

Interesting, I had seen this error once before locally, and I assumed it was due to a corrupted input file.

I have reviewed the task and it was solved by another host, but only after multiple failed attempts with this pickle error.

Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58285 - Posted: 14 Jan 2022 | 20:12:28 UTC - in response to Message 58284.

This is the document I had found about fixing the BrokenProcessPool error.

https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor

I was reading it and stumbled upon the word "pickle" and the adjective "picklable", thought it funny, and realized I had never heard that word associated with computing before.

When the latest failed task mentioned pickle in the output, it tied it right back to all the previous BrokenProcessPool errors.

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,730,241,741
RAC: 1,017,488
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58286 - Posted: 14 Jan 2022 | 20:25:49 UTC

@abouh: Thank you for PM me twice!
The Experimental Python tasks (beta) succeed miraculously on my two Linux computers (which previously produced only errors) after several restarts of the GPUGRID.net project and the latest distro update this week.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58288 - Posted: 15 Jan 2022 | 22:24:17 UTC - in response to Message 58225.

Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.
I then urgently suspended requesting Gpugrid tasks in BOINC Manager... Why?
This host system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).

After upgrading system RAM from 32 GB to 64 GB at above mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441

More details in Message #58287

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58289 - Posted: 17 Jan 2022 | 8:36:42 UTC

Hello everyone,

I have seen a new error in some jobs:


Traceback (most recent call last):
File "run.py", line 444, in <module>
main()
File "run.py", line 62, in main
wandb.login(key=str(args.wandb_key))
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
configured = _login(**kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
wlogin.configure_api_key(key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
apikey.write_key(self._settings, key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
api.clear_setting("anonymous", globally=True, persist=True)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
return self.api.clear_setting(*args, **kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
self._api = InternalApi(*self._api_args, **self._api_kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
self._settings = Settings(
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
self._global_settings.read([Settings._global_path()])
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
util.mkdir_exists_ok(config_dir)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
os.makedirs(path)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)

</stderr_txt>


It seems like the task is not allowed to create new directories inside its working directory. Just wondering if it could be some kind of configuration problem, just like the "INTERNAL ERROR: cannot create temporary directory!" for which a solution was already shared.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58290 - Posted: 17 Jan 2022 | 9:36:10 UTC - in response to Message 58289.

My question would be: what is the working directory?

The individual line errors concern

/home/boinc-client/slots/1/...

but the final failure concerns

/var/lib/boinc-client

That sounds like a mixed-up installation of BOINC: 'home' sounds like a location for a user-mode installation of BOINC, but '/var/lib/' would be normal for a service mode installation. It's reasonable for the two different locations to have different write permissions.

What app is doing the writing in each case, and what account are they running under?

Could the final write location be hard-coded, but the others dependent on locations supplied by the local BOINC installation?

Profile [VENETO] sabayonino
Send message
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58291 - Posted: 17 Jan 2022 | 12:51:27 UTC

Hi

I have the same issue regarding the BOINC directory (my BOINC dir is set up as ~/boinc).

So I cleaned up the ~/.conda directory and re-attached the GPUGrid project in the BOINC client.

Now flock detects the right running BOINC directory, but I have this errored task:

https://www.gpugrid.net/result.php?resultid=32734225

./gpugridpy/bin/python (I think this is in the boinc/slots/<N>/ folder)

The WU is running and 0.43% completed, but /home/<user>/boinc/slots/11/gpugridpy is still empty. No data are written.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58292 - Posted: 17 Jan 2022 | 15:28:21 UTC - in response to Message 58290.
Last modified: 17 Jan 2022 | 15:55:31 UTC

Right, so the working directory is

/home/boinc-client/slots/1/...


to which the script has full access. The script tries to create a directory to save the logs, but I guess it should not do it in

/var/lib/boinc-client


So I think the problem is just that the package I am using to log results by default saves them outside the working directory. Should be easy to fix.
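One possible fix, sketched here as an assumption about how the logging package (wandb, per the traceback) could be redirected: its directory locations can be overridden with environment variables before it is imported, keeping all writes inside the slot directory.

```python
import os

# Hedged sketch: the WANDB_* variables are real wandb settings, but whether
# the task sets them this way is an assumption. Point every wandb path at the
# BOINC slot directory so nothing is written outside the sandbox.
slot_dir = os.getcwd()  # the task's working directory, e.g. .../slots/1
os.environ["WANDB_DIR"] = slot_dir
os.environ["WANDB_CONFIG_DIR"] = os.path.join(slot_dir, ".config", "wandb")
os.environ["WANDB_CACHE_DIR"] = os.path.join(slot_dir, ".cache", "wandb")
# Only import wandb after the environment is prepared.
```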
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58293 - Posted: 17 Jan 2022 | 15:55:05 UTC - in response to Message 58292.

BOINC has the concept of a "data directory". Absolutely everything that has to be written should be written somewhere in that directory or its sub-directories. Everything else must be assumed to be sandboxed and inaccessible.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 10,112,111
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58294 - Posted: 17 Jan 2022 | 16:17:56 UTC - in response to Message 58282.



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now has a 1080 and a 1080 Ti, with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi, as the other BOINC-displayed GPUs have been. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58295 - Posted: 17 Jan 2022 | 21:07:22 UTC - in response to Message 58294.
Last modified: 17 Jan 2022 | 21:52:59 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now as 1080 and 1080Ti with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi and so have the other BOINC displayed GPUs. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.


what motherboard? and what version of BOINC? Your hosts are hidden, so I cannot inspect them myself. PCIe enumeration and ordering can be inconsistent on consumer boards. My server boards seem to enumerate starting from the slot furthest from the CPU socket, while most consumer boards are the opposite, with device0 at the slot closest to the CPU socket.

or do you perhaps run a locked coproc_info.xml file, this would prevent any GPU changes from being picked up by BOINC if it can't write to the coproc file.

edit:

Also, I forgot that most versions of BOINC incorrectly detect NVIDIA GPU memory: they will all max out at 4GB due to a bug in BOINC. So to BOINC, your 1080Ti has the same amount of memory as your 1080. And since the 1080Ti is still a Pascal card like the 1080, it has the same compute capability, so you're still running into identical specs across them all.

To get it to sort properly, you need to fix the BOINC code, or use a GPU with a higher or lower compute capability. Put a Turing card in the system, not in the first slot, and BOINC will pick it up as GPU0.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58296 - Posted: 18 Jan 2022 | 19:03:55 UTC

The tests continue. Just reported e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1, with final stats

<result>
<name>e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1</name>
<final_cpu_time>107668.100000</final_cpu_time>
<final_elapsed_time>46186.399529</final_elapsed_time>

That's an average CPU core count of 2.33 over the entire run - that's high for what is planned to be a GPU application. We can manage with that - I'm sure we all want to help develop and test the application for the coming research run - but I think it would be helpful to put more realistic usage values into the BOINC scheduler.
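For reference, the 2.33 figure follows directly from the two reported values:

```python
# Values taken from the result XML quoted above.
final_cpu_time = 107668.100000      # total CPU seconds across all threads
final_elapsed_time = 46186.399529   # wall-clock seconds

avg_cores = final_cpu_time / final_elapsed_time
print(round(avg_cores, 2))  # 2.33 CPU cores busy on average over the run
```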

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 58297 - Posted: 19 Jan 2022 | 9:17:03 UTC - in response to Message 58296.

It's not a GPU application. It uses both CPU and GPU.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58298 - Posted: 19 Jan 2022 | 9:49:39 UTC - in response to Message 58296.

Do you mean changing some of the BOINC parameters like it was done in the case of <rsc_fpops_est>?

Is that to better define the resources required by the tasks?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58299 - Posted: 19 Jan 2022 | 11:03:54 UTC - in response to Message 58298.

It would need to be done in the plan class definition. Toni said that you define your plan classes in C++ code, so there are some examples in Specifying plan classes in C++.

Unfortunately, the BOINC developers didn't consider your use-case of mixing CPU elements and GPU elements in the same task, so none of the examples really match - your app is a mixture of MT and CUDA classes. What we need (or at least, would like to see) at this end are realistic values for <avg_ncpus> and <coproc><count>.
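For comparison, BOINC also allows declaring plan classes in XML (plan_class_spec.xml); a hedged sketch of a mixed CPU+GPU class (the element values are illustrative assumptions, not GPUGrid's actual configuration):

```xml
<plan_classes>
    <!-- Hypothetical class combining CUDA with realistic CPU usage,
         so the scheduler can account for both resources. -->
    <plan_class>
        <name>cuda1121</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <min_nvidia_compcap>370</min_nvidia_compcap>
        <avg_ncpus>4</avg_ncpus>
        <ngpus>1</ngpus>
    </plan_class>
</plan_classes>
```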

FritzB
Send message
Joined: 7 Apr 15
Posts: 12
Credit: 2,779,641,100
RAC: 1,089,373
Level
Phe
Scientific publications
wat
Message 58300 - Posted: 19 Jan 2022 | 19:00:18 UTC

it seems to work better now, but I've reached the time limit after 1800 sec:
https://www.gpugrid.net/result.php?resultid=32734648


19:39:23 (6124): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58301 - Posted: 19 Jan 2022 | 20:55:08 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58302 - Posted: 19 Jan 2022 | 22:28:41 UTC - in response to Message 58301.

I'm still running them at 1 CPU plus 1 GPU. They run fine, but when they are busy on the CPU-only sections, they steal time from the CPU tasks that are running at the same time - most obviously from CPDN.

Because these tasks are defined as GPU tasks, and GPU tasks are given a higher run priority than CPU tasks by BOINC ('below normal' against 'idle'), the real CPU project will always come off worst.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58303 - Posted: 20 Jan 2022 | 0:27:39 UTC - in response to Message 58302.
Last modified: 20 Jan 2022 | 0:28:14 UTC

You could employ ProcessLasso on the apps and up their priority I suppose.

When I ran Windows, I really utilized that utility to make the apps run the way I wanted them to, and not how BOINC sets them up on its own agenda.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58304 - Posted: 20 Jan 2022 | 6:46:45 UTC - in response to Message 58301.

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I think that the Python GPU app is very efficient at adapting to any number of CPU cores and taking advantage of available CPU resources.
This seems to be somewhat independent of the ncpus parameter in the GPUGrid app_config.xml.

Setup at my twin GPU system is as follows:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.49</cpu_usage>
</gpu_versions>
</app>

And setup for my triple GPU system is as follows:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.33</cpu_usage>
</gpu_versions>
</app>

The purpose of this is to be able to run two or three concurrent Python GPU tasks, respectively, without reaching a full "1" CPU core (2 x 0.49 = 0.98; 3 x 0.33 = 0.99). Then, I manually control CPU usage by setting "Use at most XX % of the CPUs" in BOINC Manager for each system, according to its number of CPU cores.
This allows me to concurrently run "N" Python GPU tasks and a fixed number of other CPU tasks, as desired.
But as said, the GPUGrid Python GPU app seems to take CPU resources as needed to successfully process its tasks... at the cost of slowing down the other CPU applications.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58305 - Posted: 20 Jan 2022 | 7:44:41 UTC

Yes, I use Process Lasso on all my Windows machines, but I haven't explored its use under Linux.

Remember that ncpus and similar settings have no effect whatsoever on the actual running of a BOINC project app - there is no 'control' element to their operation. The only effect they have is on BOINC's scheduling - how many tasks are allowed to run concurrently.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58306 - Posted: 20 Jan 2022 | 15:58:45 UTC - in response to Message 58300.

This message

19:39:23 (6124): task /usr/bin/flock reached time limit 1800


indicates that, after 30 minutes, the installation of miniconda and the task environment setup had not finished.

Consequently, python is not found later on to execute the task, since it is one of the requirements of the miniconda environment.

application ./gpugridpy/bin/python missing


Therefore, it is not an error in itself; it just means that the miniconda setup went too slowly for some reason (in theory, 30 minutes should be enough time). Maybe the machine is slower than usual, or the connection is slow and dependencies are not being downloaded.

We could extend this timeout, but normally, if 30 minutes is not enough for the miniconda setup, another underlying problem probably exists.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58307 - Posted: 20 Jan 2022 | 16:18:58 UTC - in response to Message 58306.

It seems to be a reasonably fast system. My guess is another type of permissions issue that blocks the python install until it hits the timeout, or the CPUs are being too heavily used and not giving enough resources to the extraction process.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58308 - Posted: 20 Jan 2022 | 22:15:20 UTC - in response to Message 58305.

There is no Linux equivalent of Process Lasso.

But there is a Linux equivalent of Windows Process-Explorer

https://github.com/wolfc01/procexp

Screenshots of the application at the old SourceForge repo.

https://sourceforge.net/projects/procexp/

Can dynamically change the nice value of the application.

There is also the command-line schedtool utility that can easily be used from a bash file. I used to run it all the time in my gpuoverclock.sh script for SETI CPU and GPU apps.
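A minimal sketch of the renice approach from a bash script (a background sleep stands in for a real task process here; schedtool would additionally allow changing the scheduling policy):

```shell
# Start a dummy background process as a stand-in for a running task.
sleep 30 &
pid=$!
# Raise its nice value (higher nice = lower CPU priority); no root needed
# when increasing the value.
renice -n 10 -p "$pid" > /dev/null
ni=$(ps -o ni= -p "$pid" | tr -d ' ')
echo "process $pid now at nice $ni"
kill "$pid"
```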

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58309 - Posted: 21 Jan 2022 | 12:14:55 UTC - in response to Message 58308.

Well, that got me a long way.

There are dependencies listed for Mint 18.3 - I'm running Mint 20.2

The apt-get for the older version of Mint returns

E: Unable to locate package python-qwt5-qt4
E: Unable to locate package python-configobj

Unsurprisingly, the next step returns

Traceback (most recent call last):
File "./procexp.py", line 27, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets, uic
ModuleNotFoundError: No module named 'PyQt5'

htop, however, shows about 30 multitasking processes spawned from main, each using around 2% of a CPU core (varying by the second) at nice 19. At the time of inspection, that is. I'll go away and think about that.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58310 - Posted: 21 Jan 2022 | 17:41:41 UTC - in response to Message 58300.

I've one task now that had the same timeout issue getting python. The host was running fine on these tasks before and I don't know what has changed.

I've aborted a couple tasks now that are not making any progress after 20 hours or so and are stuck at 13% completion. Similar series tasks are showing much more progress after only a few minutes. Most complete in 5-6 hours.

I reset the project thinking something got corrupted in the downloaded libraries but that has not fixed anything.

Need to figure out how to debug the tasks on this host.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58311 - Posted: 21 Jan 2022 | 17:42:23 UTC - in response to Message 58309.

You might look into schedtool as an alternative.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 49,823
Level
Trp
Scientific publications
watwatwat
Message 58317 - Posted: 29 Jan 2022 | 21:23:39 UTC - in response to Message 58301.
Last modified: 29 Jan 2022 | 22:08:45 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.
Very interesting. Does this actually limit PythonGPU to using at most 5 CPU threads?
Does it work better than:
<app_config>
<!-- i9-7980XE 18c36t 32 GB L3 Cache 24.75 MB -->
<app>
<name>PythonGPU</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<avg_ncpus>5</avg_ncpus>
<cmdline>--nthreads 5</cmdline>
<fraction_done_exact/>
</app>
</app_config>
Edit 1: To answer my own question, I changed cpu_usage to 5 and am running a single PythonGPU WU with nothing else going on. The System Monitor shows 5 CPUs running in the 60 to 80% range, with all other CPUs running in the 10 to 40% range.
Is there any way to stop it from taking over one's entire computer?
Edit 2: I turned on WCG and the group of 5 went up to 100%, and all the rest went to OPN in the 80 to 95% range.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58318 - Posted: 30 Jan 2022 | 5:24:25 UTC - in response to Message 58317.

No. Setting that value won’t change how much CPU is actually used. It just tells BOINC how much of the CPU is being used so that it can properly account for resources.

This app will use 32 threads and there’s nothing you can do in BOINC configuration to change that. This has always been the case though.
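If you really do need to cap the app's CPU footprint, it has to happen outside of BOINC. One sketch (an OS-level workaround, not anything the app supports natively) is to restrict the task's process to a fixed set of cores via Linux CPU affinity; the 32 worker threads still exist, they just share fewer cores:

```python
import os

# Sketch: limit the current process (0 = self) to at most 5 logical CPUs.
# For a running BOINC task you would pass its PID instead, e.g. found with
# `pgrep -f PythonGPU` (the process name here is an assumption).
allowed = sorted(os.sched_getaffinity(0))[:5]
os.sched_setaffinity(0, set(allowed))

print(len(os.sched_getaffinity(0)))  # at most 5
```

The same thing can be done from the shell with `taskset -a -cp 0-4 <pid>`.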
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58320 - Posted: 2 Feb 2022 | 22:06:09 UTC

This morning, in a routine system update, I noticed that BOINC Client / Manager was updated from Version 7.16.17 to Version 7.18.1.
It would be interesting to know whether PrivateTmp=true is set as a default in this new version, thus in some way helping Python GPU tasks to succeed...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58321 - Posted: 2 Feb 2022 | 23:06:32 UTC - in response to Message 58320.

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good (it's been available for Android since August last year), but I don't yet know the answer to your specific question - there hasn't been any chatter about testing or new releases in the usual places.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58322 - Posted: 2 Feb 2022 | 23:47:29 UTC - in response to Message 58321.
Last modified: 2 Feb 2022 | 23:50:53 UTC

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good

It bombed out on the Rosetta pythons; they did not run at all (a VBox problem undoubtedly). And it failed all the validations on QuChemPedIA, which does not use VirtualBox on the Linux version. But it works OK on CPDN, WCG/ARP and Einstein/FGRBP (GPU). All were on Ubuntu 20.04.3.

So be prepared to bail out if you have to.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58324 - Posted: 3 Feb 2022 | 6:29:43 UTC - in response to Message 58321.

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took a screenshot of the offered updates before updating.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58325 - Posted: 3 Feb 2022 | 9:25:23 UTC - in response to Message 58324.

My PPA gives slightly more information on the available update:



I know that it's auto-generated from the Debian package maintenance sources, which is probably the ultimate source of the Ubuntu LTS package as well. I've had a quick look round, but there's no sign so far that this release was originated by BOINC developers: in particular, no mention was made of it during the BOINC projects conference call on January 14th 2022. I'll keep digging.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58327 - Posted: 3 Feb 2022 | 12:13:36 UTC
Last modified: 3 Feb 2022 | 12:34:19 UTC

OK, I've taken a deep breath and enough coffee - applied all updates.

WARNING - the BOINC update appears to break things.

The new systemd file, in full, is

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection

[Install]
WantedBy=multi-user.target

Note the line I've picked out. That starts with a # sign, for comment, so it has no effect: PrivateTmp is undefined in this file.

New work became available just as I was preparing to update, so I downloaded a task and immediately suspended it. After the updates, and enough reboots to get my NVidia drivers functional again (it took three this time), I restarted BOINC and allowed the task to run.

Task 32736884

Our old enemy "INTERNAL ERROR: cannot create temporary directory!" is back. Time for a systemd over-ride file, and to go fishing for another task.

Edit - updated the file, as described in message 58312, and got task 32736938. That seems to be running OK, having passed the 10% danger point. Result will be in sometime after midnight.
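For anyone following along, the over-ride is created with sudo systemctl edit boinc-client.service, which opens an editor on a drop-in file. Based on the commented-out line in the unit file above (message 58312 has the authoritative steps), the minimal contents would be:

```ini
[Service]
PrivateTmp=true
```

Save it, run sudo systemctl daemon-reload, and restart the client. Because the drop-in lives under /etc/systemd/system/boinc-client.service.d/ rather than in the packaged unit file, it survives package updates. Per the comment in the unit file, PrivateTmp=true may block X11 idle detection - a trade-off to be aware of.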

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58328 - Posted: 3 Feb 2022 | 23:34:25 UTC

I see your task completed normally with the PrivateTmp=true uncommented in the service file.

But is the repeating warning:

wandb: WARNING Path /var/lib/boinc-client/slots/11/.config/wandb/wandb/ wasn't writable, using system temp directory

a normal entry for those using the standard BOINC location installation?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58329 - Posted: 4 Feb 2022 | 9:04:58 UTC - in response to Message 58328.

No, that's the first time I've seen that particular warning. The general structure is right for this machine, but it doesn't usually reach as high as 11 - GPUGrid normally gets slot 7. Whatever - there were some tasks left waiting after the updates and restarts.

I think this task must have run under a revised version of the app - the next stage in testing. The output is slightly different in other ways, and the task ran for a significantly shorter time than other recent tasks. My other machine, which hasn't been updated yet, got the same warnings in a task running at the same time.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58330 - Posted: 4 Feb 2022 | 9:14:25 UTC - in response to Message 58328.
Last modified: 4 Feb 2022 | 9:23:48 UTC

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58331 - Posted: 4 Feb 2022 | 9:25:40 UTC - in response to Message 58329.

Yes, this experiment uses a slightly modified version of the algorithm, which should be faster. It runs the same number of interactions with the reinforcement learning environment, so the credit amount is the same.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58332 - Posted: 4 Feb 2022 | 9:38:39 UTC - in response to Message 58330.

I'll take a look at the contents of the slot directory, next time I see a task running. You're right - the entire '/var/lib/boinc-client/slots/n/...' structure should be writable, to any depth, by any program running under the boinc user account.

How is the '.config/wandb/wandb/' component of the path created? The doubled '/wandb' looks unusual.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58333 - Posted: 4 Feb 2022 | 9:44:30 UTC - in response to Message 58332.
Last modified: 4 Feb 2022 | 9:55:30 UTC

The directory paths are defined as environment variables in the python script.

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.path.join(os.getcwd(), ".config/wandb")


Then the directories are created by the wandb python package (which handles logging of relevant training data). I suspect the permissions are set at creation time, so it is not a BOINC problem. I will change the paths in future jobs to:

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.getcwd()


Note that "os.getcwd()" is the working directory, so "/var/lib/boinc-client/slots/11/" in this case
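As an aside, the fallback behaviour behind the warning can be illustrated with a short sketch - this only mimics the kind of check a logging library might do before falling back to the system temp directory, not the actual wandb internals:

```python
import os
import tempfile

def pick_log_dir(preferred: str) -> str:
    """Return `preferred` if we can create and write to it,
    otherwise fall back to the system temp directory."""
    try:
        os.makedirs(preferred, exist_ok=True)
        probe = os.path.join(preferred, ".write_probe")
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
        return preferred
    except OSError:
        # Not writable (e.g. permission denied) -> same fallback
        # the warning message describes.
        return tempfile.gettempdir()

# A writable location resolves to itself; an unwritable one falls back.
print(pick_log_dir(os.path.join(tempfile.gettempdir(), "wandb_demo")))
```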
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58334 - Posted: 4 Feb 2022 | 13:32:42 UTC - in response to Message 58330.

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.


what happens if that directory doesn't exist? several of us run BOINC in a different location. since it's in /var/lib/ the process wont have permissions to create the directory, unless maybe if BOINC is run as root.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58335 - Posted: 4 Feb 2022 | 14:22:26 UTC - in response to Message 58334.

'/var/lib/boinc-client/' is the default BOINC data directory for Ubuntu BOINC service (systemd) installations. It most certainly exists, and is writable, on my machine, which is where Keith first noticed the error message in the report of a successful run. During that run, much will have been written to .../slots/11

Since abouh is using code to retrieve the working (i.e. BOINC slot) directory, the correct value should be returned for non-default data locations - otherwise BOINC wouldn't be able to run at all.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58336 - Posted: 4 Feb 2022 | 15:33:49 UTC - in response to Message 58335.
Last modified: 4 Feb 2022 | 15:39:39 UTC

I'm aware it's the default location on YOUR computer, and on others running the standard Ubuntu repository installer. But the message from abouh sounded like this directory was hard-coded, since he gave the entire path. For folks running BOINC in another location, this directory will not be the same. If it uses a relative file path, then it's fine, but I was seeking clarification.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58337 - Posted: 4 Feb 2022 | 15:59:00 UTC - in response to Message 58336.
Last modified: 4 Feb 2022 | 16:21:03 UTC

Hard path coding was removed before this most recent test batch.

edit - see message 58292: "Should be easy to fix".

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58338 - Posted: 4 Feb 2022 | 22:13:21 UTC - in response to Message 58336.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.


Yes. I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client


I also do these to allow monitoring by BoincTasks over the LAN on my Win10 machine:
• Copy “cc_config.xml” to /etc/boinc-client folder
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Reboot

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58339 - Posted: 5 Feb 2022 | 9:10:09 UTC - in response to Message 58334.
Last modified: 5 Feb 2022 | 11:01:11 UTC

The directory should be created wherever you run BOINC, so that is not a problem.

It is created inside the boinc-client directory, but it does not matter whether that directory is in /var/lib/ or somewhere else.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58340 - Posted: 5 Feb 2022 | 11:05:20 UTC - in response to Message 58338.
Last modified: 5 Feb 2022 | 11:05:38 UTC

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify the security your system provides through different access-rights levels.
This practice should be avoided at all costs.
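A less drastic middle ground - sketched here on a throwaway directory rather than prescribed for your system - is to add your user to the boinc group (sudo adduser <user> boinc, as already mentioned above) and then grant group access only, mode 770, instead of world access with 777:

```python
import os
import stat
import tempfile

# Demonstrated on a temp directory; on a real install the equivalent
# would be `sudo chgrp -R boinc /var/lib/boinc-client` followed by
# `sudo chmod -R 770 /var/lib/boinc-client`.
d = tempfile.mkdtemp()
os.chmod(d, 0o770)  # owner and group: rwx; everyone else: no access

mode = stat.S_IMODE(os.stat(d).st_mode)
print(oct(mode))  # 0o770
```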

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58341 - Posted: 5 Feb 2022 | 11:50:02 UTC - in response to Message 58327.
Last modified: 5 Feb 2022 | 12:07:55 UTC

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58342 - Posted: 5 Feb 2022 | 13:27:24 UTC - in response to Message 58340.

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided by all costs.

I am on an isolated network behind a firewall/router. No problem at all.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58343 - Posted: 5 Feb 2022 | 13:28:42 UTC - in response to Message 58342.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58344 - Posted: 5 Feb 2022 | 13:30:13 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

All I know is that the new build does not work at all on Cosmology with VirtualBox 6.1.32. A work unit just suspends immediately on startup.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58345 - Posted: 5 Feb 2022 | 13:30:54 UTC - in response to Message 58343.
Last modified: 5 Feb 2022 | 13:33:37 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58346 - Posted: 5 Feb 2022 | 13:34:08 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58347 - Posted: 5 Feb 2022 | 13:40:51 UTC - in response to Message 58345.
Last modified: 5 Feb 2022 | 13:41:07 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58348 - Posted: 5 Feb 2022 | 13:56:12 UTC - in response to Message 58347.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58349 - Posted: 5 Feb 2022 | 14:08:17 UTC - in response to Message 58348.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58350 - Posted: 5 Feb 2022 | 14:11:10 UTC - in response to Message 58349.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

What comparable isolation do you get in Windows from one program to another?
Or what security are you talking about? Port security from external sources?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58351 - Posted: 5 Feb 2022 | 15:28:34 UTC - in response to Message 58350.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?
What comparable isolation do you get in Windows from one program to another?
Security descriptors were introduced in the NTFS 1.2 file system, released in 1996 with Windows NT 4.0. The access control lists in NTFS are more complex in some aspects than in Linux. All modern Windows versions use NTFS by default.
User Account Control was introduced in 2007 with Windows Vista (apps don't run as administrator, even if the user has administrative privileges, until the user elevates them through an annoying popup).
Or what security are you talking about? Port security from external sources?
The Windows firewall was introduced with Windows XP SP2 in 2004.

This is my last post in this thread about (undermining) filesystem security.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58352 - Posted: 5 Feb 2022 | 16:53:05 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Updated my second machine. It appears that this re-release is NOT related to the systemd problem: the PrivateTmp=true line is still commented out.

Re-apply the fix (#1) from message 58312 after applying this update, if you wish to continue running the Python test apps.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58353 - Posted: 5 Feb 2022 | 16:54:05 UTC - in response to Message 58351.
Last modified: 5 Feb 2022 | 17:25:41 UTC

I think you are correct, except in the term "undermining", which is not appropriate for isolated crunching machines. There is a billion-dollar AV industry for Windows. Apparently someone has figured out how to undermine it there. But I agree that no more posts are necessary.

EDIT: I probably should have said that it was only for isolated crunching machines at the outset. If I were running a server, I would do it differently.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58354 - Posted: 5 Feb 2022 | 18:15:50 UTC
Last modified: 5 Feb 2022 | 18:16:08 UTC

While chmod 777-ing is bad practice in general, there’s little harm in blowing up the BOINC directory like that. The worst that can happen is that you modify or delete a necessary file by accident and break BOINC. Just reinstall and learn the lesson. Not the end of the world in this instance.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58355 - Posted: 5 Feb 2022 | 19:20:07 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.


Ubuntu 20.04.3 LTS is still on the older 7.16.6 version.

apt list boinc-client
Listing... Done
boinc-client/focal 7.16.6+dfsg-1 amd64

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58356 - Posted: 5 Feb 2022 | 19:26:13 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update, apt list boinc-client and apt show boinc-client, and still come up with the older 7.16.6 version.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1076
Credit: 40,231,533,983
RAC: 119
Level
Trp
Scientific publications
wat
Message 58357 - Posted: 5 Feb 2022 | 22:22:11 UTC - in response to Message 58356.

I think they use a different PPA, not the standard Ubuntu version.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2353
Credit: 16,304,135,139
RAC: 3,392,378
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58358 - Posted: 5 Feb 2022 | 22:52:53 UTC - in response to Message 58356.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version.
It's from http://ppa.launchpad.net/costamagnagianfranco/boinc/ubuntu
Sorry for the confusion.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,041,343,140
RAC: 16,958,283
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58359 - Posted: 5 Feb 2022 | 23:07:14 UTC - in response to Message 58357.

I think they use a different PPA, not the standard Ubuntu version.

You're right. I've checked, and this is my complete repository listing.
There are new pending updates for the BOINC package, but I've recently caught a new ACEMD3 ADRIA task, and I'm not updating until it is finished and reported.
My experience warns that these tasks are highly prone to failing if something is changed while they are processing.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58360 - Posted: 6 Feb 2022 | 8:10:43 UTC - in response to Message 58324.
Last modified: 6 Feb 2022 | 8:15:37 UTC

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating.

Ah. Your reply here gave me a different impression. Slight egg on face, but both our Linux update manager screenshots fail to give source information in their consolidated update lists. Maybe we should put in a feature request?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58361 - Posted: 6 Feb 2022 | 12:39:46 UTC
Last modified: 6 Feb 2022 | 12:40:31 UTC

ACEMD3 task finished on my original machine, so I updated BOINC from PPA 2022-01-30 to 2022-02-04.

I can confirm that if you used systemctl/edit to create a separate over-ride file, it remains in place - no need to re-edit every time. If you used a text editor to edit the raw systemd file in place, of course, it'll get over-written and will need editing again.

(final proof-of-the-pudding of that last statement awaits the release of the next test batch)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58362 - Posted: 6 Feb 2022 | 17:13:30 UTC

Got a new task (task 32738148). Running normally, which confirms the systemd override is preserved.

Getting entries in stderr as before:

wandb: WARNING Path /var/lib/boinc-client/slots/7/.config/wandb/wandb/ wasn't writable, using system temp directory

(we're back in slot 7 as usual)

There are six folders created in slot 7:

agent_demos
gpugridpy
int_demos
monitor_logs
python_dependencies
ROMS

There are no hidden folders, and certainly no .config

wandb data is in:

/tmp/systemd-private-f670b90d460b4095a25c37b7348c6b93-boinc-client.service-7Jvpgh/tmp

There are 138 folders in there, including one called simply wandb

wandb contains:

debug-internal.log
debug.log
latest-run
run-20220206_163543-1wmmcgi5

The first two are files, the last two are folders. There is no subfolder called wandb - so no recursion, such as the warning message suggests. Hope that helps.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58363 - Posted: 7 Feb 2022 | 8:13:08 UTC - in response to Message 58362.

Thanks! The content of the slot directory is correct.

The wandb directory will also be placed in the slot directory soon, in the next experiment. During the current experiment, which consists of multiple batches of tasks, the wandb directory will still be in /tmp, as a result of the warning.

That is not a problem per se, but I agree it will be cleaner to place it in the slot directory, so that all BOINC files are there.
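For anyone curious how the relocation can be done: wandb honours the WANDB_DIR environment variable (and the `dir` argument to `wandb.init`) when choosing where to write run data, so setting it to the slot directory before wandb initialises should avoid the fallback to the system temp directory. A minimal sketch - the slot path shown is just an example:

```python
import os

def point_wandb_at(slot_dir):
    """Direct wandb's run files to the given directory.

    wandb falls back to the system temp directory when its default
    location is not writable; setting WANDB_DIR before wandb.init()
    avoids that fallback. The slot path used here is illustrative.
    """
    os.environ["WANDB_DIR"] = slot_dir
    return os.environ["WANDB_DIR"]

print(point_wandb_at("/var/lib/boinc-client/slots/7"))
```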
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58364 - Posted: 9 Feb 2022 | 9:56:19 UTC - in response to Message 58363.

wandb: Run data is saved locally in /var/lib/boinc-client/slots/7/wandb/run-20220209_082943-1pdoxrzo

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58365 - Posted: 10 Feb 2022 | 9:33:48 UTC - in response to Message 58364.
Last modified: 10 Feb 2022 | 9:34:28 UTC

Great, thanks a lot for the confirmation. So it seems the directory is now the appropriate one.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 113,729,139
RAC: 1,556,285
Level
Cys
Scientific publications
wat
Message 58367 - Posted: 17 Feb 2022 | 17:38:34 UTC

Pretty happy to see that my little Quadro K620s could actually handle one of the ABOU work units. Successfully ran one in under 31 hours. It didn't hit the memory too hard, which helps. The K620 has a DDR3 memory bus so the bandwidth is pretty limited.

http://www.gpugrid.net/result.php?resultid=32741283

Though, it did fail one of the Anaconda work units that went out. The error message doesn't mean much to me.

http://www.gpugrid.net/result.php?resultid=32741757


Traceback (most recent call last):
  File "run.py", line 40, in <module>
    assert os.path.exists('output.coor')
AssertionError
11:22:33 (1966061): ./gpugridpy/bin/python exited; CPU time 0.295254
11:22:33 (1966061): app exit status: 0x1
11:22:33 (1966061): called boinc_finish(195)
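The terse message is down to the bare `assert` in run.py: a failed assert with no message reports nothing but "AssertionError". If the check were written along these lines (a sketch, not the project's actual code), the log would at least name the missing file:

```python
import os
import sys

def require_output(path="output.coor"):
    # A bare `assert os.path.exists(path)` dies with just "AssertionError";
    # exiting with an explicit message puts the missing filename and the
    # working directory into stderr, which is all that reaches the task log.
    if not os.path.exists(path):
        sys.exit(f"missing expected output file {path!r} in {os.getcwd()}")
```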

Profile [AF] fansyl
Send message
Joined: 26 Sep 13
Posts: 20
Credit: 1,714,356,441
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58368 - Posted: 17 Feb 2022 | 20:12:35 UTC

All tasks fail with errors on this machine: https://www.gpugrid.net/results.php?hostid=591484

Note that this machine does not have a GPU usable by BOINC.

Thanks for your help.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 9,096,883,853
RAC: 17,989,612
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58369 - Posted: 18 Feb 2022 | 10:27:49 UTC - in response to Message 58368.

I got two of those yesterday as well. They are described as "Anaconda Python 3 Environment v4.01 (mt)" - declared to run as multi-threaded CPU tasks. I do have working GPUs (on host 508381), but I don't think these tasks actually need a GPU.

The task names refer to a different experimenter (RAIMIS) from the ones we've been discussing recently in this thread.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1354
Credit: 7,798,542,955
RAC: 9,221,356
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58370 - Posted: 18 Feb 2022 | 18:55:22 UTC

We were running those kinds of tasks a year ago. Looks like the researcher has made an appearance again.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58371 - Posted: 18 Feb 2022 | 21:12:05 UTC
Last modified: 18 Feb 2022 | 21:47:13 UTC

I just downloaded one, but it errored out before I could even catch it starting. It ran for 3 seconds, required four cores of a Ryzen 3950X on Ubuntu 20.04.3, and had an estimated time of 2 days. I think they have some work to do.
http://www.gpugrid.net/result.php?resultid=32742752

PS
- It probably does not help that that machine is running BOINC 7.18.1. I have had problems with it before. I will try 7.16.6 later.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58372 - Posted: 18 Feb 2022 | 22:14:30 UTC - in response to Message 58371.
Last modified: 18 Feb 2022 | 22:15:49 UTC

PPS - It ran for two minutes on an equivalent Ryzen 3950X running BOINC 7.16.6, and then errored out.