World Community Grid Forums
Thread Status: Active | Total posts in this thread: 77
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Yesterday I got a lot of GPU WUs all day, but not one today. I haven't changed anything in the setup. And another question: I have enabled the CPU and both GPUs on my computer, but only the NVIDIA GTX 1650 Super is used; the Intel UHD Graphics 630 is always idle. Can the software only use one GPU at a time?
----------------------------------------
widdershins
Veteran Cruncher. Scotland. Joined: Apr 30, 2007. Post Count: 674. Status: Offline.
Quote:
"I totally understand the ease of support in having to manage and maintain only a single code base, but I don't think you can definitively say that the OpenCL version is faster than the CUDA version if you never even tested it. If you mean 'faster' in terms of faster completion of the entire project by considering a larger user base, and not just that NVIDIA OpenCL is faster than NVIDIA CUDA (which I would strongly contest), then you'd need to weigh the speedup of CUDA against how many AMD and Intel devices would cease to contribute. Do you have any statistics to share about what percentage of total FLOPS the project sees from each device type? Without knowing that, you really can't draw that kind of conclusion."

I think the ease of support is the limiting factor here. Keep in mind that it has taken many, many years to get another GPU project of any flavour running on WCG again. I suspect that the hurdles to porting applications to run efficiently on GPUs are not trivial, or everyone would have done it by now.

Another factor is that even running with OpenCL, and with pauses in GPU usage, there is still not enough GPU work available to meet demand. So arguing that CUDA is better or faster than OpenCL is pointless for this project. Improved performance through better coding or porting to CUDA would take time and resources, but would not process a single extra unit per day, since the volume of units available is already the limiting factor, not the code. WCG and the researchers are both working with tight budgets.

Personally, I feel that if the researchers or WCG have any spare development resources, the maximum benefit would come from increasing the volume of GPU work to fully use the existing GPU capacity, rather than improving the app so we had even more unused capacity.
----------------------------------------
pokemonlover1234
Cruncher. Joined: Mar 4, 2021. Post Count: 26. Status: Offline.
Even though we are limited in the amount of GPU work available, the number of points this project is bringing in has approximately doubled compared to the CPU jobs alone. As much as I wish the full potential of all our available GPUs were being utilized and we had more work to do, what we have been getting is still very significant.
[Edit 2 times, last edit by pokemonlover1234 at Apr 8, 2021 11:53:55 PM]
----------------------------------------
uplinger
Former World Community Grid Tech. Joined: May 23, 2005. Post Count: 3952. Status: Offline.
Quote (uplinger, earlier):
"There are a few reasons the CUDA version was not tested or used. From early on, having only one version to maintain is a lot easier for testing and general support. Thus, the focus on OpenCL allows all three GPU types to participate instead of just one. Due to this focus, the OpenCL version is actually faster than the CUDA version."

Quote (reply):
"I totally understand the ease of support in maintaining a single code base, but I don't think you can definitively say that the OpenCL version is faster than the CUDA version if you never even tested it. [...] If you don't want to complicate your support efforts with two app versions for different device types, like I said, I understand. But the bigger issue a lot of us have is with the constant 0-100% behavior, and that can be better optimized. I hope the team is considering improvements rather than taking the 'it's good enough' approach."

Good evening,

It appears that I have raised a few eyebrows with my statement that OpenCL is faster than the CUDA version. This is in fact true: the CUDA version of the code runs about 20% slower on average, and in some cases up to 2x slower, than the OpenCL version. These are specific numbers given to me by the researchers. My statement wasn't that CUDA isn't faster than OpenCL in general; it was specific to this application. I do not know how much time the researchers allotted to the GPU code, or how much effort they put into the two different paths that are needed.

I do not claim to know how much effort would be needed to bring the CUDA version to the state of the OpenCL one, but in its current state the code runs faster under OpenCL than under CUDA.

As for the "it's good enough" approach, I agree that there are some things that can be improved upon. At the moment, I am making sure this current version works properly. I do have in mind a possible way to keep the GPU running more consistently, but it is a major change to how jobs are submitted and generated. This initial version was created to stay as similar to the CPU version as possible, to keep consistency across the entire pipeline of scripts. Keeping it similar allows us to compare results and groupings to make sure things operate properly. I would also like to correct the checkpointing issue, where the app is not following the checkpoint rules, and possibly find a way to write to disk less frequently. Again, these are not minor changes; they require changes to the code as well as testing from scratch.

Thanks,
-Uplinger
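The checkpoint fix uplinger describes (following the checkpoint rules and writing to disk less often) amounts to rate-limiting writes to a minimum interval. Here is a minimal, hypothetical sketch in Python, not WCG code; a real BOINC app would consult the client (e.g. via boinc_time_to_checkpoint()) rather than its own timer:

```python
import time

class CheckpointThrottle:
    """Rate-limit checkpoint writes to a minimum interval, in the
    spirit of BOINC's rule that an app should only checkpoint when
    enough time has passed since the last write. Hypothetical sketch."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_write = 0.0  # time of last checkpoint (0.0 = never)

    def should_checkpoint(self, now=None):
        """Return True (and reset the timer) if a write is due."""
        if now is None:
            now = time.monotonic()
        if now - self.last_write >= self.min_interval_s:
            self.last_write = now
            return True
        return False
```

The app would call should_checkpoint() at each safe point in the job loop and skip the disk write whenever it returns False, which directly reduces write frequency.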
----------------------------------------
Ian-n-Steve C.
Senior Cruncher. United States. Joined: May 15, 2020. Post Count: 180. Status: Offline.
Quote (uplinger, in the post above):
"The CUDA version of the code runs about 20% slower on average and in some cases up to 2x slower than the OpenCL version. These are specific numbers given to me from the researchers. [...]"

I'd have to guess that they were maybe using something misconfigured, or an old CUDA version, or that the CUDA code for whatever reason wasn't as optimized as the OpenCL variant (this can happen; the stock CUDA apps on SETI were slower than OpenCL, but they were very old, using something like CUDA 5.0). CUDA 10+ on any more modern GPU would make a clear difference. But without specifics on how they compiled the application or what hardware they were testing on, I can't speculate further.

Check my previous post about adding a mutex to the app. This should be doable and would minimize the dwell time between jobs. It would benefit all devices, since the start-stop behavior doesn't seem to be specific to NVIDIA GPUs. From what others have mentioned, this happens because multiple jobs are prepackaged into a single WU.

If that is the case, they could use something like a mutex lock to preload data and prepare the next computation while the current one is ongoing. This would allow the GPU to remain at near 100% load the entire time. We did this with the SETI CUDA app and recorded as little as 1 ms (probably the limit of our ability to measure) between one WU and the next.

----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Edit 1 times, last edit by Ian-n-Steve C. at Apr 9, 2021 1:58:27 AM]
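The overlap Ian describes (stage the next job's data while the current one computes, so the device never idles between jobs) can be sketched with a loader thread feeding a bounded queue. This is a generic illustration with placeholder load/compute functions, not the SETI or WCG implementation:

```python
import threading
import queue

def prefetch_pipeline(jobs, load, compute, prefetch_depth=2):
    """Overlap data loading with computation: a background thread
    stages upcoming jobs into a bounded queue so the consumer (the
    'GPU' here) never waits on I/O between jobs. load/compute are
    placeholder callables standing in for real staging and kernels."""
    staged = queue.Queue(maxsize=prefetch_depth)
    DONE = object()  # unique sentinel marking end of work

    def loader():
        for job in jobs:
            staged.put(load(job))   # blocks when the queue is full
        staged.put(DONE)

    threading.Thread(target=loader, daemon=True).start()

    results = []
    while True:
        item = staged.get()
        if item is DONE:
            break
        results.append(compute(item))
    return results
```

The bounded queue plays the role of the mutex-guarded handoff: the loader can run at most prefetch_depth jobs ahead, so memory stays bounded while the compute side always finds the next input already staged.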
----------------------------------------
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Quote:
"I hope the team is considering improvements rather than taking the 'it's good enough' approach."

Confucius: "Better a diamond with a flaw than a pebble without..."
----------------------------------------
Mad_Max
Cruncher. Russia. Joined: Nov 26, 2012. Post Count: 22. Status: Offline.
Quote:
"Congrats to the WCG admins, developers, scientists and beta testers for making this possible. The results are already being seen with an 18% increase (today) in returned results. Hope it continues to grow. Luck to all."

Actually, it is not just an ~18% increase. Total computing throughput has already more than doubled since the GPU app launched, even at the current limited scale of GPU use, because each GPU WU includes about 10 times more modeling work than a regular CPU WU. You can see this in the detailed WU logs: 15-40 jobs completed per GPU WU, compared to 1-4 jobs per CPU WU. You can also see it in the points granted per validated GPU WU, which are likewise about 10 times higher than for regular CPU WUs.

So anyone who thinks that, with the same execution time on CPU and GPU, there is no point in using the GPU at all (just a waste of resources) is wrong: even with equal execution times (a really slow GPU against a fast CPU), the GPU produces about 10 times more useful work. And on a decent GPU it is at least 100 times more work than one CPU core.

This also explains why the current GPU batches look small: they are not actually small at all. Each GPU batch contains about the same amount of work as a CPU batch (or even more), just packed into far fewer WUs, to avoid very short WU run times and to reduce WU-management overhead.

It also explains the frequent 0-100-0% GPU load swings: it is not poor optimization of the app itself. One full job inside a WU completes in about a dozen seconds on a modern GPU, and there is a gap while one job finishes and the next one starts. This happens not just at the end of each WU, but many times during the crunching of a single WU.

[Edit 2 times, last edit by Mad_Max at Apr 9, 2021 10:29:57 PM]
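Mad_Max's "about 10 times more work per WU" claim follows directly from the jobs-per-WU figures he cites. A back-of-the-envelope check (illustrative arithmetic from his numbers, not official WCG statistics):

```python
# Midpoints of the jobs-per-WU ranges Mad_Max quotes from the WU logs
cpu_jobs_per_wu = (1 + 4) / 2       # a CPU WU packs roughly 1-4 jobs
gpu_jobs_per_wu = (15 + 40) / 2     # a GPU WU packs roughly 15-40 jobs

# Work per WU ratio: how much more modeling work a GPU WU carries
work_ratio = gpu_jobs_per_wu / cpu_jobs_per_wu
print(round(work_ratio, 1))  # → 11.0, consistent with "about 10x"
```

So even at identical WU run times, a GPU host returns roughly an order of magnitude more completed jobs per WU, which is why points per validated GPU WU are about 10 times higher as well.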
----------------------------------------
Mad_Max
Cruncher. Russia. Joined: Nov 26, 2012. Post Count: 22. Status: Offline.
Quote:
"Yesterday I got a lot of GPU WUs all day, but not one today. [...] Can the software only use one GPU at a time?"

I think this is not a limit in the WCG app, but in the BOINC client that manages the apps. By default, BOINC uses just one GPU for crunching (the fastest / most capable one). There is a BOINC option to run GPU apps on all GPUs simultaneously, located in the "cc_config.xml" file in the BOINC data root directory (not the app directory where the executables are, but the data directory):

<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>

But also check the project settings (in your WCG profile) to see whether the Intel GPU is allowed to run; there are three independent switches in the project settings, one for each major GPU vendor (NVIDIA, AMD, and Intel).
----------------------------------------
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Thanks for the answer. Both GPUs are active in the setup, but only the NVIDIA is used. Both GPUs are displayed under Projects > Properties > Scheduling. The difference between CPU and GPU is really great: ~3:15 min/WU with the NVIDIA. Fascinating.

[Edit 1 times, last edit by Former Member at Apr 10, 2021 2:28:44 PM]