World Community Grid Forums
Thread Status: Active | Total posts in this thread: 77
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Yesterday I got a lot of GPU WUs all day, but not one today. I haven't changed anything in the setup. And another question: I have enabled the CPU and both GPUs on my computer, but only the NVIDIA GTX 1650 Super is used; the Intel UHD Graphics 630 is always idle. Can the software only use one GPU at a time?
----------------------------------------
widdershins
Veteran Cruncher. Scotland. Joined: Apr 30, 2007. Post Count: 674. Status: Offline.
Quote:
"I totally understand the ease of support in having to manage and maintain only a single code base, but I don't think you can definitively say that the OpenCL version is faster than the CUDA version if you never even tested it. If you mean 'faster' in terms of faster completion of the entire project by considering a larger user base, and not just that NVIDIA OpenCL is faster than NVIDIA CUDA (which I would strongly contest), then you'd need to weigh the speedup of CUDA against how many AMD and Intel devices would cease to contribute. Do you have any statistics to share about what percentage of total FLOPS the project sees from each device type? Without knowing that, you really can't draw that kind of conclusion."

I think the ease of support is the limiting factor here. Keep in mind that it has taken many, many years to get another GPU project of any flavour running on WCG again. I suspect that the hurdles to porting applications to run efficiently on GPUs are not trivial, or everyone would have done it by now.

Another factor is that even running with OpenCL, and with pauses in GPU usage, there is still not enough GPU work available to meet demand. So arguing that CUDA is better or faster than OpenCL is pointless for this project. Improved performance through better coding or porting to CUDA would take time and resources, but would not process a single extra unit per day, since the volume of units available is already the limiting factor, not the code. WCG and the researchers are both working with tight budgets.

Personally, I feel that if the researchers or WCG have any spare development resources, the maximum benefit would come from increasing the volume of GPU work to fully use the existing GPU capacity, rather than improving the app so we had even more unused capacity.
----------------------------------------
pokemonlover1234
Cruncher. Joined: Mar 4, 2021. Post Count: 26. Status: Offline.
Even though we are limited in the amount of GPU work available, the number of points this project is bringing in has approximately doubled compared to the CPU jobs alone. As much as I wish the full potential of all our available GPUs were being utilized and we had more work to do, what we have been getting is still very significant.
[Edit 2 times, last edit by pokemonlover1234 at Apr 8, 2021 11:53:55 PM]
----------------------------------------
uplinger
Former World Community Grid Tech. Joined: May 23, 2005. Post Count: 3952. Status: Offline.
Quote (uplinger, earlier):
"There are a few reasons the CUDA version was not tested or used. From early on, having only one version to maintain is a lot easier for testing and general support. Thus, the focus on OpenCL allows all three GPU types to participate instead of just one. Due to this focus, the OpenCL version is actually faster than the CUDA version."

Quote (reply):
"I totally understand the ease of support in maintaining a single code base, but I don't think you can definitively say that the OpenCL version is faster than the CUDA version if you never even tested it. [...] If you don't want to complicate your support efforts with two app versions for different device types, like I said, I understand. But the bigger issue a lot of us have is with the constant 0-100% behavior, and that can be better optimized. I hope the team is considering improvements rather than taking the 'it's good enough' approach."

Good evening,

It appears that I have raised a few eyebrows with my statement that OpenCL is faster than the CUDA version. This is in fact true: the CUDA version of the code runs about 20% slower on average, and in some cases up to 2x slower, than the OpenCL version. These are specific numbers given to me by the researchers. My statement wasn't that CUDA isn't faster than OpenCL in general; it was specific to this application. I do not know how much time the researchers allotted to the GPU code, or how much effort they put into the two different paths that are needed.

I do not claim to know how much effort would be needed to bring the CUDA version to the state of the OpenCL one, but in its current state the code runs faster under OpenCL than under CUDA.

As for the "it's good enough" approach, I agree that there are some things that can be improved upon. At the moment, I am making sure this current version works properly. I do have in mind a possible way to keep the GPU running more consistently, but it is a major change to how jobs are submitted and generated. This initial version was created to stay as similar to the CPU version as possible, to keep consistency across the entire pipeline of scripts. Keeping it similar allows us to compare results and groupings to make sure things operate properly. I would also like to correct the checkpointing issue, where the app is not following the checkpoint rules, and possibly find a way to write to disk less frequently. Again, these are not minor changes; they require changes to the code as well as testing from scratch.

Thanks,
-Uplinger
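The checkpoint fix uplinger describes (following the checkpoint rules and writing to disk less often) amounts to rate-limiting writes to a minimum interval. Here is a minimal, hypothetical sketch in Python, not WCG code; a real BOINC app would consult the client (e.g. via boinc_time_to_checkpoint()) rather than its own timer:

```python
import time

class CheckpointThrottle:
    """Rate-limit checkpoint writes to a minimum interval, in the
    spirit of BOINC's rule that an app should only checkpoint when
    enough time has passed since the last write. Hypothetical sketch."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_write = 0.0  # time of last checkpoint (0.0 = never)

    def should_checkpoint(self, now=None):
        """Return True (and reset the timer) if a write is due."""
        if now is None:
            now = time.monotonic()
        if now - self.last_write >= self.min_interval_s:
            self.last_write = now
            return True
        return False
```

The app would call should_checkpoint() at each safe point in the job loop and skip the disk write whenever it returns False, which directly reduces write frequency.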
----------------------------------------
Ian-n-Steve C.
Senior Cruncher. United States. Joined: May 15, 2020. Post Count: 180. Status: Offline.
Quote (uplinger, in the post above):
"The CUDA version of the code runs about 20% slower on average and in some cases up to 2x slower than the OpenCL version. These are specific numbers given to me from the researchers. [...]"

I'd have to guess that they were maybe using something misconfigured, or an old CUDA version, or that the CUDA code for whatever reason wasn't as optimized as the OpenCL variant (this can happen; the stock CUDA apps on SETI were slower than OpenCL, but they were very old, using something like CUDA 5.0). CUDA 10+ on any more modern GPU would make a clear difference. But without specifics on how they compiled the application or what hardware they were testing on, I can't speculate further.

Check my previous post about adding a mutex to the app. This should be doable and would minimize the dwell time between jobs. It would benefit all devices, since the start-stop behavior doesn't seem to be specific to NVIDIA GPUs. From what others have mentioned, this happens because multiple jobs are prepackaged into a single WU.

If that is the case, they could use something like a mutex lock to preload data and prepare the next computation while the current one is ongoing. This would allow the GPU to remain at near 100% load the entire time. We did this with the SETI CUDA app and recorded as little as 1 ms (probably the limit of our ability to measure) between one WU and the next.

----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Edit 1 times, last edit by Ian-n-Steve C. at Apr 9, 2021 1:58:27 AM]
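The overlap Ian describes (stage the next job's data while the current one computes, so the device never idles between jobs) can be sketched with a loader thread feeding a bounded queue. This is a generic illustration with placeholder load/compute functions, not the SETI or WCG implementation:

```python
import threading
import queue

def prefetch_pipeline(jobs, load, compute, prefetch_depth=2):
    """Overlap data loading with computation: a background thread
    stages upcoming jobs into a bounded queue so the consumer (the
    'GPU' here) never waits on I/O between jobs. load/compute are
    placeholder callables standing in for real staging and kernels."""
    staged = queue.Queue(maxsize=prefetch_depth)
    DONE = object()  # unique sentinel marking end of work

    def loader():
        for job in jobs:
            staged.put(load(job))   # blocks when the queue is full
        staged.put(DONE)

    threading.Thread(target=loader, daemon=True).start()

    results = []
    while True:
        item = staged.get()
        if item is DONE:
            break
        results.append(compute(item))
    return results
```

The bounded queue plays the role of the mutex-guarded handoff: the loader can run at most prefetch_depth jobs ahead, so memory stays bounded while the compute side always finds the next input already staged.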
----------------------------------------
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Quote:
"I hope the team is considering improvements rather than taking the 'it's good enough' approach."

Confucius: "Better a diamond with a flaw than a pebble without..."
----------------------------------------
Mad_Max
Cruncher. Russia. Joined: Nov 26, 2012. Post Count: 22. Status: Offline.
Quote:
"Congrats to the WCG admins, developers, scientists and beta testers for making this possible. The results are already being seen with an 18% increase (today) in returned results. Hope it continues to grow. Luck to all."

Actually, it is not just an ~18% increase. Total computing throughput has already more than doubled since the GPU app launched, even at the current limited scale of GPU use, because each GPU WU includes about 10 times more modeling work than a regular CPU WU. You can see this in the detailed WU logs: 15-40 jobs completed per GPU WU, compared to 1-4 jobs per CPU WU. You can also see it in the points granted per validated GPU WU, which are likewise about 10 times higher than for regular CPU WUs.

So anyone who thinks that, with the same execution time on CPU and GPU, there is no point in using the GPU at all (just a waste of resources) is wrong: even with equal execution times (a really slow GPU against a fast CPU), the GPU produces about 10 times more useful work. And on a decent GPU it is at least 100 times more work than one CPU core.

This also explains why the current GPU batches look small: they are not actually small at all. Each GPU batch contains about the same amount of work as a CPU batch (or even more), just packed into far fewer WUs, to avoid very short WU run times and to reduce WU-management overhead.

It also explains the frequent 0-100-0% GPU load swings: it is not poor optimization of the app itself. One full job inside a WU completes in about a dozen seconds on a modern GPU, and there is a gap while one job finishes and the next one starts. This happens not just at the end of each WU, but many times during the crunching of a single WU.

[Edit 2 times, last edit by Mad_Max at Apr 9, 2021 10:29:57 PM]
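Mad_Max's "about 10 times more work per WU" claim follows directly from the jobs-per-WU figures he cites. A back-of-the-envelope check (illustrative arithmetic from his numbers, not official WCG statistics):

```python
# Midpoints of the jobs-per-WU ranges Mad_Max quotes from the WU logs
cpu_jobs_per_wu = (1 + 4) / 2       # a CPU WU packs roughly 1-4 jobs
gpu_jobs_per_wu = (15 + 40) / 2     # a GPU WU packs roughly 15-40 jobs

# Work per WU ratio: how much more modeling work a GPU WU carries
work_ratio = gpu_jobs_per_wu / cpu_jobs_per_wu
print(round(work_ratio, 1))  # → 11.0, consistent with "about 10x"
```

So even at identical WU run times, a GPU host returns roughly an order of magnitude more completed jobs per WU, which is why points per validated GPU WU are about 10 times higher as well.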
----------------------------------------
Mad_Max
Cruncher. Russia. Joined: Nov 26, 2012. Post Count: 22. Status: Offline.
Quote:
"Yesterday I got a lot of GPU WUs all day, but not one today. [...] Can the software only use one GPU at a time?"

I think this is not a limit in the WCG app, but in the BOINC client that manages the apps. By default, BOINC uses just one GPU for crunching (the fastest / most capable one). There is a BOINC option to run GPU apps on all GPUs simultaneously, located in the "cc_config.xml" file in the BOINC data root directory (not the app directory where the executables are, but the data directory):

<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>

But also check the project settings (in your WCG profile) to see whether the Intel GPU is allowed to run; there are three independent switches in the project settings, one for each major GPU vendor (NVIDIA, AMD, and Intel).
----------------------------------------
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Thanks for the answer. Both GPUs are active in the setup, but only the NVIDIA is used. Both GPUs are displayed under Projects > Properties > Scheduling. The difference between CPU and GPU is really great: ~3:15 min/WU with the NVIDIA. Fascinating.

[Edit 1 times, last edit by Former Member at Apr 10, 2021 2:28:44 PM]