World Community Grid Forums
Thread Status: Active | Total posts in this thread: 781
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster. Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as earlier batches, which is what elevates the run time. I think the admins need to take a closer look at these batches.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
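To put numbers on the "lower GPU utilization" observation, here is a minimal monitoring sketch, assuming an NVIDIA GPU with nvidia-smi on the PATH. The one-second interval and the gpu_usage.csv file name are arbitrary choices for illustration, not anything WCG or BOINC requires; stop it with Ctrl+C and compare logs between the old and new batches.

```python
# Minimal sketch: poll nvidia-smi once a second and log GPU utilization and
# power draw, so the low-utilization batches can be compared against older ones.
# Assumes an NVIDIA GPU with nvidia-smi available; interval and output file
# name are arbitrary. Stop with Ctrl+C.
import csv
import subprocess
import time

INTERVAL_S = 1.0          # polling interval (arbitrary)
OUTPUT = "gpu_usage.csv"  # log file (arbitrary)

with open(OUTPUT, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_util_percent", "power_draw_w"])
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        for line in out.splitlines():  # one line per GPU
            util, power = [x.strip() for x in line.split(",")]
            writer.writerow([time.time(), util, power])
        f.flush()
        time.sleep(INTERVAL_S)
```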
spRocket
Senior Cruncher | Joined: Mar 25, 2020 | Post Count: 274 | Status: Offline
I'm also seeing the 15xxx units starting to validate. I've seen 1582-1654 BOINC credits for the ones that have validated so far. No invalids since 4/25, good to see.
Jason Jung
Cruncher | Joined: May 15, 2014 | Post Count: 5 | Status: Offline
> I am getting some very weird behavior here. The GPU seems to be barely loaded. I have a Ryzen 7 5800H (already running other tasks on both the iGPU and the CPU) and an RTX 3060. Is this normal? Thanks! https://i.ibb.co/7VDLfRR/Screenshot-2021-04-27-165859.png

In Windows Task Manager you need to look at CUDA, not 3D, to see how much load these work units are producing.
zdnko
Senior Cruncher | Joined: Dec 1, 2005 | Post Count: 225 | Status: Offline
Scheduler request failed: Couldn't connect to server
Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.
nanoprobe
Master Cruncher | Classified | Joined: Aug 29, 2008 | Post Count: 2998 | Status: Offline
> I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster. Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as earlier batches, which is what elevates the run time. I think the admins need to take a closer look at these batches.

It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time of the total run time, depending on how long they run. It's especially noticeable when the task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.
----------------------------------------
In 1969 I took an oath to defend and protect the U.S. Constitution against all enemies, both foreign and domestic. There was no expiration date.
[Edited 1 time; last edit by nanoprobe at Apr 27, 2021 5:33:36 PM]
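For readers who want to reproduce that ratio on their own tasks, here is a back-of-the-envelope sketch of the arithmetic, assuming the CPU time and elapsed time are read from BOINC Manager's task properties for a finished task. The example numbers below are illustrative only, chosen to mirror the roughly 20% versus 60% observation; they are not measurements from these batches.

```python
# Illustrative sketch: what fraction of a task's run time was spent on the CPU.
# cpu_time_s and run_time_s would come from a finished task's properties in
# BOINC Manager; the numbers below are made up to mirror ~20% vs ~60%.
def cpu_fraction(cpu_time_s: float, run_time_s: float) -> float:
    """Return CPU time as a percentage of total elapsed run time."""
    return 100.0 * cpu_time_s / run_time_s

# Hypothetical 4-digit batch task: 5 min CPU time over a 28 min run.
print(f"{cpu_fraction(5 * 60, 28 * 60):.0f}% CPU")   # ~18%
# Hypothetical 5-digit batch task: 25 min CPU time over a 42 min run.
print(f"{cpu_fraction(25 * 60, 42 * 60):.0f}% CPU")  # ~60%
```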
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
> > Does the continuous disk writing I see people complaining about happen for temporary files? I'm not experiencing it, but I have my temporary folder (tmp) mounted in a ramdisk through tmpfs, and I only see the occasional "dumps" of data on the SSD every few minutes. Could this be it?
> >
> > I wouldn't worry about it. The fears of SSD writes are largely FUD and blown out of proportion. Most modern SSDs can handle PETABYTES of writes before failure is a concern, and they have more advanced wear leveling than earlier SSDs. That's continuous writing for 10+ years in most cases, and real-world use will be far below that.
>
> Sorry Ian-n-Steve, I am going to have to disagree. It is not FUD or blown out of proportion. I think there is a very wide distribution of total-writes ratings across SSDs. Using the Samsung 860 2TB and Crucial MX500 2TB as examples:
>
> Total Bytes Written ratings: Samsung 2400 TB (2.4 PB); Crucial 700 TB.
> Maximum MB/s to reach 10 years: Samsung ~7.61; Crucial ~2.22.
>
> Some people are seeing write rates higher than either of those SSDs is rated to sustain for 10 years. I do not currently have server-grade SSDs like nanoprobe has. There is a reason I run BOINC off a ramdisk these days, and it is not FUD.

Those ratings are for the purposes of warranty coverage, not expected failure. When I ran SETI (which ran tasks as fast as or faster than here, with just as much SSD writing: downloading new WUs, writing results data, deleting old files, repeat), I ran that way for YEARS on low-end, small 80-120 GB SSDs. None of them ever failed. A SATA cable failed once, though.

It's fine if you want to run a RAM drive; that's your personal choice. But just because you decided to do that doesn't mean the boogeyman is real. It's not necessary, and I don't believe an SSD will actually fail from the loads here at WCG. Any computer component is prone to early failure; I've had SSDs fail before, and they were all well below their rated TBW and MTBF values. None of them ever got enough writes to reach that point.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
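For reference, the ~7.61 and ~2.22 MB/s figures quoted above follow directly from the TBW ratings: spread the rated Total Bytes Written evenly over a 10-year span and see what sustained write rate it allows. A quick sketch of that arithmetic, using the TBW values cited in the post (decimal units, 365-day years) and treating them as warranty ratings rather than failure points:

```python
# Quick sketch of the endurance arithmetic: spread a drive's rated Total Bytes
# Written (TBW) evenly over 10 years to get the sustained write rate that
# would exhaust the rating. TBW figures are the ones cited in the post above.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 365-day years, matching the quoted figures

def sustained_mb_per_s(tbw_terabytes: float, years: float = 10.0) -> float:
    """Sustained MB/s that would consume the TBW rating in `years` years."""
    total_mb = tbw_terabytes * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / (years * SECONDS_PER_YEAR)

print(f"Samsung 860 2TB (2400 TBW):  {sustained_mb_per_s(2400):.2f} MB/s")  # ~7.61
print(f"Crucial MX500 2TB (700 TBW): {sustained_mb_per_s(700):.2f} MB/s")   # ~2.22
```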
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
> > I think these 13,000+ batches might actually be malformed. [...] I think the admins need to take a closer look at these batches.
>
> It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time of the total run time, depending on how long they run. It's especially noticeable when the task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.

Like I said: more GPU idle time. The GPU keeps the CPU engaged whether it's doing something or not, so with less of the GPU being used, the ratio of CPU time increases.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
> Scheduler request failed: Couldn't connect to server
> Project communication failed: attempting access to reference site
> Internet access OK - project servers may be temporarily down.

Are you getting this multiple times in a row, or once and then it works? The reason I ask is that we are updating the load balancer, which requires it to restart. If you are getting it consistently, then we may need to look into the problem more.

Thanks,
-Uplinger
Maxxina
Advanced Cruncher | Joined: Jan 5, 2008 | Post Count: 124 | Status: Offline
Uplinger: so how much stuff have we finished in almost 1 day of crunching? :)
Mars4i
Cruncher | Joined: Jan 4, 2019 | Post Count: 5 | Status: Offline
Hi. I also see deviating behavior with the 13000 series; I have a lot of them in the 13510 series. I run one GPU case at a time (besides regular WCG CPU cases).
The 13510 series keeps one core of my AMD Ryzen 1700 busy full time, while my NVIDIA GTX 1060 only runs sometimes, so it's CPU-limited. One case now takes 14 minutes (older cases took about 4-5 minutes). That's no problem for me, but for all of these cases I see no wingman case; they all get validated with about 1600 points. I don't care about the points, but I do care whether this is useful work. With no wingman checking the results, can I be sure these results are indeed valid/useful?