Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Re: OpenPandemics - GPU Stress Test

I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster.

Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as the earlier batches did, which is what is elevating the run time.

I think the admins need to take a closer look at these batches.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 27, 2021 5:04:32 PM]
spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 274
Re: OpenPandemics - GPU Stress Test

I'm also seeing the 15xxx units starting to validate. I've seen 1582-1654 BOINC credits for the ones that have validated so far. No invalids since 4/25, which is good to see.
[Apr 27, 2021 5:12:17 PM]
Jason Jung
Cruncher
Joined: May 15, 2014
Post Count: 5
Re: OpenPandemics - GPU Stress Test

I am getting some very weird behavior here. The GPU seems to be only lightly loaded. I have a Ryzen 7 5800H (already running other tasks on both the iGPU and CPU) and an RTX 3060. Is this normal? Thanks!
https://i.ibb.co/7VDLfRR/Screenshot-2021-04-27-165859.png


In Windows Task Manager you need to look at the CUDA graph, not 3D, to see how much load these work units are producing.
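
If you want a number that doesn't depend on which Task Manager graph is selected, the NVIDIA driver's nvidia-smi utility reports GPU utilization and power draw directly. A minimal sketch (assuming nvidia-smi is on the PATH; the query works the same on Windows and Linux):

import subprocess

# Ask the NVIDIA driver for per-GPU utilization and power draw.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=utilization.gpu,power.draw",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # one line per GPU, e.g. "23 %, 87.45 W"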
[Apr 27, 2021 5:31:30 PM]
zdnko
Senior Cruncher
Joined: Dec 1, 2005
Post Count: 225
Re: OpenPandemics - GPU Stress Test

Scheduler request failed: Couldn't connect to server
Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.
[Apr 27, 2021 5:32:03 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: OpenPandemics - GPU Stress Test

I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster.

Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as the earlier batches did, which is what is elevating the run time.

I think the admins need to take a closer look at these batches.

It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time, depending on how long they run. It's especially noticeable when a task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


----------------------------------------
[Edited 1 time, last edit by nanoprobe at Apr 27, 2021 5:33:36 PM]
[Apr 27, 2021 5:32:26 PM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Re: OpenPandemics - GPU Stress Test

Does the continuous disk writing I see people complaining about happen in temporary files? I'm not experiencing it, but I have my temporary folder (tmp) mounted as a ramdisk through tmpfs, and I only see the occasional "dumps" of data on the SSD every few minutes. Could that be it?


I wouldn't worry about it. The fears of SSD writes are largely FUD and blown out of proportion. Most modern SSDs can handle PETABYTES of writes before failure is a concern, and they have more advanced wear leveling than earlier SSDs. That's continuous writing for 10+ years in most cases, and real-world use will be far below that.

Sorry Ian-n-Steve, I am going to have to disagree. It is not FUD or blown out of proportion. I think there is a very wide distribution of total-writes ratings across SSDs. Using the Samsung 860 2TB and Crucial MX500 2TB as examples:
Total bytes written (TBW) ratings:
Samsung: 2400 TB (2.4 PB)
Crucial: 700 TB
Maximum sustained write rate (MB/s) at which it still takes 10 years to reach those ratings:
Samsung: ~7.61
Crucial: ~2.22

Some people are seeing higher write rates than either of those SSDs is rated to absorb over 10 years. I do not currently have server-grade SSDs like nanoprobe has. There is a reason I run BOINC off a ramdisk these days, and it is not FUD.
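
For reference, the arithmetic behind those MB/s figures is just the rated TBW spread evenly over ten years. A minimal sketch (only the two TBW ratings quoted above are taken from the post; decimal megabytes are assumed):

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def sustained_mb_per_s(tbw_terabytes, years=10.0):
    # Average write rate (MB/s) that would consume the rated TBW in `years`.
    total_megabytes = tbw_terabytes * 1_000_000  # 1 TB = 1,000,000 MB
    return total_megabytes / (years * SECONDS_PER_YEAR)

for name, tbw in [("Samsung 860 2TB", 2400), ("Crucial MX500 2TB", 700)]:
    print(name, round(sustained_mb_per_s(tbw), 2), "MB/s")
# prints roughly 7.61 and 2.22 MB/s, matching the figures above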


Those ratings are for the purposes of warranty coverage, not expected failure.

When I ran SETI (which ran tasks as fast as or faster than here, with just as much SSD writing: downloading new WUs, writing results data, deleting old files, repeat), I ran that way for YEARS on low-end, small 80-120GB SSDs. None of them ever failed. A SATA cable failed once, though.

It's fine if you want to run a RAM drive; that's your personal choice. But just because you decided to do that doesn't mean the boogeyman is real. It's not necessary. I don't believe an SSD will actually fail from the loads here at WCG. Any computer component is prone to early failure, and I've had SSDs fail before, but they were all well below their rated TBW and MTBF values; none of them ever got enough writes to reach that point.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 27, 2021 5:42:23 PM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Re: OpenPandemics - GPU Stress Test

I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster.

Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as the earlier batches did, which is what is elevating the run time.

I think the admins need to take a closer look at these batches.

It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time, depending on how long they run. It's especially noticeable when a task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.


Like I said, more GPU idle time. The GPU keeps the CPU engaged whether it's doing something or not, so with less of the GPU being used, the ratio of CPU time to run time increases.
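
To put numbers on that, here is a toy model; the figures are made up, chosen only to roughly match the 20%/60% shares and the ~5 vs ~14 minute run times reported in this thread:

def cpu_share(cpu_setup_min, gpu_busy_min, gpu_idle_min, poll_fraction=0.9):
    # CPU time divided by total run time, assuming the host thread also
    # burns CPU for most of the time it spends polling an idle GPU.
    total = cpu_setup_min + gpu_busy_min + gpu_idle_min
    cpu = cpu_setup_min + poll_fraction * gpu_idle_min
    return cpu / total

print(cpu_share(1.0, 4.0, 0.0))  # earlier batch, ~5 min run:  0.20
print(cpu_share(3.0, 4.0, 7.0))  # 13xxx batch, ~14 min run:   ~0.66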
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 27, 2021 5:47:11 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Re: OpenPandemics - GPU Stress Test

Scheduler request failed: Couldn't connect to server
Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.


Are you getting this multiple times in a row, or once and then it works? The reason I ask is that we are updating the load balancer, which requires it to restart. If you are getting it consistently, then we may need to look into the problem more.

Thanks,
-Uplinger
[Apr 27, 2021 5:49:06 PM]
Maxxina
Advanced Cruncher
Joined: Jan 5, 2008
Post Count: 124
Re: OpenPandemics - GPU Stress Test

Uplinger: so how much work have we finished in almost one day of crunching? :)
[Apr 27, 2021 5:51:18 PM]
Mars4i
Cruncher
Joined: Jan 4, 2019
Post Count: 5
Re: OpenPandemics - GPU Stress Test

Hi. I also see deviating behavior with the 13000 series. I have a lot of them from the 13510 series. I run one GPU task at a time (besides regular WCG CPU tasks).

The 13510 series keeps one core of my AMD Ryzen 1700 busy full time. My NVIDIA GTX 1060 only runs some of the time, so the tasks are CPU limited. One task now takes 14 minutes (older tasks took about 4-5 minutes).

None of that is a problem for me. But for all these tasks I see no wingman task; they all get validated with about 1600 points. I don't care about the points, but I do care whether this is useful work. With no wingman checking the results, can I be sure these results are indeed valid/useful?
[Apr 27, 2021 5:54:30 PM]