World Community Grid Forums
Thread Status: Active | Total posts in this thread: 781
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster. Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as earlier batches, which is what elevates the run time. I think the admins need to take a closer look at these batches.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
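To put numbers on the "lower GPU utilization" observation, here is a minimal monitoring sketch, assuming an NVIDIA GPU with nvidia-smi on the PATH. The one-second interval and the gpu_usage.csv file name are arbitrary choices for illustration, not anything WCG or BOINC requires; stop it with Ctrl+C and compare logs between the old and new batches.

```python
# Minimal sketch: poll nvidia-smi once a second and log GPU utilization and
# power draw, so the low-utilization batches can be compared against older ones.
# Assumes an NVIDIA GPU with nvidia-smi available; interval and output file
# name are arbitrary. Stop with Ctrl+C.
import csv
import subprocess
import time

INTERVAL_S = 1.0          # polling interval (arbitrary)
OUTPUT = "gpu_usage.csv"  # log file (arbitrary)

with open(OUTPUT, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_util_percent", "power_draw_w"])
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        for line in out.splitlines():  # one line per GPU
            util, power = [x.strip() for x in line.split(",")]
            writer.writerow([time.time(), util, power])
        f.flush()
        time.sleep(INTERVAL_S)
```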
spRocket
Senior Cruncher | Joined: Mar 25, 2020 | Post Count: 274 | Status: Offline
I'm also seeing the 15xxx units starting to validate. I've seen 1582-1654 BOINC credits for the ones that have validated so far. No invalids since 4/25, good to see.
Jason Jung
Cruncher | Joined: May 15, 2014 | Post Count: 5 | Status: Offline
> I am getting some very weird behavior here. The GPU seems to be barely loaded. I have a Ryzen 7 5800H (already running other tasks on both the iGPU and the CPU) and an RTX 3060. Is this normal? Thanks! https://i.ibb.co/7VDLfRR/Screenshot-2021-04-27-165859.png

In Windows Task Manager you need to look at CUDA, not 3D, to see how much load these work units are producing.
zdnko
Senior Cruncher | Joined: Dec 1, 2005 | Post Count: 225 | Status: Offline
Scheduler request failed: Couldn't connect to server
Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.
nanoprobe
Master Cruncher | Classified | Joined: Aug 29, 2008 | Post Count: 2998 | Status: Offline
> I think these 13,000+ batches might actually be malformed. The behavior is quite different from the earlier batches, and uplinger suggested that he thought they should actually run faster. Maybe some incorrect setting or parameter is causing too much GPU idle time. As others have noticed, GPU utilization and power consumption are WAY lower than in previous batches. It's not that they're using "more CPU" (keeping a CPU core engaged for NVIDIA with OpenCL is normal behavior, and that's all the CPU use is); they're just not using the GPU as much as earlier batches, which is what elevates the run time. I think the admins need to take a closer look at these batches.

It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time of the total run time, depending on how long they run. It's especially noticeable when the task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.
----------------------------------------
In 1969 I took an oath to defend and protect the U.S. Constitution against all enemies, both foreign and domestic. There was no expiration date.
[Edited 1 time; last edit by nanoprobe at Apr 27, 2021 5:33:36 PM]
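For readers who want to reproduce that ratio on their own tasks, here is a back-of-the-envelope sketch of the arithmetic, assuming the CPU time and elapsed time are read from BOINC Manager's task properties for a finished task. The example numbers below are illustrative only, chosen to mirror the roughly 20% versus 60% observation; they are not measurements from these batches.

```python
# Illustrative sketch: what fraction of a task's run time was spent on the CPU.
# cpu_time_s and run_time_s would come from a finished task's properties in
# BOINC Manager; the numbers below are made up to mirror ~20% vs ~60%.
def cpu_fraction(cpu_time_s: float, run_time_s: float) -> float:
    """Return CPU time as a percentage of total elapsed run time."""
    return 100.0 * cpu_time_s / run_time_s

# Hypothetical 4-digit batch task: 5 min CPU time over a 28 min run.
print(f"{cpu_fraction(5 * 60, 28 * 60):.0f}% CPU")   # ~18%
# Hypothetical 5-digit batch task: 25 min CPU time over a 42 min run.
print(f"{cpu_fraction(25 * 60, 42 * 60):.0f}% CPU")  # ~60%
```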
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
> > Does the continuous disk writing I see people complaining about happen for temporary files? I'm not experiencing it, but I have my temporary folder (tmp) mounted in a ramdisk through tmpfs, and I only see the occasional "dumps" of data on the SSD every few minutes. Could this be it?
> >
> > I wouldn't worry about it. The fears of SSD writes are largely FUD and blown out of proportion. Most modern SSDs can handle PETABYTES of writes before failure is a concern, and they have more advanced wear leveling than earlier SSDs. That's continuous writing for 10+ years in most cases, and real-world use will be far below that.
>
> Sorry Ian-n-Steve, I am going to have to disagree. It is not FUD or blown out of proportion. I think there is a very wide distribution of total-writes ratings across SSDs. Using the Samsung 860 2TB and Crucial MX500 2TB as examples:
>
> Total Bytes Written ratings: Samsung 2400 TB (2.4 PB); Crucial 700 TB.
> Maximum MB/s to reach 10 years: Samsung ~7.61; Crucial ~2.22.
>
> Some people are seeing write rates higher than either of those SSDs is rated to sustain for 10 years. I do not currently have server-grade SSDs like nanoprobe has. There is a reason I run BOINC off a ramdisk these days, and it is not FUD.

Those ratings are for the purposes of warranty coverage, not expected failure. When I ran SETI (which ran tasks as fast as or faster than here, with just as much SSD writing: downloading new WUs, writing results data, deleting old files, repeat), I ran that way for YEARS on low-end, small 80-120 GB SSDs. None of them ever failed. A SATA cable failed once, though.

It's fine if you want to run a RAM drive; that's your personal choice. But just because you decided to do that doesn't mean the boogeyman is real. It's not necessary, and I don't believe an SSD will actually fail from the loads here at WCG. Any computer component is prone to early failure; I've had SSDs fail before, and they were all well below their rated TBW and MTBF values. None of them ever got enough writes to reach that point.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
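For reference, the ~7.61 and ~2.22 MB/s figures quoted above follow directly from the TBW ratings: spread the rated Total Bytes Written evenly over a 10-year span and see what sustained write rate it allows. A quick sketch of that arithmetic, using the TBW values cited in the post (decimal units, 365-day years) and treating them as warranty ratings rather than failure points:

```python
# Quick sketch of the endurance arithmetic: spread a drive's rated Total Bytes
# Written (TBW) evenly over 10 years to get the sustained write rate that
# would exhaust the rating. TBW figures are the ones cited in the post above.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 365-day years, matching the quoted figures

def sustained_mb_per_s(tbw_terabytes: float, years: float = 10.0) -> float:
    """Sustained MB/s that would consume the TBW rating in `years` years."""
    total_mb = tbw_terabytes * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / (years * SECONDS_PER_YEAR)

print(f"Samsung 860 2TB (2400 TBW):  {sustained_mb_per_s(2400):.2f} MB/s")  # ~7.61
print(f"Crucial MX500 2TB (700 TBW): {sustained_mb_per_s(700):.2f} MB/s")   # ~2.22
```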
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
> > I think these 13,000+ batches might actually be malformed. [...] I think the admins need to take a closer look at these batches.
>
> It's most notable on AMD GPUs. The 4-digit batches rarely had more than 20% of the total run time as CPU time. The 5-digit batches have up to 60% CPU time of the total run time, depending on how long they run. It's especially noticeable when the task starts: the first 3-4 minutes are CPU time before it hands the task off to the GPU. It may not be optimal, but it sure helped put an end to the 10%+ rate of invalids for me.

Like I said: more GPU idle time. The GPU keeps the CPU engaged whether it's doing something or not, so with less of the GPU being used, the ratio of CPU time increases.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
> Scheduler request failed: Couldn't connect to server
> Project communication failed: attempting access to reference site
> Internet access OK - project servers may be temporarily down.

Are you getting this multiple times in a row, or once and then it works? The reason I ask is that we are updating the load balancer, which requires it to restart. If you are getting it consistently, then we may need to look into the problem more.

Thanks,
-Uplinger
Maxxina
Advanced Cruncher | Joined: Jan 5, 2008 | Post Count: 124 | Status: Offline
Uplinger: so how much stuff have we finished in almost 1 day of crunching? :)
Mars4i
Cruncher | Joined: Jan 4, 2019 | Post Count: 5 | Status: Offline
Hi. I also see deviating behavior with the 13000 series; I have a lot of them in the 13510 series. I run one GPU case at a time (besides regular WCG CPU cases).
The 13510 series keeps one core of my AMD Ryzen 1700 busy full time, while my NVIDIA GTX 1060 only runs sometimes, so it's CPU-limited. One case now takes 14 minutes (older cases took about 4-5 minutes). That's no problem for me, but for all of these cases I see no wingman case; they all get validated with about 1600 points. I don't care about the points, but I do care whether this is useful work. With no wingman checking the results, can I be sure these results are indeed valid/useful?