World Community Grid - View Thread - CPU work for OpenPandemics stopped

World Community Grid Forums

Category: Active Research

Forum: OpenPandemics - COVID-19 Project

Thread: CPU work for OpenPandemics stopped


Thread Status: Active
Thread Type: Sticky Thread
Total posts in this thread: 27


This topic has been viewed 9568 times and has 26 replies

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding

90 day badge for Human Proteome Folding - Phase 2

45 day badge for Help Cure Muscular Dystrophy - Phase 2

90 day badge for Computing for Clean Water

14 day badge for Uncovering Genome Mysteries

45 day badge for Outsmart Ebola Together

180 day badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

180 day badge for OpenPandemics - COVID-19


CPU work for OpenPandemics stopped

There appears to be an issue with recent jobs causing excessive memory consumption on end-user machines. We have stopped sending work for these tasks while we investigate the issue.

----------------------------------------
[Edit 1 times, last edit by knreed at Oct 7, 2021 4:17:47 PM]

[Oct 5, 2021 4:22:10 PM]

rcthardcore
Cruncher
United States
Joined: Jan 29, 2009
Post Count: 13
Status: Offline
Project Badges:

45 day badge for OpenPandemics - COVID-19


Re: CPU work for OpenPandemics stopped

The same problem is present on the Mapping Cancer Markers project too. You might want to check into this.

----------------------------------------

AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE
128 GB DDR4-3200
Windows 10 64-bit 21H1

[Oct 5, 2021 8:34:12 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:


Re: CPU work for OpenPandemics stopped

The direction that we are going to take with this is as follows:

- For workunits that have copies already sent out, we are changing the resource configuration for memory to reflect the much larger RAM used (either 4 GB or 1.5 GB depending on the specific job). We are also boosting the disk size to accommodate the extra size. This will allow BOINC to send resends only to computers that can handle them (it will also keep multiple copies from running at once).
- For the other jobs, we are going to rebuild them but limit them to 20 jobs per workunit so that this issue doesn't consume massive memory. This will result in a period of quick jobs with a slightly larger than normal memory footprint, but it should not cause issues like these have.
- Longer term, the memory leak will need to be fixed (it is 14 MB per job within the workunit).
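
To put rough numbers on the leak described above, here is a minimal sketch. It assumes the leak accumulates linearly at the quoted 14 MB per job and ignores the task's baseline memory footprint; the function names are hypothetical and are not part of any World Community Grid or BOINC code.

```python
# Rough arithmetic for the leak quoted in the post (14 MB per job).
# Assumption: the leak grows linearly with the number of jobs in a workunit,
# and the task's baseline memory use is ignored.

LEAK_MB_PER_JOB = 14  # figure taken from the post


def leaked_mb(jobs_per_workunit: int, leak_mb_per_job: float = LEAK_MB_PER_JOB) -> float:
    """Memory leaked (in MB) by the time a workunit finishes all of its jobs."""
    return jobs_per_workunit * leak_mb_per_job


def max_jobs_within(budget_mb: float, leak_mb_per_job: float = LEAK_MB_PER_JOB) -> int:
    """Largest number of jobs whose accumulated leak stays within budget_mb."""
    return int(budget_mb // leak_mb_per_job)


if __name__ == "__main__":
    # Workunits rebuilt with the 20-job limit mentioned in the post.
    print(f"Leak for a 20-job workunit: {leaked_mb(20)} MB")
    # How many jobs it would take for the leak alone to reach a given budget.
    print(f"Jobs whose leak fits in 1500 MB: {max_jobs_within(1500)}")
```

Under these assumptions, a 20-job workunit would leak about 280 MB by the time it finishes, which is consistent with the "slightly larger than normal" footprint described in the list.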

Testing the changes and getting things in place will take a bit of time, so CPU work will be stopped for the next 24-48 hours while we prepare and test.

[Oct 6, 2021 1:58:52 AM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2195
Status: Recently Active
Project Badges:

14 day badge for FightAIDS@Home - Phase 2

90 day badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: CPU work for OpenPandemics stopped

Thanks knreed for the update.
Release the Kraken :D

[Oct 6, 2021 2:16:22 AM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:


Re: CPU work for OpenPandemics stopped

Release the Kraken :D

The krakens have been released. ;)

We have re-enabled sending work for OpenPandemics CPU. However, at this time the only jobs available are the repair jobs for the workunits that require large memory. The workunits have been modified to state that they need either 4 GB or 1.5 GB depending on the job. This will ensure that they run successfully.

There are only 48,000 of these at the moment, but given the high requirements I don't know how long it will take to distribute them.

We continue to work on recreating the other batches with limits to prevent the high memory use.

[Oct 6, 2021 9:31:46 PM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2195
Status: Recently Active
Project Badges:


Re: CPU work for OpenPandemics stopped

Thanks for the Kraken, knreed. :)

So far, I have only received resends of ordinary "No Reply" or "Error" tasks from batches earlier than the problem batches.

[Oct 6, 2021 10:07:40 PM]

BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

180 day badge for Help Cure Muscular Dystrophy

180 day badge for Discovering Dengue Drugs - Together

1 year badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

2 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project

20 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: CPU work for OpenPandemics stopped

Release the Kraken :D

The krakens have been released.

We have re-enabled sending work for OpenPandemics CPU. However, at this time the only jobs available are the repair jobs for the workunits that require large memory. The workunits have been modified to state that they need either 4 GB or 1.5 GB depending on the job. This will ensure that they run successfully.

There are only 48,000 of these at the moment, but given the high requirements I don't know how long it will take to distribute them.

We continue to work on recreating the other batches with limits to prevent the high memory use.

I thought we were talking about more GPU WUs. :(

----------------------------------------

MyCity

[Oct 6, 2021 11:09:43 PM]

mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 77
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Computing for Clean Water

14 day badge for Drug Search for Leishmaniasis

14 day badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

20 year badge for Mapping Cancer Markers

180 day badge for Smash Childhood Cancer

14 day badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: CPU work for OpenPandemics stopped

The workset is apparently still too large for the Raspberry Pi 400. The server sends a message saying that there is not enough memory (3814.70 MB RAM needed but only 3455.31 MB available). Note that this is an ARM 64 machine, but it is running a 32-bit OS.

Something still needs to be fixed. I have been running this project for several months on my Raspberry Pi. It stopped when this forum entry was created.

----------------------------------------
[Edit 1 times, last edit by mwroggenbuck at Oct 6, 2021 11:56:51 PM]

[Oct 6, 2021 11:52:48 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:


Re: CPU work for OpenPandemics stopped

Actually, that is exactly what we want to have happen. When I wrote:

We have re-enabled sending work for OpenPandemics CPU. However, at this time the only jobs available are the repair jobs for the workunits that require large memory. The workunits have been modified to state that they need either 4 GB or 1.5 GB depending on the job. This will ensure that they run successfully.

What this means is that the memory limit for these jobs has been reset so that they will not be sent to computers with insufficient memory to handle them. Your Raspberry Pi is protected from these jobs because of this limit.
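
The effect of the memory limit can be illustrated with a small sketch. This is not BOINC's actual scheduler code: `can_send` is a hypothetical helper, and treating the 4 GB requirement as 4e9 bytes is an assumption. That assumption is made because 4e9 bytes is about 3814.70 MB when 1 MB is counted as 1024 × 1024 bytes, which matches the figure in the server message quoted earlier in the thread.

```python
# Simplified illustration of a scheduler-side memory check.
# Assumptions (not confirmed by the thread): the 4 GB requirement is stored
# as 4e9 bytes, and the check is a plain "required <= available" comparison.

MIB = 1024 * 1024  # bytes per MB as used in the server message (assumed)


def to_mib(n_bytes: float) -> float:
    """Convert a byte count to MB (1 MB = 1024 * 1024 bytes)."""
    return n_bytes / MIB


def can_send(memory_bound_bytes: float, available_bytes: float) -> bool:
    """Hypothetical check: send the job only if the host has enough usable RAM."""
    return memory_bound_bytes <= available_bytes


if __name__ == "__main__":
    bound = 4e9                # assumed memory requirement of a large resend
    available = 3455.31 * MIB  # usable RAM reported for the Raspberry Pi 400
    print(f"needed: {to_mib(bound):.2f} MB, available: {to_mib(available):.2f} MB")
    print("send job" if can_send(bound, available) else "skip this host")
```

With the numbers from the thread, the check fails for the Raspberry Pi, so the large resend is skipped for that host, which is the protective behavior described above.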

Most of the 4 GB resends have now been sent out, and we are about to start going through the jobs that require 1.5 GB. There are currently 21,888 of them to send out. Once we get through those, we will be sending out much more normally sized jobs, and your computer should start to receive them at that time. That should be in another hour or two.

[Oct 7, 2021 12:03:04 AM]
