World Community Grid - View Thread - Comprehensive Issue List & Report Thread (Feb. 24, 2023)

World Community Grid Forums

Category: Official Messages

Forum: News

Thread: Comprehensive Issue List & Report Thread (Feb. 24, 2023)

Thread Status: Active
Thread Type: Sticky Thread
Total posts in this thread: 427

This topic has been viewed 881322 times and has 426 replies

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 986
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

10 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

I have been running a 0.6 day cache for years... By using WCG profiles to control how many of each task type I get sent, I can maintain a work load that doesn't quite fill that cache, so when a few units finish they get replenished quite quickly; indeed, when others were complaining about no work my MCM1 caches were all either full or just one or two tasks short... The same applied to OPN1/G and ARP1 when work [other than retries] was actually available.

I still get the odd Server Aborted, but (as pointed out) that's one of the penalties for getting lots of retries... And, of course, it shouldn't abort a user task if it has passed the first checkpoint unless the work unit is known to be bad, so Server Aborted is only bad news if there's a long time between checkpoints - waste of a download, yes; waste of CPU, probably not...
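As a rough illustration of the rule described above, here is a minimal sketch. The class, field, and function names are hypothetical, and this paraphrases the post rather than WCG's or BOINC's actual server logic.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Hypothetical view of one task on a volunteer's machine."""
    passed_first_checkpoint: bool
    workunit_known_bad: bool = False


def should_server_abort(task: TaskState, result_no_longer_needed: bool) -> bool:
    """Decide whether a task would be aborted under the policy described in the post."""
    # A work unit known to be bad is aborted regardless of progress.
    if task.workunit_known_bad:
        return True
    # If the result is still needed, there is nothing to abort.
    if not result_no_longer_needed:
        return False
    # Otherwise abort only tasks that have not yet reached their first checkpoint.
    return not task.passed_first_checkpoint
```

Under this reading, a Server Aborted task costs a download but little or no CPU time, unless the gap before the first checkpoint is long.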

The only BOINC project I participate in that has very long running work units (CPDN) seems to send them out based on the number of available "cores" rather than buffer capacity, so I don't have problems with that :-)

Seems that there are some who feel the need to have a 6-day cache. Why? I have no clue. I can see a day or two, but not a week.

I suspect a lot of those users happened to make their first contact with a BOINC project that had high default cache configurations (back in the day, 10+10!...), and that then propagates to their preferences on other projects -- if they don't realize it's a problem they won't do anything about it (more on this later...)

Also, I wonder what the Science United user default setup is at WCG - not every user will know that they can change caches and other settings via the client, so if those are high...

At the 6:00:00-day mark, the system does an automatic RESEND on that WU.

Before the pre-hiatus workload clear-out, most WCG projects used the BOINC grace-day option to give No Reply jobs an extra day's grace before sending a retry -- that meant that any "slow" users would either hit "Not Started by Deadline" and self-abort, or would have a bit of extra time to return. That would have stopped most of the Server Aborts I've had (though it's amazing how often I can get my retry sent back before the late returner sends their result in, especially for OPN1/OPNG!)

I'd be tempted to leave ARP1 as it is - I can put up with the odd aborted job in exchange for trying to keep the generations moving on... However, I'd be inclined to shave a day off the 6-day deadlines on other projects and put the grace day back... I don't think any of the other projects (even HST1) need 5+ days to run a single task, and I wonder how many really slow machines are still running here anyway...
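To make the arithmetic of that suggestion concrete, here is a small sketch comparing a 6-day deadline with no grace period against a 5-day deadline plus one grace day. The function name and the sample send time are made up for illustration; the real BOINC server settings may behave differently in detail.

```python
from datetime import datetime, timedelta


def deadlines(sent_at, deadline_days, grace_days=0.0):
    """Return (client_deadline, server_resend_time) for one task.

    Assumption: the client sees only the deadline, while the server waits
    for deadline plus grace before sending a retry to another user.
    """
    client_deadline = sent_at + timedelta(days=deadline_days)
    server_resend = client_deadline + timedelta(days=grace_days)
    return client_deadline, server_resend


sent = datetime(2022, 12, 22, 12, 0)  # hypothetical send time
current = deadlines(sent, deadline_days=6)
proposed = deadlines(sent, deadline_days=5, grace_days=1)

print("current :", current)
print("proposed:", proposed)
```

In both cases the retry goes out six days after the original was sent, but in the second case the client's own deadline arrives a day earlier, so unstarted tasks would self-abort before the retry is issued.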

To anyone running a long queue, please make sure that all your WU’s are well clear of your system before the 6:00:00-day mark!

Unfortunately, I suspect significant numbers of the users with large caches are running in "fire and forget" mode and may not realize they are causing any issues (as I mentioned above). Such users are highly unlikely to see anything posted on here :-( And there will also be users whose attitude is likely to be "so what -- I never run out of work and that's what matters" (but how many of those tasks are actually returned in time and count for anything?)

I'm not sure what can be done about it in practical terms -- that's one of the reasons I'd quite like to see the grace period back for most WCG projects...

Cheers - Al.

[Dec 28, 2022 7:19:19 AM]

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 812
Status: Offline
Project Badges:

45 day badge for Help Cure Muscular Dystrophy

20 year badge for Mapping Cancer Markers

1 year badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2

5 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

At least anecdotally, I initially put 1 day on my quad core machine, and that wasn't enough to stay fed (especially over that 3-4 day dry spell). I bumped it to 3 days like my slower machine, but the BOINC calculation said "job cache full" when it had less than a day's worth of work, which was odd. So now at the 5 day setting, it has about 2 days' worth of MCM. I tend to babysit my BOINC boxes more than I probably should, but it's fun for me. I definitely return all work units within 1-3 days of being issued.

If/when things get better, I'll lower things back to 1-2 day caches. [Edit: Just changed the quad core machine back to 3 days and will see how that goes again.]

In the past, I've only gotten "greedy" if I was super close to the next badge and the project was ending or going on hiatus e.g. Zika or Ebola or SCC pause.

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 1 times, last edit by hchc at Dec 28, 2022 1:07:11 PM]

[Dec 28, 2022 1:02:17 PM]

Blount
Senior Cruncher
Joined: Aug 19, 2005
Post Count: 474
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding

2 year badge for Human Proteome Folding - Phase 2

180 day badge for Help Cure Muscular Dystrophy

45 day badge for Discovering Dengue Drugs - Together

180 day badge for Nutritious Rice for the World

1 year badge for Help Fight Childhood Cancer

45 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

180 day badge for The Clean Energy Project - Phase 2

1 year badge for Computing for Clean Water

200 year badge for Mapping Cancer Markers

45 day badge for Uncovering Genome Mysteries

1 year badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

10 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

I run with a 2.0-day cache and have for years. Lately I have 4 hours or less of tasks waiting. In the recent past I would have a cache of almost 2 days. Something changed and the cache and active jobs are seeing lots of dry spells.

MCM tasks run in 1hr or less on the AMD 7950X, 1:4hr-ish on the AMD 5950X and 1:42 plus on the AMD 3950X. Larger deltas when running ARP. If running over 8 ARPs on a machine the run time extends significantly. I limit the machines to 4 or 8 ARPs.
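For a sense of scale, the relationship between cache size and task count can be sketched as below. This is a back-of-the-envelope estimate only: the function name and the example numbers are assumptions, and the BOINC client uses its own runtime estimates, so the figure it works to can differ.

```python
import math


def tasks_to_fill_cache(cache_days, concurrent_tasks, hours_per_task):
    """Roughly how many tasks represent `cache_days` of work.

    cache_days       -- work buffer setting, in days
    concurrent_tasks -- tasks the machine runs at once (assumed constant)
    hours_per_task   -- typical run time of one task, in hours
    """
    if hours_per_task <= 0:
        raise ValueError("hours_per_task must be positive")
    return math.ceil(cache_days * 24 * concurrent_tasks / hours_per_task)


# Hypothetical example: a 2.0-day cache, 4 tasks at a time, 1 hour per task.
print(tasks_to_fill_cache(2.0, 4, 1.0))
```

The count grows with the number of tasks run at once and shrinks with task run time, which is why the same cache setting means very different queue lengths on different machines.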

[Dec 28, 2022 3:51:17 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7697
Status: Offline
Project Badges:

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

45 day badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

5 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

100 year badge for Smash Childhood Cancer

100 year badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

I am really satisfied running the 3 day cache at the moment. It seems to keep my machines well fed without overdoing it. Almost all of the "server aborts" have no cpu time associated with them, so they did use some bandwidth on the download, but that is minimal. Once the supply has stabilized for a while I will cut back to a 1.5 day cache. That always seemed to work fairly well.
Just as a side note, with the MCM units the LOO type run much quicker on the Linux systems and NFCV type run quicker on the Windows systems. On the Windows I7-3770 the NFCV type run about 1.75 hours and the LOO type run 2.75 to 3 hours.

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Dec 28, 2022 5:27:20 PM]

[Dec 28, 2022 5:23:03 PM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 303
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Computing for Clean Water

180 day badge for FightAIDS@Home - Phase 2

180 day badge for Microbiome Immunity Project

20 year badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

You have me at a loss. Please explain what you mean about the types:

Just as a side note, with the MCM units the LOO type run much quicker on the Linux systems and NFCV type run quicker on the Windows systems. On the Windows I7-3770 the NFCV type run about 1.75 hours and the LOO type run 2.75 to 3 hours.

[Dec 29, 2022 2:49:24 AM]

nivrip
Senior Cruncher
North Yorkshire
Joined: Sep 13, 2007
Post Count: 264
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding - Phase 2

90 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

90 day badge for Help Fight Childhood Cancer

1 year badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for The Clean Energy Project - Phase 2

90 day badge for Drug Search for Leishmaniasis

45 day badge for GO Fight Against Malaria

1 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

20 year badge for Smash Childhood Cancer

5 year badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

You have me at a loss. Please explain what you mean about the types:

Yes, I need some education on this topic too.

----------------------------------------

YORKSHIRE CRUNCHER

[Dec 29, 2022 10:09:01 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7697
Status: Offline
Project Badges:


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

This is from the results log of a completed work unit. The line VMethod = NFCV specifies the type to which I was referring.

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc-linux-gnu -SettingsFile MCM1_0193802_5599.txt -DatabaseFile dataset-sarc1.txt
Settings File
DateOfDesign = 20200218
Designer = Krembil/cubes
WorkOrderID = 0193802_5599
DatasetID = sarc1
RSeed = 334745600
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
RunPermutationAlgorithm = 0
FitnessFn = 0
NumberOfGenesInStartingSignature = 20
NumberOfGenesInSignatureMin = 20
NumberOfGenesInSignatureMax = 20
SearchAlgorithmNumberToCreate = 12071
MinFitness = 0.497
VMethod = NFCV
NFolds = 20
SvmArgs = "-v 0 -t 0 -c 1000"
SvmLearnLimit = 250000
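
For anyone who wants to pull the type out of a saved result log automatically, a minimal sketch follows. It assumes the log has been saved as plain text with `Name = value` lines like those above; the function name is hypothetical.

```python
import re

# Matches lines of the form "Name = value", as in the settings dump above.
SETTING_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_settings(stderr_text: str) -> dict:
    """Collect 'Name = value' pairs from a result log into a dictionary."""
    settings = {}
    for line in stderr_text.splitlines():
        match = SETTING_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


sample = """Settings File
DatasetID = sarc1
VMethod = NFCV
NFolds = 20
"""
print(parse_settings(sample).get("VMethod"))
```

Lines without an `=` sign, such as `Settings File`, are simply skipped.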

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Dec 29, 2022 2:59:09 PM]

goben_2003
Advanced Cruncher
Joined: Jun 16, 2006
Post Count: 146
Status: Offline
Project Badges:

2 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

The issues we are aware of but remain unresolved:

Devices appear on the Results list but not the Devices list (In progress)
New devices should now show up within 24 hours.

(Emphasis added)

Hi Cyclops,

They are not showing up within 24 hours. I added a new device which returned its first result ~84 hours ago (3.5 days). Is there an update when this will be fixed? I saw you say it was marked as a high priority bug. When will this high priority bug be fixed? If it was 24 hours it would be understandable as a temporary workaround until the real fix is in. I waited (patiently?) until 3.5 days to come look at the forums and see what the problem was.

To you and bfmorse: from what the tech team was able to find, it seems like the devices were registered in the BOINC database but excluded from the website database. Many devices that were missing have been synchronized between the two using the procedure provided by IBM. However, there are still issues with displaying information about those devices that were synchronized between BOINC and DB2 as you identified, and also w.r.t. statistics in My Contribution or listings in My Contribution. The tech team will be working to fix this issue and have classified it as a high priority bug.
In the meantime, I will add this issue to the Comprehensive Bug List. Any updates will be shared on that thread as well as this one.
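
The mismatch described here, devices present in one database but missing from the other, amounts to a set difference. A minimal sketch, with made-up device names and no connection to the actual BOINC or DB2 schemas:

```python
def missing_from_website(boinc_devices, website_devices):
    """Return devices known to BOINC that the website database does not list."""
    return sorted(set(boinc_devices) - set(website_devices))


# Hypothetical device names, for illustration only.
boinc = ["host-a", "host-b", "host-c"]
website = ["host-a", "host-b"]
print(missing_from_website(boinc, website))
```

A synchronization step would then need to add each device in that list to the website side.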

----------------------------------------

[Jan 2, 2023 10:47:46 PM]

Igelwurst
Cruncher
Germany
Joined: Jun 29, 2015
Post Count: 23
Status: Offline
Project Badges:

14 day badge for Africa Rainfall Project


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

Hi,

I have the same issue... the new device has been working for about 2 weeks and is not displayed in the device list.

br Igelwurst

----------------------------------------

[Jan 27, 2023 1:16:29 PM]

bonami2
Cruncher
Joined: Nov 8, 2021
Post Count: 1
Status: Offline
Project Badges:

90 day badge for OpenPandemics - COVID-19


Re: Comprehensive Issue List & Report Thread (Dec. 16, 2022)

Been running for more than 2 weeks and got no statistics update since the last work I did about 2 years ago. Results are uploading, but no device or stat update.
Going back to folding@home

----------------------------------------
[Edit 1 times, last edit by bonami2 at Feb 7, 2023 1:07:02 PM]

[Feb 7, 2023 1:06:31 PM]
