World Community Grid - View Thread - New Beta Test

Thread: New Beta Test – May 29, 2019 [Issues Thread]

Thread Status: Active
Total posts in this thread: 315

This topic has been viewed 67961 times and has 314 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Test – May 29, 2019 [Issues Thread]

Something to ponder: Based on what we know so far about this potential project, it might not be appropriate for ALL members. This is probably due to the runtime, memory requirements, and/or bandwidth. It is my understanding that WCG compiles to the lowest common denominator to allow for the greatest participation. If we assume that this project will only be run by a subset of the members, most likely, those with larger and newer resources, why not apply compiler optimizations to facilitate runtime by using newer faster instructions? This project is probably not appropriate for a Pentium processor.

[May 31, 2019 1:46:48 PM]

Michael Goetz
Cruncher
United States
Joined: Dec 11, 2017
Post Count: 35
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

it is computing OK, but the progress bar and time to complete are faulty in more than one way.

Indeed. It appears as if the estimated GFLOPS setting is substantially too low.

Also, although this isn't a problem per se, these tasks seem to have a linear and well-behaved progress bar. It would be beneficial for them to use BOINC's <fraction_done_exact/> mechanism, which overrides BOINC's default "smart" estimation. This would compensate for the bad rsc_fpops_est setting.

Fortunately, YOU can set <fraction_done_exact/> yourself by using app_config.xml. I haven't tried it myself for this beta, but this should work:

<app_config>
  <app>
    <name>beta27</name>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
</app_config>

That file goes in the projects/www.worldcommunitygrid.org directory.
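Before restarting BOINC, it can help to confirm that the file is at least well-formed XML. The following is only a sketch using Python's standard library; the helper name is made up for illustration, and it checks syntax only, not whether the client accepts the settings:

```python
import xml.etree.ElementTree as ET

# Same content as the app_config.xml shown above.
APP_CONFIG = """\
<app_config>
  <app>
    <name>beta27</name>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
</app_config>
"""


def apps_with_exact_fraction(xml_text):
    """Return the names of <app> entries that contain <fraction_done_exact>.

    Hypothetical helper, not part of BOINC. ET.fromstring raises
    xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text)
    names = []
    for app in root.findall("app"):
        if app.find("fraction_done_exact") is not None:
            names.append(app.findtext("name", default="").strip())
    return names


print(apps_with_exact_fraction(APP_CONFIG))
```

To check a real file, read its contents and pass the text to the same function.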

Once the task starts running, BOINC will compute the estimated total time as exactly <elapsed time> / <progress>. On these beta tasks, that should give a fairly accurate remaining time estimate.
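As a rough illustration of that arithmetic (a sketch only, not BOINC's actual code; the function name and the sample numbers are made up):

```python
def estimate_remaining_seconds(elapsed_seconds, fraction_done):
    """Remaining time when the total is estimated purely as elapsed / progress.

    Hypothetical helper for illustration; it is not part of BOINC.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    estimated_total = elapsed_seconds / fraction_done
    return estimated_total - elapsed_seconds


# Made-up example: 6 hours elapsed at 25% progress.
remaining = estimate_remaining_seconds(6 * 3600, 0.25)
print(f"{remaining / 3600:.1f} hours remaining")
```

The estimate is only as good as the progress value, which is why it suits tasks whose progress bar advances linearly.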

It won't work until the task has started running, and you may need to restart BOINC for it to take effect.

I think beta27 is the correct app name. Can anyone confirm that?

This assumes you're using the standard BOINC client. All bets are off if you're using a customized version of BOINC, which might or might not support this feature.

[May 31, 2019 1:51:07 PM]

Michael Goetz
Cruncher
United States
Joined: Dec 11, 2017
Post Count: 35
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

Must be something about a safety-net function... if your device takes too long, the claim will be slashed to what the fpops header said comes with 1 hour 29 minutes. Yes, that would be about right were my computer in the q-bit class. ;o)

While the workunit's rsc_fpops_est setting is responsible for the time estimates being too low (see my earlier comment about you being able to use <fraction_done_exact/> to partly fix that yourself), there's another workunit setting rsc_fpops_bound which controls the "safety net" you're describing.

Setting rsc_fpops_est too low is a nuisance that causes time estimates to behave badly. Setting rsc_fpops_bound too low is a real problem as it causes perfectly good tasks to abort. Fortunately, it's an easy change to whatever work generator they're using.

I agree with you 100% that this is a bad feature. I personally set rsc_fpops_bound to a very large multiple of rsc_fpops_est. I never, ever, want BOINC deciding on its own that a task has been running too long. It's definitely not smart enough for that -- and it's especially bad when it comes to time estimates. Unfortunately, this is something that can only be set from the server. You can't fix this from the user's side.

This is why we have betas. :)

----------------------------------------
[Edit 1 times, last edit by Michael Goetz at May 31, 2019 2:04:55 PM]

[May 31, 2019 2:02:52 PM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1322
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

I think beta27 is the correct app name. Can anyone confirm that?

Correct.

I personally set rsc_fpops_bound to a very large multiple of rsc_fpops_est. I never, ever, want BOINC deciding on its own that a task has been running too long. It's definitely not smart enough for that -- and it's especially bad when it comes to time estimates. Unfortunately, this is something that can only be set from the server. You can't fix this from the user's side.

I changed it myself in client_state, because I already lost 2 tasks to the exceeded time limit, as I mentioned in an earlier post, and I don't want to lose more tasks for that reason. Hopefully new beta tasks will have a more realistic rsc_fpops_est.
Beneficial side effect: testing the restart of tasks.

Before I could change it on this machine, another task was aborted with the exceeded time limit error:

https://www.worldcommunitygrid.org/ms/device/...og.do?resultId=1000181925

World Community Grid 31 May 16:03:33 Aborting task BETA_ARP1_0000370_000_1: exceeded elapsed time limit 153481.54 (547904.24G/3.57G)
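The figures in that log line appear to be the limit in seconds, followed by rsc_fpops_bound and the estimated speed of the host, both in units of 1e9. Assuming that reading is right, the arithmetic can be checked with a small sketch (the helper name is hypothetical; a small mismatch is expected because the log rounds both inputs to two decimals):

```python
def elapsed_time_limit_seconds(fpops_bound, flops):
    """Seconds of run time allowed before the client would abort the task.

    Hypothetical helper; assumes limit = rsc_fpops_bound / estimated speed.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return fpops_bound / flops


# Values taken from the log line above, converted from "G" to plain units.
limit = elapsed_time_limit_seconds(547904.24e9, 3.57e9)
print(f"{limit:.0f} s, about {limit / 3600:.1f} hours")
```

That works out to a little under 43 hours, which agrees with the logged limit of 153481.54 seconds to within rounding.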

----------------------------------------
[Edit 3 times, last edit by Crystal Pellet at May 31, 2019 2:46:23 PM]

[May 31, 2019 2:10:44 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

If we assume that this project will only be run by a subset of the members, most likely, those with larger and newer resources, why not apply compiler optimizations to facilitate runtime by using newer faster instructions? This project is probably not appropriate for a Pentium processor.

We will not automatically opt users into this project, for the reasons you listed: memory, storage, and bandwidth. We are using all the compiler optimizations we can. What limits us from using more is that, for this project, we can only validate successful results by binary equivalence. This application is heavy in floating-point calculations, and without some limits on optimization we would not get binary equivalence, because of rounding differences, even across different current-generation processors.
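A minimal sketch of the rounding issue, in Python rather than the project's own code: floating-point addition is not associative, so an optimization that regroups or reorders operations can change the last bits of a result. The values below are arbitrary and chosen only because they show the effect.

```python
import struct


def bits(x):
    """Hex of the 64-bit IEEE 754 pattern of x, for bit-for-bit comparison."""
    return struct.pack(">d", x).hex()


a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # one evaluation order
regrouped = a + (b + c)      # same math on paper, different grouping

print(left_to_right, bits(left_to_right))
print(regrouped, bits(regrouped))
print("bitwise identical:", bits(left_to_right) == bits(regrouped))
```

The two sums differ in the last bit, which is enough to fail a validator that compares results byte for byte.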

Thanks,
armstrdj

[May 31, 2019 2:52:03 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

The current error and invalid rates for this beta are low and looking good. We will probably not add any new work units until early next week. We will then add the automation pieces to the mix that take the validated results and build the next 48-hour simulation period. Once this is enabled, all of the validated results from the current 2000 simulations will have work available for the continuation. Then, as those finish and validate, the next 48 hours will be built and loaded. For beta we will probably go 3 to 5 levels deep, or in terms of simulation time, 6 to 10 days of simulation. The current plan for production is to simulate an entire calendar year.
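The arithmetic behind those figures, as a small sketch. It assumes every level is exactly one 48-hour period and that a calendar year is 365 days; neither the function names nor the resulting count come from the project itself.

```python
import math

PERIOD_HOURS = 48  # length of one simulation period, per the post above


def simulated_days(levels):
    """Days of simulation covered by a chain of `levels` consecutive periods."""
    return levels * PERIOD_HOURS / 24


def periods_needed(days):
    """Number of 48-hour periods needed to cover `days` days (rounded up)."""
    return math.ceil(days * 24 / PERIOD_HOURS)


for levels in (3, 5):
    print(levels, "levels ->", simulated_days(levels), "days")

print("365 days ->", periods_needed(365), "periods")
```

Under those assumptions, each of the 2000 simulations would need on the order of 183 chained work units to cover a year, each one depending on the validated result of the one before it.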

Thanks,
armstrdj

[May 31, 2019 3:00:11 PM]

pvh513
Senior Cruncher
Joined: Feb 26, 2011
Post Count: 260
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

Doneske's post has been largely answered by armstrdj. What I can add is that there is a "Project Limits" setting in the custom device profile that lets you choose an upper limit on the number of WUs of a given project that are assigned to a rig at any given time (i.e. in the queue and not necessarily running). That gives you a reasonable amount of control over resources like memory, disk, and bandwidth.

[May 31, 2019 4:08:35 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Test – May 29, 2019 [Issues Thread]

Hi devs, WU 1142300482
(https://www.worldcommunitygrid.org/ms/device/....do?workunitId=1142300482)

Host: WCG deviceId 5429224

- dual E5-2660v2 w/HT, dedicated to WCG only
- all 40 threads run random WCG units
- CPU threads are capped at 95% each using systemd
- Debian 9 basic install, which places all threads at +20 nice
- BOINC is configured in "run always" mode and the prefs override sets all values to 100% (as systemd now takes care of limiting them)
- CPU temps stay at about 75C on this chassis, but there is a gap of roughly 20C between the two CPUs:

$ sensors | grep Physical
Physical id 1: +75.0°C (high = +76.0°C, crit = +86.0°C)
Physical id 0: +57.0°C (high = +76.0°C, crit = +86.0°C)

Chassis: Supermicro X9DRi-LN4+/X9DR3-LN4+

Please let me know if I can provide any more useful information about the host which ran the beta WU. It appears I've only been farmed out this one.

Edit: added note about configuration of BOINC client

----------------------------------------
[Edit 1 times, last edit by xithryx at May 31, 2019 5:01:43 PM]

[May 31, 2019 4:56:52 PM]

Jean-David Beyer
Senior Cruncher
USA
Joined: Oct 2, 2007
Post Count: 337
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

Fortunately, YOU can set <fraction_done_exact/> yourself by using app_config.xml. I haven't tried it myself for this beta, but this should work:

<app_config>
  <app>
    <name>beta27</name>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
</app_config>

That file goes in the projects/www.worldcommunitygrid.org directory. Once the task starts running, BOINC will compute the estimated total time as exactly <elapsed time> / <progress>. On these beta tasks, that should give a fairly accurate remaining time estimate. It won't work until the task has started running, and you may need to restart BOINC for it to take effect.

There was no such file in /home/boinc/projects/www.worldcommunitygrid.org, so I created it there. I then stopped the boinc_manager, shut down boinc, waited 15 seconds, and restarted it. I then restarted the boinc_manager. As a result, the elapsed time for the two running beta processes dropped to exactly 25% and the times to complete dropped quite a bit. So the file was noticed and it changed something. On the other hand, the time to complete is still increasing slowly.

My boinc_manager is 7.2.33, which is the latest one in EPEL for my distro.

[May 31, 2019 7:50:44 PM]

Jean-David Beyer
Senior Cruncher
USA
Joined: Oct 2, 2007
Post Count: 337
Status: Offline

Re: New Beta Test – May 29, 2019 [Issues Thread]

I got rid of my old machine that had dual Pentium-III processors some years ago. Likewise, I had a machine with two 32-bit 3.06 GHz Xeon processors that died after 10 years or so when tropical storm Sandy took it out. I now have a slower (1.8 GHz) four-core 64-bit Xeon processor and 16 GBytes of RAM: 8 modules of 2 GBytes each, the kind that must be installed in pairs. This kind:

Capacity: 2GB Module - ECC Reg - DDR3-10600 (PC3-1333)
https://www.amazon.com/gp/product/B00HH8FOEK/...o06_s00?ie=UTF8&psc=1

If I remember correctly, I paid about 10x as much for the first 8GBytes of that stuff when I got that machine.

Detected 1795.700 MHz processor.
Calibrating delay loop (skipped), value calculated using timer frequency.. 3591.40 BogoMIPS (lpj=1795700)
pid_max: default: 32768 minimum: 301
Security Framework initialized

CPU0: Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz stepping 07
Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, full-width counters, Intel PMU driver.
... version: 3
... bit width: 48
... generic registers: 8
... value mask: 0000ffffffffffff
... max period: 0000ffffffffffff
... fixed-purpose events: 3
... event mask: 00000007000000ff
NMI watchdog enabled, takes one hw-pmu counter.
Booting Node 0, Processors #1 #2 #3
Brought up 4 CPUs
Total of 4 processors activated (14365.60 BogoMIPS).

[May 31, 2019 8:25:27 PM]
