Thread: Validation Running Behind? (57 posts, viewed 4632 times)
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

Afterthought for ingleside: Whilst the proposed validator load splitting looks logical and efficient on the face of it, the results for all sciences seem to run in one long series... we're now somewhere around 616 million originals, before they are split for quorum [you see those numbers running past when updating with WCGDAWS]. Just wondered how that works when there are multiple sciences, with a validator or validators per science. It feels like these could be burning cycles just to find their own work, so I guess they're working off some subset table or secondary indices to find which workunits they should be looking at.
[Jan 16, 2013 3:23:59 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Validation Running Behind?

Here are the backend daemons that are currently running:

Server #1
1 7430 running locked no c4cw_validator --d 2 --sleep_interval 10 --app c4cw
2 7433 running locked no c4cw_assimilator --d 2 --sleep_interval 10 --app c4cw --mod 2 0
3 7436 running locked no hcc1_validator --d 2 --sleep_interval 2 --app hcc1 --mod 6 0
4 7461 running locked no hcc1_validator1 --d 2 --sleep_interval 2 --app hcc1 --mod 6 1
5 7464 running locked no hcc1_validator2 --d 2 --sleep_interval 2 --app hcc1 --mod 6 2
6 7467 running locked no hcc1_validator3 --d 2 --sleep_interval 2 --app hcc1 --mod 6 3
7 7470 running locked no hcc1_validator4 --d 2 --sleep_interval 2 --app hcc1 --mod 6 4
8 7473 running locked no hcc1_validator5 --d 2 --sleep_interval 2 --app hcc1 --mod 6 5
9 7477 running locked no hcc1_assimilator --d 2 --sleep_interval 2 --app hcc1 --mod 8 0
10 7486 running locked no hcc1_assimilator1 --d 2 --sleep_interval 2 --app hcc1 --mod 8 1
11 7489 running locked no hcc1_assimilator2 --d 2 --sleep_interval 2 --app hcc1 --mod 8 2
12 7492 running locked no hcc1_assimilator3 --d 2 --sleep_interval 2 --app hcc1 --mod 8 3
13 7495 running locked no sn2s_validator --d 2 --sleep_interval 10 --app sn2s
14 7498 running locked no sn2s_assimilator --d 2 --sleep_interval 10 --app sn2s --mod 2 0
15 7501 running locked no sn2s_assimilator1 --d 2 --sleep_interval 10 --app sn2s --mod 2 1
16 7504 running locked no file_deleter --d 2 --dont_delete_batches --input_files_only
17 7507 running locked no file_deleter1 --d 2 --dont_delete_batches --output_files_only --appid 10 --mod 2 0
18 7516 running locked no file_deleter2 --d 2 --dont_delete_batches --output_files_only --appid 10 --mod 2 1
19 7519 running locked no file_deleter3 --d 2 --dont_delete_batches --output_files_only --nappid 10

Server #2
1 15422 running locked no transitioner --d 2 --sleep_interval 1 --mod 3 0
2 15425 running locked no transitioner1 --d 2 --sleep_interval 1 --mod 3 1
3 15428 running locked no transitioner2 --d 2 --sleep_interval 1 --mod 3 2
4 15431 running locked no faah_validator --d 2 --sleep_interval 10 --app faah
5 15434 running locked no faah_assimilator --d 2 --sleep_interval 10 --app faah
6 15437 running locked no cep2_validator --d 2 --sleep_interval 10 --app cep2
7 15442 running locked no cep2_assimilator --d 2 --sleep_interval 10 --app cep2
8 15455 running locked no hpf2_validator --d 3 --sleep_interval 10 --app hpf2
9 15461 running locked no hpf2_assimilator --d 2 --sleep_interval 10 --app hpf2
10 15471 running locked no db_purge --sleep 5 --no_archive --d 2 --min_age_days 1 --mod 2 0
11 15474 running locked no db_purge1 --sleep 5 --no_archive --d 2 --min_age_days 1 --mod 2 1
12 15477 running locked no hfcc_validator --d 2 --sleep_interval 10 --app hfcc
13 15489 running locked no hfcc_assimilator --d 3 --sleep_interval 10 --app hfcc
14 15493 running locked no dsfl_validator --d 3 --sleep_interval 10 --app dsfl
15 15499 running locked no dsfl_assimilator --d 2 --sleep_interval 10 --app dsfl
16 15512 running locked no gfam_validator --d 3 --sleep_interval 10 --app gfam
17 15523 running locked no gfam_assimilator --d 2 --sleep_interval 10 --app gfam
18 15531 running locked no hcc1_assimilator --d 2 --sleep_interval 2 --app hcc1 --mod 8 4
19 15543 running locked no hcc1_assimilator1 --d 2 --sleep_interval 2 --app hcc1 --mod 8 5
20 15555 running locked no hcc1_assimilator2 --d 2 --sleep_interval 2 --app hcc1 --mod 8 6
21 15560 running locked no hcc1_assimilator3 --d 2 --sleep_interval 2 --app hcc1 --mod 8 7
22 15569 running locked no c4cw_assimilator --d 2 --sleep_interval 10 --app c4cw --mod 2 1


Variables:

--d sets the level of logging information
--sleep_interval sets the number of seconds to wait before querying the database again, for those rare times when the previous query returned nothing to do
--mod X Y means process workunit.id % X == Y (or result.id % X == Y for those daemons operating on the result table) (see the sketch just below)
--min_age_days sets the number of days to wait before deleting a workunit after all of its files have been deleted
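
To make the --mod splitting concrete, here is a minimal Python sketch (illustrative only, not the daemon code); the function name and the example workunit id are made up:

# Sketch of the '--mod X Y' sharding rule: each daemon instance only claims
# rows whose id falls into its own residue class, so several instances can
# work the same table without stepping on each other.
def belongs_to_instance(row_id, num_instances, instance_index):
    return row_id % num_instances == instance_index

# The six hcc1 validators above run with --mod 6 0 .. --mod 6 5,
# so any given workunit id is claimed by exactly one of them:
wu_id = 616000123  # made-up id
owners = [i for i in range(6) if belongs_to_instance(wu_id, 6, i)]
print(owners)  # [1] -- only the '--mod 6 1' instance picks it up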
[Jan 16, 2013 5:54:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

An interesting number and distribution: 8 hcc1 assimilators spread over 2 servers, and 2 assimilators for sn2s, where all the others but hcc1 have just 1 (odd, given how sn2s compares to the other sciences in result volume).

(Show the back of your tongue and we can see what you've been eating, too ;)

Thanks
[Jan 16, 2013 6:09:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

And this from the Known Issues DDP thread:
We are caught up now and work is flowing freely. In order to help volunteers keep their machines contributing during these outages, we have expanded some settings that control how much can be cached. We are now using the following settings:
<daily_result_quota>300</daily_result_quota>
<gpu_multiplier>15</gpu_multiplier>
<initial_daily_result_quota>5</initial_daily_result_quota>
<max_wus_to_send>30</max_wus_to_send>
<max_wus_in_progress>90</max_wus_in_progress>
<max_wus_in_progress_gpu>1200</max_wus_in_progress_gpu>
[Jan 16, 2013 7:13:01 PM]
The daily quota times the multiplier is the most a device can get per day for a given resource, so 2 GPUs would be 300 * 15 * 2 = 9,000 results a day. If one card is more powerful than the other, the processing distribution could differ... the servers do not care, AFAIK. With 2 cards, even of unequal make, you still get the chance to buffer 2 * 1,200 = 2,400 in progress... a good few hours. :D
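
As a quick sanity check of that arithmetic, a sketch using the settings quoted above (the two-GPU host is just an example, not server code):

# Quota arithmetic from the quoted scheduler settings.
daily_result_quota = 300
gpu_multiplier = 15
max_wus_in_progress_gpu = 1200    # per GPU

num_gpus = 2                      # example host
daily_cap = daily_result_quota * gpu_multiplier * num_gpus    # 300*15*2 = 9000
in_progress_cap = max_wus_in_progress_gpu * num_gpus          # 1200*2  = 2400
print(daily_cap, in_progress_cap)                             # 9000 2400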
[Jan 16, 2013 6:29:23 PM]
themoonscrescent
Veteran Cruncher
UK
Joined: Jul 1, 2006
Post Count: 1320
Status: Offline
Re: Validation Running Behind?

I'm now into 43 pages of pending for GPU. :(

Which file do the new settings go in?
[Jan 17, 2013 8:20:16 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Validation Running Behind?

Afterthought for ingleside: Whilst the proposed validator load splitting looks logical and efficient on the face of it, the results for all sciences seem to run in one long series... we're now somewhere around 616 million originals, before they are split for quorum [you see those numbers running past when updating with WCGDAWS]. Just wondered how that works when there are multiple sciences, with a validator or validators per science. It feels like these could be burning cycles just to find their own work, so I guess they're working off some subset table or secondary indices to find which workunits they should be looking at.

I have no experience with databases, but my guess is there will either be one index covering any wu with the NEED_VALIDATE flag set, or one index per application/NEED_VALIDATE combination. If the former, FAAH for example would need to scan through the index until it finds a wu with FAAH as the application, while HCC would need to check the wuid first and only afterwards check whether the application is HCC. If the latter, HCC would still need to check the wuid, while FAAH would only need to check whether anything is present in its index at all, or otherwise sleep.
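
For what it's worth, here is a rough sketch (assumed table layout and SQL, not WCG's actual code) of the kind of lookup a per-science, sharded validator could be doing; with an index covering (appid, need_validate), each instance would only ever touch its own science's rows:

# Assumption/sketch: each validator instance asks only for workunits of its
# own application that are flagged for validation, optionally sharded by id.
def candidate_query(appid, mod=None, limit=1000):
    sql = (f"SELECT id, name FROM workunit "
           f"WHERE appid = {appid} AND need_validate > 0")
    if mod is not None:
        n, i = mod
        sql += f" AND id % {n} = {i}"   # the --mod n i split
    return sql + f" LIMIT {limit}"

print(candidate_query(4))          # an unsharded validator (appid made up)
print(candidate_query(4, (6, 1)))  # one of six sharded hcc1-style instances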
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Jan 17, 2013 10:54:14 AM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline
Re: Validation Running Behind?

It's a long time ago that I stopped using the "Results" page, as it will not display anymore. I estimate that there are probably days where I have over 30'000 results in various states; that means around 2,000 pages at 15 results per page. This accumulation is also due to the validator issues. It seems it is not manageable anymore.

When I look at the results per device, it is becoming very erratic. I know there were issues with the databases; I hope this will all settle. The GPU crunching has been an excellent stress test of the WCG infrastructure and shows its limitations.

Why not put 10 or 20 diffraction images into one WU for HCC? The crunching time would rise and there would be far fewer WUs to manage. The network bandwidth would also be reduced, as there would be less frantic traffic. HCC already returns more WUs per day than all other active projects combined, and with GPU crunching the traffic has doubled.
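
As rough arithmetic (the daily volume below is a made-up figure, only to show the scaling):

# Illustrative only: packing N images into one WU divides the number of WUs
# to create, send, validate and assimilate roughly by N.
wus_per_day_now = 1_000_000        # assumed figure, not a WCG statistic
for images_per_wu in (1, 10, 20):
    print(images_per_wu, wus_per_day_now // images_per_wu)
# -> 1 1000000, 10 100000, 20 50000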

On GPUGrid there are two sizes of WU, sent according to the graphics card class (all NVIDIA types there). Powerful GPU cards like the 580GTX or 680GTX receive WUs that may have up to 8 hours of runtime each.
[Edit 1 times, last edit by Hypernova at Jan 17, 2013 4:08:38 PM]
[Jan 17, 2013 4:07:44 PM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline
Re: Validation Running Behind?

knreed's last post regarding the major database issues mentions repacking of WUs as a possibility. That goes in the right direction. But knreed, please, if you do repack the GPU WUs, tell us well in advance so we can adapt the app_info.xml file to the changed WU type number and avoid idle time on the dedicated machines.
[Jan 18, 2013 6:46:28 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

Re the outage announcement: http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=408639 Maybe the word "various" in the noted knreed post on the Jan. 22 24-hour outage [starting 03:00 UTC] was meant to be "varying" sizes. The idea is still to create WUs for the sciences [where it's possible!] that are matched to groups of hosts of different power. Limiting the example to CPU only: a Centrino Duo getting a WU with 2 HCC images while an i7-2600 gets WUs with 10 images, or the former getting a FAAH job with 20 dockings and the latter one with 100 dockings. In all, when the average target runtime of a science is 6 hours, every host would finish much closer to that 6-hour target, instead of runtimes spreading from 2 to 24 hours between the fastest and the slowest.
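
A minimal sketch of that sizing idea (the names and numbers here are all assumptions, nothing WCG has announced):

# Hypothetical sizing rule: pick how many images/dockings go into a WU so the
# estimated runtime on a given host lands near the science's target runtime.
def items_per_wu(host_gflops_per_s, gflops_per_item, target_hours=6.0):
    seconds_per_item = gflops_per_item / host_gflops_per_s
    return max(1, round(target_hours * 3600 / seconds_per_item))

# Illustrative: a slow CPU vs. a faster one, same 6-hour target.
print(items_per_wu(2.0, 2000.0))    # ~22 items for the slow host
print(items_per_wu(10.0, 2000.0))   # ~108 items for the fast host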

Here's hoping
[Jan 18, 2013 7:06:20 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

BTW, vis-a-vis the 2 outage notices, there's an unsaid piece of good news in the not-so-good news [unless I missed it being said]. Hope knreed is able to fold this in on the go whilst doing the software side of the upgrades. Knowing WCG/IBM policy, let's not get ahead of ourselves and set an expectation that could fly back in our face [Mr. Murphy is attentive 24/7 ;O].
[Jan 18, 2013 8:44:12 AM]