Thread: Validation Running Behind? (57 posts, viewed 4632 times)
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

Afterthought for ingleside: Whilst the proposed validator load splitting looks logical and efficient on the face of it, the results for all sciences seem to run in one long series... we're now somewhere around 616 million originals, before they are split for quorum [you see those numbers running past when updating with WCGDAWS]. Just wondered how that works when there are multiple sciences, with a validator or validators per science. It feels like these could be burning cycles just to find their own work, so I guess they're working off some subset table or secondary indices to find which workunits they should be looking at.
[Jan 16, 2013 3:23:59 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Validation Running Behind?

Here are the backend daemons that are currently running:

Server #1
1 7430 running locked no c4cw_validator --d 2 --sleep_interval 10 --app c4cw
2 7433 running locked no c4cw_assimilator --d 2 --sleep_interval 10 --app c4cw --mod 2 0
3 7436 running locked no hcc1_validator --d 2 --sleep_interval 2 --app hcc1 --mod 6 0
4 7461 running locked no hcc1_validator1 --d 2 --sleep_interval 2 --app hcc1 --mod 6 1
5 7464 running locked no hcc1_validator2 --d 2 --sleep_interval 2 --app hcc1 --mod 6 2
6 7467 running locked no hcc1_validator3 --d 2 --sleep_interval 2 --app hcc1 --mod 6 3
7 7470 running locked no hcc1_validator4 --d 2 --sleep_interval 2 --app hcc1 --mod 6 4
8 7473 running locked no hcc1_validator5 --d 2 --sleep_interval 2 --app hcc1 --mod 6 5
9 7477 running locked no hcc1_assimilator --d 2 --sleep_interval 2 --app hcc1 --mod 8 0
10 7486 running locked no hcc1_assimilator1 --d 2 --sleep_interval 2 --app hcc1 --mod 8 1
11 7489 running locked no hcc1_assimilator2 --d 2 --sleep_interval 2 --app hcc1 --mod 8 2
12 7492 running locked no hcc1_assimilator3 --d 2 --sleep_interval 2 --app hcc1 --mod 8 3
13 7495 running locked no sn2s_validator --d 2 --sleep_interval 10 --app sn2s
14 7498 running locked no sn2s_assimilator --d 2 --sleep_interval 10 --app sn2s --mod 2 0
15 7501 running locked no sn2s_assimilator1 --d 2 --sleep_interval 10 --app sn2s --mod 2 1
16 7504 running locked no file_deleter --d 2 --dont_delete_batches --input_files_only
17 7507 running locked no file_deleter1 --d 2 --dont_delete_batches --output_files_only --appid 10 --mod 2 0
18 7516 running locked no file_deleter2 --d 2 --dont_delete_batches --output_files_only --appid 10 --mod 2 1
19 7519 running locked no file_deleter3 --d 2 --dont_delete_batches --output_files_only --nappid 10

Server #2
1 15422 running locked no transitioner --d 2 --sleep_interval 1 --mod 3 0
2 15425 running locked no transitioner1 --d 2 --sleep_interval 1 --mod 3 1
3 15428 running locked no transitioner2 --d 2 --sleep_interval 1 --mod 3 2
4 15431 running locked no faah_validator --d 2 --sleep_interval 10 --app faah
5 15434 running locked no faah_assimilator --d 2 --sleep_interval 10 --app faah
6 15437 running locked no cep2_validator --d 2 --sleep_interval 10 --app cep2
7 15442 running locked no cep2_assimilator --d 2 --sleep_interval 10 --app cep2
8 15455 running locked no hpf2_validator --d 3 --sleep_interval 10 --app hpf2
9 15461 running locked no hpf2_assimilator --d 2 --sleep_interval 10 --app hpf2
10 15471 running locked no db_purge --sleep 5 --no_archive --d 2 --min_age_days 1 --mod 2 0
11 15474 running locked no db_purge1 --sleep 5 --no_archive --d 2 --min_age_days 1 --mod 2 1
12 15477 running locked no hfcc_validator --d 2 --sleep_interval 10 --app hfcc
13 15489 running locked no hfcc_assimilator --d 3 --sleep_interval 10 --app hfcc
14 15493 running locked no dsfl_validator --d 3 --sleep_interval 10 --app dsfl
15 15499 running locked no dsfl_assimilator --d 2 --sleep_interval 10 --app dsfl
16 15512 running locked no gfam_validator --d 3 --sleep_interval 10 --app gfam
17 15523 running locked no gfam_assimilator --d 2 --sleep_interval 10 --app gfam
18 15531 running locked no hcc1_assimilator --d 2 --sleep_interval 2 --app hcc1 --mod 8 4
19 15543 running locked no hcc1_assimilator1 --d 2 --sleep_interval 2 --app hcc1 --mod 8 5
20 15555 running locked no hcc1_assimilator2 --d 2 --sleep_interval 2 --app hcc1 --mod 8 6
21 15560 running locked no hcc1_assimilator3 --d 2 --sleep_interval 2 --app hcc1 --mod 8 7
22 15569 running locked no c4cw_assimilator --d 2 --sleep_interval 10 --app c4cw --mod 2 1


Variables:

--d sets the level of logging information
--sleep_interval sets the number of seconds to wait before querying the database again, for those rare times when the previous query returned nothing to do
--mod X Y means process workunit.id % X == Y (or result.id % X == Y for those daemons operating on the result table) (see the sketch just below)
--min_age_days sets the number of days to wait before deleting a workunit after all of its files have been deleted
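
To make the --mod splitting concrete, here is a minimal Python sketch (illustrative only, not the daemon code); the function name and the example workunit id are made up:

# Sketch of the '--mod X Y' sharding rule: each daemon instance only claims
# rows whose id falls into its own residue class, so several instances can
# work the same table without stepping on each other.
def belongs_to_instance(row_id, num_instances, instance_index):
    return row_id % num_instances == instance_index

# The six hcc1 validators above run with --mod 6 0 .. --mod 6 5,
# so any given workunit id is claimed by exactly one of them:
wu_id = 616000123  # made-up id
owners = [i for i in range(6) if belongs_to_instance(wu_id, 6, i)]
print(owners)  # [1] -- only the '--mod 6 1' instance picks it up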
[Jan 16, 2013 5:54:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

An interesting number and distribution: 8 hcc1 assimilators spread over 2 servers, and 2 assimilators for sn2s, where all the others but hcc1 have just 1 (odd, given how sn2s compares to the other sciences in result volume).

(Show the back of your tongue and we can see what you've been eating, too ;)

Thanks
[Jan 16, 2013 6:09:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

And this from the Known Issues DDP thread:
We are caught up now and work is flowing freely. In order to help volunteers keep their machines contributing during these outages, we have expanded some settings that control how much can be cached. We are now using the following settings:
<daily_result_quota>300</daily_result_quota>
<gpu_multiplier>15</gpu_multiplier>
<initial_daily_result_quota>5</initial_daily_result_quota>
<max_wus_to_send>30</max_wus_to_send>
<max_wus_in_progress>90</max_wus_in_progress>
<max_wus_in_progress_gpu>1200</max_wus_in_progress_gpu>
[Jan 16, 2013 7:13:01 PM]
The daily quota times the multiplier is the most a device can get per day for a given resource, so 2 GPUs would be 300 * 15 * 2 = 9,000 results a day. If one card is more powerful than the other, the processing distribution could differ... the servers do not care, AFAIK. With 2 cards, even of unequal make, you still get the chance to buffer 2 * 1,200 = 2,400 in progress... a good few hours. :D
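
As a quick sanity check of that arithmetic, a sketch using the settings quoted above (the two-GPU host is just an example, not server code):

# Quota arithmetic from the quoted scheduler settings.
daily_result_quota = 300
gpu_multiplier = 15
max_wus_in_progress_gpu = 1200    # per GPU

num_gpus = 2                      # example host
daily_cap = daily_result_quota * gpu_multiplier * num_gpus    # 300*15*2 = 9000
in_progress_cap = max_wus_in_progress_gpu * num_gpus          # 1200*2  = 2400
print(daily_cap, in_progress_cap)                             # 9000 2400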
[Jan 16, 2013 6:29:23 PM]
themoonscrescent
Veteran Cruncher
UK
Joined: Jul 1, 2006
Post Count: 1320
Status: Offline
Re: Validation Running Behind?

I'm now into 43 pages of pending for GPU. :(

Which file do the new settings go in?
[Jan 17, 2013 8:20:16 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Validation Running Behind?

Afterthought for ingleside: Whilst the proposed validator load splitting looks logical and efficient on the face of it, the results for all sciences seem to run in one long series... we're now somewhere around 616 million originals, before they are split for quorum [you see those numbers running past when updating with WCGDAWS]. Just wondered how that works when there are multiple sciences, with a validator or validators per science. It feels like these could be burning cycles just to find their own work, so I guess they're working off some subset table or secondary indices to find which workunits they should be looking at.

I have no experience with databases, but my guess is there will either be one index covering any wu with the NEED_VALIDATE flag set, or one index per application/NEED_VALIDATE combination. If the former, FAAH for example would need to scan through the index until it finds a wu with FAAH as the application, while HCC would need to check the wuid first and only afterwards check whether the application is HCC. If the latter, HCC would still need to check the wuid, while FAAH would only need to check whether anything is present in its index at all, or otherwise sleep.
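
For what it's worth, here is a rough sketch (assumed table layout and SQL, not WCG's actual code) of the kind of lookup a per-science, sharded validator could be doing; with an index covering (appid, need_validate), each instance would only ever touch its own science's rows:

# Assumption/sketch: each validator instance asks only for workunits of its
# own application that are flagged for validation, optionally sharded by id.
def candidate_query(appid, mod=None, limit=1000):
    sql = (f"SELECT id, name FROM workunit "
           f"WHERE appid = {appid} AND need_validate > 0")
    if mod is not None:
        n, i = mod
        sql += f" AND id % {n} = {i}"   # the --mod n i split
    return sql + f" LIMIT {limit}"

print(candidate_query(4))          # an unsharded validator (appid made up)
print(candidate_query(4, (6, 1)))  # one of six sharded hcc1-style instances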
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Jan 17, 2013 10:54:14 AM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline
Re: Validation Running Behind?

It's a long time ago that I stopped using the "Results" page, as it will not display anymore. I estimate that there are probably days where I have over 30'000 results in various states; that means around 2,000 pages at 15 results per page. This accumulation is also due to the validator issues. It seems it is not manageable anymore.

When I look at the results per device, it is becoming very erratic. I know there were issues with the databases; I hope this will all settle. The GPU crunching has been an excellent stress test of the WCG infrastructure and shows its limitations.

Why not put 10 or 20 diffraction images into one WU for HCC? The crunching time would rise and there would be far fewer WUs to manage. The network bandwidth would also be reduced, as there would be less frantic traffic. HCC already returns more WUs per day than all other active projects combined, and with GPU crunching the traffic has doubled.
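
As rough arithmetic (the daily volume below is a made-up figure, only to show the scaling):

# Illustrative only: packing N images into one WU divides the number of WUs
# to create, send, validate and assimilate roughly by N.
wus_per_day_now = 1_000_000        # assumed figure, not a WCG statistic
for images_per_wu in (1, 10, 20):
    print(images_per_wu, wus_per_day_now // images_per_wu)
# -> 1 1000000, 10 100000, 20 50000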

On GPUGrid there are two sizes of WU, sent according to the graphics card class (all NVIDIA types there). Powerful GPU cards like the 580GTX or 680GTX receive WUs that may have up to 8 hours of runtime each.
[Edit 1 times, last edit by Hypernova at Jan 17, 2013 4:08:38 PM]
[Jan 17, 2013 4:07:44 PM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline
Re: Validation Running Behind?

knreed's last post regarding the major database issues mentions repacking of WUs as a possibility. That goes in the right direction. But knreed, please, if you do repack the GPU WUs, tell us well in advance so we can adapt the app_info.xml file to the changed WU type number and avoid idle time on the dedicated machines.
[Jan 18, 2013 6:46:28 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

Re the outage announcement: http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=408639 Maybe the word "various" in the noted knreed post on the Jan. 22 24-hour outage [starting 03:00 UTC] was meant to be "varying" sizes. The idea is still to create WUs for the sciences [where it's possible!] that are matched to groups of hosts of different power. Limiting the example to CPU only: a Centrino Duo getting a WU with 2 HCC images while an i7-2600 gets WUs with 10 images, or the former getting a FAAH job with 20 dockings and the latter one with 100 dockings. In all, when the average target runtime of a science is 6 hours, every host would finish much closer to that 6-hour target, instead of runtimes spreading from 2 to 24 hours between the fastest and the slowest.
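
A minimal sketch of that sizing idea (the names and numbers here are all assumptions, nothing WCG has announced):

# Hypothetical sizing rule: pick how many images/dockings go into a WU so the
# estimated runtime on a given host lands near the science's target runtime.
def items_per_wu(host_gflops_per_s, gflops_per_item, target_hours=6.0):
    seconds_per_item = gflops_per_item / host_gflops_per_s
    return max(1, round(target_hours * 3600 / seconds_per_item))

# Illustrative: a slow CPU vs. a faster one, same 6-hour target.
print(items_per_wu(2.0, 2000.0))    # ~22 items for the slow host
print(items_per_wu(10.0, 2000.0))   # ~108 items for the fast host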

Here's hoping
[Jan 18, 2013 7:06:20 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Validation Running Behind?

BTW, vis-a-vis the 2 outage notices, there's an unsaid piece of good news in the not-so-good news [unless I missed it being said]. Hope knreed is able to fold this in on the go whilst doing the software side of the upgrades. Knowing WCG/IBM policy, let's not get ahead of ourselves and set an expectation that could fly back in our face [Mr. Murphy is attentive 24/7 ;O].
[Jan 18, 2013 8:44:12 AM]