World Community Grid Forums
Thread Status: Active Total posts in this thread: 822
Author |
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
The only part of the pipeline that this work tested was the ability to send out work units to members. I think it sent out about 70-80k before it was caught and fixed. This means that during that time we sent out about 3 times the work we generally send out for WCG (normally we send around 40-50k across all projects). This is probably the easiest part of the pipeline, so the servers handled it well. We did not have to worry about validating 3x as many results, handling the uploads of 3x the results on the grid, etc. Those steps can run over time on the backend and are not things members notice (other than validation).
I do understand the questions folks have, but I cannot cause the entire infrastructure to crash. The team at World Community Grid is very keen on keeping everything running smoothly, from HSTB to ARP to OPNG and everything in between. So please, there is no need for quarrels between members. I hold the keys, and if we can increase it, trust me, I will let everyone know. Thanks, -Uplinger |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
"thanks for the reply. sounds like you'll continue to increase WU availability as your confidence in your process and system stability grows, and that's good to hear. you could also likely leave the WU distribution quantity the same (2000/30mins) and just increase the WU size (more jobs per WU, or harder jobs per WU) to increase the work being done without negatively impacting your infrastructure. that's kind of a win-win."

The complexity of the work units (harder jobs) is given to us by the researchers. These are the target/ligand pairs, or jobs, in a given work unit. The researchers need to identify which combinations they would like to expand upon. Currently, to get things flowing and to make sure the work being done is ultimately validated on their servers, getting a baseline for these is best. I do not know what a really difficult combination is yet, and I imagine it may cause problems for my work unit generator. Tweaks will probably be needed on that end.

Just adding more jobs to a work unit does not solve post-processing, for example. We still need to validate each job within a work unit, and then the researchers perform more analysis on each pair. Thanks, -Uplinger |
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2175 Status: Offline |
Thanks Uplinger!
So, what went wrong with all these invalids, and Server Aborts? |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
"Look at your invalids, and server aborted first....... Things did not end well
This is not normal: https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=618783001
And: https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=618613464
And there's tons more out there."

I need to go into the validator logs to dig these out. Give me a few minutes to check...

Edit: As for server aborts, it's because they got to 5 total results sent out. This causes the work unit to be marked as error. I should probably increase that to 7, which we've done in the past for other projects. I'll review that value tomorrow.
Thanks, -Uplinger
[Edit 1 times, last edit by uplinger at Apr 14, 2021 4:06:04 AM] |
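Editor's sketch: the server-abort rule described above (a work unit is marked as error once 5 total results have been sent out without a valid outcome) can be illustrated roughly as follows. This is a simplified assumption, not the actual server code; the function name and the parameter name `max_total_results` are chosen here for illustration only.

```python
def should_mark_error(results_sent: int,
                      has_canonical_result: bool,
                      max_total_results: int = 5) -> bool:
    """Return True if the work unit would be marked as error.

    Assumed rule (simplified from the post): once the number of results
    sent out reaches the cap and no canonical (validated) result exists,
    the work unit is given up on and the remaining copies are aborted.
    """
    if has_canonical_result:
        # A validated result exists, so the cap no longer matters.
        return False
    return results_sent >= max_total_results


# With the current cap of 5, a work unit with 5 invalid copies errors out.
print(should_mark_error(results_sent=5, has_canonical_result=False))
# With the cap raised to 7, the same work unit gets two more attempts.
print(should_mark_error(results_sent=5, has_canonical_result=False,
                        max_total_results=7))
```

Raising the cap only buys more attempts. If every copy of a work unit fails validation for the same reason, it would still end in error, just later.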
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2175 Status: Offline |
There are tons of them out there. I reacted because I never have any invalids from my computers, so when the server aborts began, I went and checked. I think I have at least two more such WUs with wingmen also going invalid. I'm not the only one who has them, and it began after the big dump of WUs.
Edit: Also, these "invalids" seem to run much shorter than usual, looking as if they weren't created correctly. [Edit 1 times, last edit by Grumpy Swede at Apr 14, 2021 4:09:42 AM] |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
2021-04-14 00:14:04.1624 [WU#618613464 OPNG_0002322_00114] handle_wu(): No canonical result yet
2021-04-14 00:14:04.2062 [CRITICAL] [RESULT#.....] Runs Invalid: All energy valuations were positive.
2021-04-14 00:14:04.2065 [CRITICAL] [RESULT#1627981583 OPNG_0002322_00114_2] checkGPUXml returned false which means it failed.

This indicates that the answers became unsuitable and invalid from a science perspective. This is actually a really good question/problem. It is a case we did not encounter during beta testing, and it will probably lead to a change in validation and how things are handled. I have the result files saved from the 3 that were returned and will examine them in greater detail tomorrow (first task of the day). Thank you for bringing this to my attention! -Uplinger |
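Editor's sketch: the log message suggests the validator rejects a result when every reported energy value is positive. The real `checkGPUXml` is not shown in this thread, so the following is only a guess at the shape of such a check. The function name `check_energies`, the list-of-floats input, and the handling of an empty list are all assumptions.

```python
from typing import List, Tuple


def check_energies(energies: List[float]) -> Tuple[bool, str]:
    """Hypothetical stand-in for the energy check hinted at in the log.

    Returns (is_valid, reason). A result is treated as invalid when it
    reports no energies at all, or when every energy is positive.
    """
    if not energies:
        # Assumption: a result with no energies cannot be validated.
        return False, "Runs Invalid: No energy valuations found."
    if all(e > 0 for e in energies):
        # Mirrors the wording of the validator log line quoted above.
        return False, "Runs Invalid: All energy valuations were positive."
    return True, "ok"


# Example values are made up for illustration only.
print(check_energies([3.2, 0.7, 12.5]))   # every value positive
print(check_energies([-7.4, 1.1, -5.9]))  # at least one non-positive value
```

A check like this rejects the result on every host that runs it, which would fit the pattern reported in this thread where all wingmen for the same work unit went invalid.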
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2175 Status: Offline |
Thanks Uplinger. As you can see from the database, I'm sure, there are plenty of those all-invalid WUs and/or WUs that were Server Aborted due to too many invalids.
Good luck tomorrow, then, in finding the reason and why it happened to so many of the WUs released during the "mistake". Goodnight (Good morning for me) [Edit 1 times, last edit by Grumpy Swede at Apr 14, 2021 4:23:15 AM] |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
It appears that there could have been a grouping of target/ligand pairs (jobs) that were not viable and thus not good drug candidates. These failed validation, but could still be valid scientific results. I say this seems like a grouping because I can see them in the database. Previously it was 1 in a batch at random, which could just be statistically unlucky. But this seems like a different problem and will require more review. You can see the groupings by batch below. E and R are basically the same: R stands for rerun, which means the work unit was marked as error on at least one attempt by a group of members.
cnt  batch         status
  1  OPNG_0000007  E
  1  OPNG_0000021  E
  1  OPNG_0000028  E
  1  OPNG_0000097  E
  1  OPNG_0000129  E
  1  OPNG_0000222  E
  1  OPNG_0000279  R
  1  OPNG_0000412  R
  1  OPNG_0000538  R
  1  OPNG_0000556  R
  1  OPNG_0000610  E
  1  OPNG_0000610  R
  1  OPNG_0000643  R
  2  OPNG_0000740  R
  1  OPNG_0000747  R
  1  OPNG_0000765  R
  1  OPNG_0000973  R
  4  OPNG_0001054  R
  2  OPNG_0001069  R
  1  OPNG_0001074  R
  1  OPNG_0001252  E
  1  OPNG_0001299  R
  1  OPNG_0001347  E
  1  OPNG_0001350  R
  1  OPNG_0001416  R
  1  OPNG_0001458  R
  1  OPNG_0001459  E
  1  OPNG_0001459  R
  1  OPNG_0001468  R
  1  OPNG_0001487  E
  1  OPNG_0001493  R
  1  OPNG_0001706  E
  2  OPNG_0001836  E
  1  OPNG_0002064  E
  4  OPNG_0002227  E
  5  OPNG_0002248  E
  9  OPNG_0002264  E
  1  OPNG_0002279  E
  1  OPNG_0002302  E
138  OPNG_0002322  E
  8  OPNG_0002326  E
119  OPNG_0002331  E
 22  OPNG_0002341  E
 26  OPNG_0002347  E
 44  OPNG_0002348  E
  1  OPNG_0002349  E
 43  OPNG_0002370  E
 19  OPNG_0002374  E
 50  OPNG_0002388  E
 88  OPNG_0002410  E
  6  OPNG_0002414  E
 55  OPNG_0002424  E
  6  OPNG_0002430  E
 10  OPNG_0002437  E
 10  OPNG_0002445  E
  8  OPNG_0002449  E
  9  OPNG_0002461  E
 14  OPNG_0002468  E
  2  OPNG_0002473  E
 12  OPNG_0002477  E
  8  OPNG_0002481  E
  5  OPNG_0002506  E
  5  OPNG_0002507  E
  2  OPNG_0002533  E
Thanks, -Uplinger |
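Editor's sketch: one way to separate the "1 in a batch, randomly unlucky" cases from the clustered batches is to sum the counts per batch (merging E and R, since the post treats them as basically the same) and flag batches above some threshold. The sample rows are copied from the listing above. The threshold of 50 and the function names are arbitrary choices made for illustration.

```python
from collections import Counter

# A few rows taken from the listing above: count, batch, status.
SAMPLE_ROWS = """\
1 OPNG_0000610 E
1 OPNG_0000610 R
138 OPNG_0002322 E
119 OPNG_0002331 E
88 OPNG_0002410 E
"""


def totals_by_batch(text: str) -> Counter:
    """Sum error counts per batch, combining the E and R statuses."""
    totals: Counter = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Skip the header line and anything that is not "cnt batch status".
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        count, batch, _status = parts
        totals[batch] += int(count)
    return totals


def clustered_batches(totals: Counter, threshold: int = 50) -> list:
    """Return batches whose combined count meets the threshold, largest first."""
    return [batch for batch, n in totals.most_common() if n >= threshold]


totals = totals_by_batch(SAMPLE_ROWS)
print(clustered_batches(totals))
```

On the full listing, the same grouping would make it easier to see that the large counts start appearing around batch OPNG_0002322, while earlier batches mostly show a single failure each.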
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2175 Status: Offline |
Well, it seems you know what you're going to do tomorrow (as if you didn't have enough to do already).
So, good luck, and hopefully it's an easy fix for that problem. |
||
|
maeax
Advanced Cruncher Joined: May 2, 2007 Post Count: 142 Status: Offline |
This task https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=619146340 was finished without a wingman.
----------------------------------------
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
|
||