World Community Grid Forums
Thread Status: Active Total posts in this thread: 36
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 986 Status: Offline
"2023-01-16: From 07:40:04 UTC to 13:32:32 UTC I've only been getting re-sends (_2 and _3) of MCM1, 121 tasks in total."

Hmmm... Now, let's see: about a week ago the amount of MCM1 work being generated was increased, and about 6 days later there's a substantial tranche of retries. I wonder why that might be :-)

[Edit: I notice Unixchick referenced the relevant News post before I got my reply in!]

Of course, it won't all be down to users with queue sizes that may seem excessive (by accident or by design), but when there's more work about, those systems may well collect more of it... When the amount of work available is lower, I see far fewer deadline-related retries (my own or those going to wingmen), but that may be coincidental. And, of course, if there are download issues there will be more retries because of download errors, but I've not seen any of those amongst my wingmen for quite a while now...

Cheers - Al.

P.S. imagine what the retry scenario is going to be like when ARP1 starts up properly again :-)

[Edit 1 times, last edit by alanb1951 at Jan 16, 2023 4:02:29 PM]
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7697 Status: Offline
I see I got a rash of _2 and _3's. I looked through them and most of them were the result of an error. There were a couple of "no reply" and a couple of "server aborted" ones. The "server aborted" ones were when somebody ran past the time limit but did finish the unit before my system had a chance to start the unit. So on these, there was no time accrued by my system, but somebody else probably has a queue which is just full enough to miss a deadline, but not by much.

With the disparities in the run times of the units between the LOO and NFCV types, I can understand why BOINC would overload a queue. If the system got mostly the shorter units for a while, the queue would accumulate enough units to satisfy the parameters set by the user. When a group of the longer-running units came into the queue, the amount of work to be done could almost double, resulting in some missed deadlines.

A possible solution to this could be to separate the two types and allow the user to specify which type they would like to have. This would provide more uniformity in the run times and virtually eliminate the problem of inadvertently overloaded queues. Another possible solution would be to tweak the units to have more uniformity in run times between the two types. But that would probably involve re-writing some code, so it might not be worth it.

At any rate, as long as the systems don't waste any time on running extra units when they are unnecessary, maybe nothing further needs to be done.

Cheers
Sgt. Joe
*Minnesota Crunchers* |
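
A rough Python sketch of the queue overload described in the post above. The buffer size, thread count, and the 2-hour and 4-hour runtimes are illustrative assumptions, not measured WCG figures, and the two helper functions are hypothetical simplifications rather than the BOINC client's actual scheduling logic.

```python
def tasks_fetched(buffer_hours: float, threads: int, est_runtime_h: float) -> int:
    """Number of tasks a client would hold if it sized its queue
    purely from an estimated per-task runtime (a simplification)."""
    return int(buffer_hours * threads / est_runtime_h)


def hours_to_clear(n_tasks: int, threads: int, actual_runtime_h: float) -> float:
    """Wall-clock hours needed to finish n_tasks at the actual runtime."""
    return n_tasks * actual_runtime_h / threads


# Hypothetical numbers: a 4-thread host asking for 48 hours of work,
# estimating 2 h per task because it has recently seen only short units.
threads = 4
queued = tasks_fetched(buffer_hours=48, threads=threads, est_runtime_h=2.0)

# If the queued tasks turn out to be the longer type (assumed 4 h each),
# the same queue represents twice the wall-clock time.
print(queued)                                # 96
print(hours_to_clear(queued, threads, 2.0))  # 48.0
print(hours_to_clear(queued, threads, 4.0))  # 96.0
```

Under these assumed numbers, a queue sized for two days of short units would need four days if every unit were the long type, which is how a modest buffer setting could still run into a deadline.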
||
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 609 Status: Offline
alanb1951 wrote:
"users with queue sizes that may seem excessive"

Define excessive. My slower 4-core CPU has about 15 hours in the queue. I determined 15 hours by 16 WUs at 3.5 hours each. I presume this is NOT excessive.

And for my own info, where would I see re-send data?
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 986 Status: Offline
BobbyB wrote:
"users with queue sizes that may seem excessive" Define excessive. My slower 4-core CPU has about 15 hours in the queue. I determined 15 hours by 16 WUs at 3.5 hours each. I presume this is NOT excessive. And for my own info, where would I see re-send data?

For what it's worth, I also run about 16 hours on my largest machines (which aren't exactly powerhouses...) and much lower on my laptop and other "smaller" systems which don't ever run bigger tasks such as WCG's ARP1. Limits are tuned to try to ensure I always return WCG stuff within 24 hours of receipt (unless there's a system problem), so the ability to tune download quantities via the profile is a blessing!

As for re-send data, if one only wants the occasional information it's all in the results section of the web site, but there's a lot of [tedious] clicking involved! So some users make use of a WCG-provided API to collect information about individual tasks they have processed. There was a version of the API from before the web-site overhaul, but it was not very flexible. With the web-site changes came a newer API which allows collection of work-unit data as well as information about one's own results (akin to what the web site provides), so it's possible to see which wingmen failed and why...

I've got some Python scripts that do this, specific to my needs. User adriverhoef has a useful general application (which may or may not be Linux only - not sure about that). There are various other ways of getting at the information, including a web-page based trial interface to the new API.

I found out what I needed to know by forum searches (much to my surprise!), but I suspect that trying to describe it in full detail would run over some limit on message size :-)

Cheers - Al.
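
As a loose illustration of the "which wingmen failed and why" step, here is a minimal Python sketch that tallies non-valid outcomes. It makes no API calls: the record layout, field names, task names, and status strings are all assumptions made up for the example and are not the actual WCG API response format.

```python
from collections import Counter

# Hypothetical records: the field names and status strings below are
# assumptions for illustration, not the actual API response format.
wingman_results = [
    {"name": "MCM1_example_0", "status": "Valid"},
    {"name": "MCM1_example_1", "status": "No Reply"},
    {"name": "MCM1_example_2", "status": "Valid"},
    {"name": "MCM1_example2_0", "status": "Error"},
    {"name": "MCM1_example2_1", "status": "Valid"},
    {"name": "MCM1_example2_2", "status": "Server Aborted"},
]


def tally_failures(results):
    """Count non-valid outcomes by status string."""
    return Counter(r["status"] for r in results if r["status"] != "Valid")


failures = tally_failures(wingman_results)
for status, count in failures.most_common():
    print(f"{status}: {count}")
```

A real script would first have to fetch and parse the data, and would need to map whatever status codes the API returns onto readable labels.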
||
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 609 Status: Offline
Ah! Then not guilty.

alanb1951 wrote:
"adriverhoef has a useful general application"

Got those in the past. Still have them. Not really checking stats anymore. I just let the machines run themselves. Check up on them 2-3 times a day. TN-Grid fills the holes when they run out of WCG. I still have not been able to turn off other projects without running dry. Tried 3 times so far.

[Edit 1 times, last edit by BobbyB at Jan 16, 2023 5:40:51 PM]
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 303 Status: Offline
"As to queue sizes: If you always return work well before deadlines your queue sizes aren't excessive :-)"

YOU are absolutely correct! But when the WU is returned MORE than six (6) days after the original WU was received by that volunteer's system and THEIR processing time is LESS than two (2) hours, wouldn't you agree that their queue size IS EXCESSIVE?
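
The test proposed in this post can be written down as a small Python check. This is only a sketch: the function name and the sample timestamps are hypothetical, and the thresholds are the six days and two hours mentioned above.

```python
from datetime import datetime, timedelta


def looks_overqueued(sent, returned, cpu_hours,
                     max_turnaround=timedelta(days=6), max_cpu_hours=2.0):
    """True when a short-running task took a long time to come back."""
    return (returned - sent) > max_turnaround and cpu_hours < max_cpu_hours


# Hypothetical send time for a task that needed 1.5 hours of processing.
sent = datetime(2023, 1, 9, 8, 0)

# Returned 6 days 12 hours after it was sent: flagged.
print(looks_overqueued(sent, datetime(2023, 1, 15, 20, 0), 1.5))  # True

# Returned 18 hours after it was sent: not flagged.
print(looks_overqueued(sent, datetime(2023, 1, 10, 2, 0), 1.5))   # False
```

A single flagged result says little on its own; the check is more meaningful when the same host trips it repeatedly.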
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 986 Status: Offline
bfmorse wrote:
"As to queue sizes: If you always return work well before deadlines your queue sizes aren't excessive :-)" YOU are absolutely correct! But when the WU is returned MORE than six (6) days after the original WU was received by that volunteer's system and THEIR processing time is LESS than two (2) hours, wouldn't you agree that their queue size IS EXCESSIVE?

If they are "persistent offenders" then yes, I'd agree that there's an issue [1] -- there are certainly users out there who have work-load mixes that seem to regularly see results returned close to deadlines, which is extremely bad for the project(s) concerned if (like ARP1 and HST1) they can't advance a specific data item until the previous unit of work is completed. Some of these may, however, be late returners because they don't have "always on" networks, as we only get to see when the task was returned, not when it was actually finished.

Personally, I'd like to see a reduction in the default deadlines for shorter-running applications such as MCM1, OPN1 and SCC1 (if/when it returns), combined with the return of the grace day for those sub-projects -- that would result in active clients killing off tasks earlier, which might shift the balance away from "No Reply" retries to "Not Started by Deadline" retries (which should show up as Errors a day earlier than No Reply), and would help reduce the irritation associated with having tasks Server Aborted because a No Reply task actually turned up a few hours late, as the client would assume the shorter deadline applied...

Cheers - Al.

[1] As (unlike the case with a lot of other BOINC projects) we [as users] can't actually see a user's recent work returns in fine detail, and can't determine what other projects they might be running, we can't tell whether it's excessive queues, inter-project clashes or who knows what else. We might be fairly sure, but... :-)

[Edit 1 times, last edit by alanb1951 at Jan 16, 2023 6:15:54 PM]
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2172 Status: Offline
BobbyB wrote:
"And for my own info where would I see re-send data."

One way is to go to Results, press the End key, set 'Items per page' to a large enough number, scroll through the results, find the beginning and end of the pattern that you are looking for, select all those lines, copy and paste them into a line-counting program (such as 'wc' on Linux) and presto!

Another way, as alanb1951 replied:
"So some users make use of a WCG-provided API to collect information about individual tasks they have processed. I've got some Python scripts that do this, specific to my needs. User adriverhoef has a useful general application (which may or may not be Linux only - not sure about that)."

It is Linux only; its homepage begins with "This is software for Linux".

Using wcgstats from that software package to carry out this task involves some knowledge of the command line and of which files are used. (...) I did this:

    $ wcgstats -w*MCM1_......._...._3 -S

At this point I typed ^Z to suspend wcgstats temporarily. Then I located the temporary files used by wcgstats in this session:

    $ ls -t /tmp/wcgstats.* | head -6

The needed file is the one with ".entries." in its name. Then I used 'less -N' to scroll through the file with line numbers, searched for "_2<TAB>" (three characters) to find the first occurrence, noted the line number, searched for "_[01]<TAB>" (6 characters), noted the line number; then subtracted the two line numbers and the answer was 121.
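
The manual line counting described above could also be scripted. The Python sketch below counts tab-separated lines whose first field is a task name ending in _2 or _3. The sample lines and their second column are made up for illustration, and the real layout of the wcgstats ".entries." file is an assumption here (only "task name followed by a TAB" is taken from the post).

```python
import re

# In this thread, task names ending in _2 or _3 are treated as re-sends,
# while _0 and _1 are the initial copies.
RESEND = re.compile(r"_[23]$")


def count_resends(lines):
    """Count tab-separated lines whose first field ends in _2 or _3."""
    return sum(1 for line in lines if RESEND.search(line.split("\t", 1)[0]))


# Hypothetical sample lines: task name, TAB, then some other column.
sample = [
    "MCM1_0000001_0001_0\tValid",
    "MCM1_0000002_0002_2\tValid",
    "MCM1_0000003_0003_3\tPending",
    "MCM1_0000004_0004_1\tValid",
]
print(count_resends(sample))  # 2
```

To run it against a real file, the list could be replaced by the lines read from that file, provided the task name really is the first tab-separated field.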
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1957 Status: Offline
"Going to have to start looking at other distributed computing projects."

Good luck... I get plenty of MCM1 WUs on all my hosts, Windows, Linux, Android (don't have a macOS machine running at this time though), with no issues whatsoever. I did notice a lot of resends the last couple of days, _2 and today even a bunch of _3, but certainly not a general problem at WCG at the moment that would result in any info post (well, Cyclops posted a general news update on 1/13) here in the forum or on the Facebook page...

Ralf
||
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 609 Status: Offline
"Going to have to start looking at other distributed computing projects"

TN-Grid. Follow the instructions on their front page: create an account first.