Thread Status: Active
Total posts in this thread: 3321
This topic has been viewed 3320507 times and has 3320 replies
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12439
Re: Work Available

Unixchick was correct except for 1 minor point.

Unlike many of BOINC's other projects, WCG has never had progress information. Many have asked, so we have had to improvise for ourselves. I hope that my daily & weekly reports are of interest.

ARP is very different from any other project in that it computes the weather in 35609 different squares of Sub-Saharan Africa. Each unit computes the weather in its square for 2 days in 24-second steps. Each checkpoint covers 6 hours.
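
As a rough illustration of those figures, here is a minimal sketch of the arithmetic. The function name and constants are only illustrative; the 18-second value is the reduced TimeStep mentioned further down in this post.

```python
SECONDS_PER_HOUR = 3600
UNIT_HOURS = 48        # each unit covers 2 days of weather
CHECKPOINT_HOURS = 6   # each checkpoint covers 6 hours

def steps_for(hours: int, timestep_s: int) -> int:
    """Number of model steps needed to cover `hours` of simulated weather."""
    total_seconds = hours * SECONDS_PER_HOUR
    if total_seconds % timestep_s:
        raise ValueError("time step does not divide the period evenly")
    return total_seconds // timestep_s

checkpoints_per_unit = UNIT_HOURS // CHECKPOINT_HOURS

for timestep_s in (24, 18):
    print(
        f"TimeStep {timestep_s}s: "
        f"{steps_for(UNIT_HOURS, timestep_s)} steps per unit, "
        f"{steps_for(CHECKPOINT_HOURS, timestep_s)} steps per checkpoint, "
        f"{checkpoints_per_unit} checkpoints per unit"
    )
```

So a unit at the 24-second step runs 7200 steps in 8 checkpoints of 900 steps each, and a unit at the 18-second step needs 9600 steps for the same 2 days.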

The name of the unit can be broken down into 4 sections. The first is ARP1 for the project. The second is a 7-digit number, 0000000 to 0035608, denoting the square we are computing. The third is a 3-digit number from 000 to 183 referring to the time period (generation); generation 000 ran from 00:00 on 1 July to 23:59 on 2 July. The fourth is a single digit indicating which copy of the unit you have.
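
For anyone who wants to pick names apart automatically, here is a minimal sketch. It assumes the four sections are joined by underscores (as in the ARP1_0034203 style names quoted in this thread); the full example name and the `start_year` parameter are hypothetical, since the year is not given above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class ArpUnit:
    project: str
    square: int       # 0 .. 35608
    generation: int   # 0 .. 183
    copy: int         # single digit

def parse_unit_name(name: str) -> ArpUnit:
    """Split a unit name into its 4 sections (underscore separator assumed)."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 sections in {name!r}")
    project, square, generation, copy = parts
    if project != "ARP1":
        raise ValueError(f"not an ARP1 unit: {name!r}")
    if not (square.isdigit() and len(square) == 7 and int(square) <= 35608):
        raise ValueError(f"bad square section in {name!r}")
    if not (generation.isdigit() and len(generation) == 3 and int(generation) <= 183):
        raise ValueError(f"bad generation section in {name!r}")
    if not (copy.isdigit() and len(copy) == 1):
        raise ValueError(f"bad copy section in {name!r}")
    return ArpUnit(project, int(square), int(generation), int(copy))

def generation_window(generation: int, start_year: int) -> tuple[datetime, datetime]:
    """48-hour window of a generation; generation 000 starts at 00:00 on 1 July.

    start_year is an assumption supplied by the caller.
    """
    start = datetime(start_year, 7, 1) + timedelta(days=2 * generation)
    end = start + timedelta(days=2) - timedelta(minutes=1)
    return start, end
```

For example, `parse_unit_name("ARP1_0034203_143_0")` (a made-up name) gives square 34203, generation 143, copy 0, and `generation_window(0, year)` runs from 00:00 on 1 July to 23:59 on 2 July of whatever year is passed in.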

Each square started in generation 000. As each unit was returned and validated, the result was compared with the weather that actually occurred in those 2 days. If OK, the next 48 hours (generation) would be released.

Because different crunchers took different amounts of time to return their completed copies, the different squares gradually got further and further out of line with each other.

Then we had units for which the computations did not match the actual weather, so they got further out of step. Many of these were cured by reducing their TimeStep from 24 seconds to 18 seconds. However, some have not responded to this adjustment, especially the 3 so-called "ultras" in generations 021 & 022.

This was starting to become a problem not long before IBM handed over to Krembil. I persuaded the IBM tech to let us have a regular report of the situation. Persuaded was probably the wrong word because he already had it in mind to do (but might otherwise have run out of time before handover).

This is the main report: https://download.worldcommunitygrid.org/boinc...rp1_stats/generations.txt

After Krembil took over, the situation became worse, but we were able to persuade them to do something about the generation spread. This year, they have been supplying units that were ready to go on a generation by generation basis. This has resulted in workable units closing in on generation 143.

A few days ago I started chasing the 1 remaining moving unit that was still accelerated. I presume that the rest and the extremes are all stuck (759).

The most advanced units reached generation 146 last year. Those within 10 generations had been classified as 'normals', the next 5 as 'accelerated' (132 to 136) and the earlier generations as 'extremes' (104 to 131). The earlier attempts to catch up involved halving the deadlines for accelerated units and also sending 3 copies for extremes. They were already halving the deadlines for resends after the mid-point of the original deadline, so some had 1.5 days and others 3 or 6 days.
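
The three groupings can be sketched as a small function. This is only my reading of the ranges quoted above (with 146 as the most advanced generation, "within 10" is taken to mean 137 to 146); the function name is hypothetical and it is not WCG's actual code.

```python
def classify_generation(generation: int, most_advanced: int = 146) -> str:
    """Group a generation by how far it lags the most advanced one."""
    lag = most_advanced - generation
    if lag < 0:
        raise ValueError("generation is ahead of the most advanced one")
    if lag < 10:
        return "normal"        # 137 to 146
    if lag < 15:
        return "accelerated"   # 132 to 136
    return "extreme"           # 131 and earlier

for gen in (104, 131, 132, 136, 137, 146):
    print(f"{gen:03d}: {classify_generation(gen)}")
```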

Perhaps someone could peer review this and Unixchick put a link on Front Page.

Mike
[Apr 7, 2025 9:34:32 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2172
Re: Work Available

Boca Raton Community HS:
We see a good amount of ARP1 work units and are happy to provide whatever information that could be helpful, but we are wondering what exactly we can share that can be of help.

The best way to track the progress of ARP1 is to let the computer do the work. Let one computer with many running tasks determine which ARP1-tasks are in progress and compile a sorted list. (Or even better, also let a number of other members' computers do that. Then let one member's computer collect the results from all those computers.)

You could do this yourself by starting to let the computer compile a sorted list of tasks and publish the results on a webpage, e.g. hourly.
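
A minimal sketch of the sorting step, assuming you already have the task names as plain strings (how you collect them from the client is up to you) and that the name sections are separated by underscores. The function names and the sample names in the usage note are hypothetical.

```python
from collections import Counter

def sort_key(name: str) -> tuple[int, int]:
    """Order tasks by generation first, then by square."""
    _project, square, generation, _copy = name.split("_")
    return int(generation), int(square)

def compile_report(task_names: list[str]) -> tuple[list[str], list[str]]:
    """Return the sorted ARP1 task names and a per-generation summary."""
    arp_tasks = sorted(
        (name for name in task_names if name.startswith("ARP1_")),
        key=sort_key,
    )
    per_generation = Counter(sort_key(name)[0] for name in arp_tasks)
    summary = [
        f"{generation:03d}: {count} task(s)"
        for generation, count in sorted(per_generation.items())
    ]
    return arp_tasks, summary
```

Feeding it something like `["ARP1_0034472_142_1", "ARP1_0000010_143_0", "ARP1_0034203_142_0"]` gives the tasks ordered by generation and square plus one summary line per generation, which is the sort of thing that could be written to a webpage hourly.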

Would you like to see an example? You could take a peek here on my website.

Adri
PS Just as I tried posting this, my Internet connection went belly up for a (long) while, so I'm now borrowing a smartphone and have activated it as a hotspot for Internet usage.
----------------------------------------
[Edit 3 times, last edit by adriverhoef at Apr 8, 2025 12:40:46 AM]
[Apr 8, 2025 12:30:25 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 986
Re: Work Available

Mike,

I hadn't answered the original Boca Raton query because I expected a reply from Unixchick or yourself...

In response to your "peer review" comment -- your message seems to be [mostly] factually accurate (and you may be far better informed than I am!)

Details regarding how they decide which generations to process at any given time (and whether it's actually good strategy to bunch everything up) would be opinion-driven![*1]

My questions would relate to what causes particular cells to stall, and what might be done about them. It was my understanding that a lot of the known stalled tasks were ones that had actually failed to produce enough results because the model iteration step was too long to handle the local conditions of the cell -- SIGSEGV or its Windows equivalent was quite a common failure in those cases and they never got as far as validation! Some other tasks undoubtedly failed for other reasons (and a simple re-issue would suffice), and some may have died when validation [or other post-processing?] couldn't succeed.

(You may recall the watch we once had for units that were getting Error returns, with responses from lots of folks, and [eventual] input from Kevin at WCG; that was quite a while ago now!)

As for validation itself -- I had presumed that the first phase of validation was simply the bitwise identity comparison of result files (done, I believe, with an MD5 checksum [or similar]). Whether a check against the expected outcome was then done as a further part of validation and, if so, how close a match was required, is unknown to me (I don't recall seeing it mentioned anywhere, but that doesn't mean it isn't the case!)
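
To show what I mean by a bitwise identity check, here is a minimal sketch that compares result files by MD5 digest. It only illustrates the idea; I don't know what the WCG validator actually runs, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def results_agree(paths: list[str]) -> bool:
    """True only if every returned copy has the same digest.

    An empty list counts as "no agreement".
    """
    digests = {md5_of_file(path) for path in paths}
    return len(digests) == 1
```

Two copies of a unit would pass this phase only if their result files hash identically; any further check against the expected outcome would be a separate step.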

I seem to recall reading that Delft used to have to approve cells moving on to the next generation; that would imply it was outside the validation process, but I might be misremembering.

As I said, your original was mostly o.k. as is -- I'm not sure whether any of what I've mentioned might help [without over-complicating things], but it's what I've got :-)

Cheers - Al.

*1 For what it's worth, I think that with the present three groupings of generations it's not good strategy as it will increase the number of units that end up taking 7+ days to resolve because of missed deadlines on "Normal" tasks... I did use the word "opinion" :-)
[Apr 8, 2025 12:43:13 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12439
Re: Work Available

Thanks, Al & Adri.

I was trying to record how things worked and what has happened from a lay perspective, because I thought that would be better for the greater number of crunchers. It is also nearly 50 years since I was involved in mainframe work as a system tester on behalf of a user department, although I was able at the time to read and interpret the programming. Since then I have been involved mostly in spreadsheet programming.

As far as I am aware we have never been informed as to why units have stuck. I was aware that there was a comparison of results with the actual weather that had occurred as provided by the Weather Channel, although whether by WCG or Delft I do not know. I had assumed that the problems were due to the local terrain as there appeared to be problems with groups of squares. I am also unaware as to how much effort might or might not have been put into trying to unstick units.

As all the stuck units in generations up to 124 only have 1 or 2 units per generation, I have assumed that there has been more of an attempt made for those generations than for subsequent generations. I may be mistaken there.

Mike
[Apr 8, 2025 1:57:17 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 986
Re: Work Available

Mike,

I sometimes wonder whether there's anything in the [ex-IBM] ARP1 toolkit to locate (and explain the reason for) stuck units! Also, it would be interesting to know how trivial (or otherwise) it is to try to re-issue a unit if there's no obvious reason it got stuck in the first place (mis-tagged data, perhaps?)

I think we're all a little frustrated at the lack of that sort of information, but it wasn't exactly forthcoming when WCG had several IBM staff working on it, so it's a bit difficult to expect the current folks to suddenly provide the answers :-)

Ah, well; at least we can process what we get sent... And on that note, it would be nice if more units were released each day when the data centre gets back to full capacity, but we'll just have to wait and see.

Cheers - Al.
[Apr 8, 2025 2:29:40 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12439
Re: Work Available

Al

IBM may have had more resources but they didn't start to try to clear the stuck units until shortly before handover so as to pass over better data. Their first strategy was to reduce the TimeStep.

After that it was to retrace a couple of generations and restart with reduced TimeStep. However I am unsure as to whether they did any retracing - Adri might be able to say. If not then that would be an idea for Krembil to try.

Mike
[Apr 8, 2025 2:51:40 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12439
Re: Work Available

Again no movement of extremes. There are 318, all of which appear to be stuck.

No accelerated units validated. 1 is now in generation 136. I think that it is the only accelerated moving out of 442. If anyone has it, please expedite it.

555 normals validated yesterday. There are now 6,419 normals in the generations being released, many of which may be stuck, particularly in generations 137 to 141.

There are now 26,882 held up in generation 143, which may start to be released soon as we are now about 95% of the way through generation 142.

There has been an irregular increase in downloads.

Mike
[Apr 8, 2025 3:03:01 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 109
Re: Work Available

Mike.Gibson:
After that it was to retrace a couple of generations and restart with reduced TimeStep. However I am unsure as to whether they did any retracing - Adri might be able to say. If not then that would be an idea for Krembil to try.

Instead of trial and error, since the work belongs to Delft, why doesn't Krembil send an email to Nick van de Giesen's team and ask for assistance in debugging and suggestions for getting this work to continue?
[Apr 8, 2025 5:24:11 PM]
Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 831
Re: Work Available

Hi!
1st one received in 4 days:

ARP1_0034203, so closer to next generation?

Another 1, late evening:

ARP1_0034472
----------------------------------------
[Edit 1 times, last edit by Hans Sveen at Apr 8, 2025 6:52:07 PM]
[Apr 8, 2025 5:37:44 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12439
Re: Work Available

Hans

That is 97%
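
One way to arrive at a figure like that is sketched below. It assumes (this is not stated anywhere in the thread) that units within a generation are released in ascending square order, so the square number shows how far through the generation the release has got; the function name is hypothetical.

```python
TOTAL_SQUARES = 35609  # squares are numbered 0000000 to 0035608

def percent_through(square: int) -> int:
    """Rounded percentage of squares up to and including this one."""
    if not 0 <= square < TOTAL_SQUARES:
        raise ValueError("square out of range")
    return round(100 * (square + 1) / TOTAL_SQUARES)

print(percent_through(34203))  # the first unit reported above
print(percent_through(34472))  # the later one
```

Under that assumption the first unit works out at about 96% and the later one at about 97%.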

Mike
[Apr 8, 2025 9:57:15 PM]