Thread Status: Active
Total posts in this thread: 3315
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 981
Re: Work Available

Adri,

Recognized that machine name :-) It also provided my two recent SIGSEGV examples, though for two different work units:

ARP1_0002741_135_1 sent 2023-01-23T18:45:48 returned 2023-01-25T14:08:22
ARP1_0001793_137_0 sent 2023-01-25T12:41:29 returned 2023-01-26T20:32:29

The former task seemed to have been restarted after the third checkpoint and crashed without reaching the next one. The latter crashed after 5 checkpoints.

I have seen quite a few SIGSEGV returns for otherwise valid ARP1 units since the start of the migration process (I wasn't recording wingman data until then...), and the vast majority of them were down to a couple of hosts - this is the first time I've seen this one.

I guess we'll never know why this happens...

Cheers - Al.
[Jan 30, 2023 9:54:14 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2209
Re: Work Available

Finally, my "Oops" ARP1_0010948_136_3 task is finished and validated. My old i7-3630QM CPU isn't the fastest on Earth, but at least I finished this ARP faster than my wingman. :-D
----------------------------------------
[Edit 3 times, last edit by Grumpy Swede at Jan 31, 2023 8:31:07 AM]
[Jan 31, 2023 8:24:41 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2171
Re: Work Available

Al,
Remember a device called Ryzen-OneHorseShay? Of course you do. :-) 'Twas yesterday that we discussed it.

New developments this time. On the one hand, there's a SIGSEGV:
ARP1_0033822_137_1  Linux Ubuntu  Error  2023-01-25T12:41:29  2023-01-26T12:08:18    2.28/2.30      72.8/0.0   
Devicename: Ryzen-OneHorseShay
Logfile:
<core_client_version>7.17.0</core_client_version>
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[03:04:22] INFO: Checkpoint taken at 2019-04-01_06:00:00
SIGSEGV: segmentation violation
Stack trace (18 frames):
[0x2d13b72]
[0x2da0400]
[0x1ed9107]
[0x1e9c664]
[0x1e9444a]
[0x1e8997c]
[0x188518c]
[0x1b6f8e2]
[0x135f570]
[0x11f86d4]
[0x5848b7]
[0x584ece]
[0x448f61]
[0x4475c9]
[0x440967]
[0x2eb2344]
[0x2eb25c1]
[0x405466]

Exiting...

</stderr_txt>
with one wingman Pending Validation,
and on the other hand there's something else:
ARP1_0020353_137_0  Linux Ubuntu  Error  2023-01-25T12:41:29  2023-01-25T21:53:46    2.69/2.70      84.0/0.0   
Devicename: Ryzen-OneHorseShay
Logfile:
<core_client_version>7.17.0</core_client_version>
<message>
process exited with code 37 (0x25, -219)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[13:31:13] INFO: Checkpoint taken at 2019-04-01_06:00:00
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 771
ZDC + Z0C + 2m is larger than the 1st WRF level Stop in subroutine urban - change ZDC and Z0C
-------------------------------------------
ERROR:wrf_abort
14:53:40 (19067): called boinc_finish(293)

</stderr_txt>

Here, too, one other wingman is Pending Validation.
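A side note on the numbers in the two <message> lines: the second log calls boinc_finish(293), yet the reported code is 37 (0x25, -219). That matches an exit status being cut down to its low 8 bits, with the third number being the same byte minus 256. This is only a reading of the two logs above, not a statement about how the BOINC client builds that message; the function name below is made up for illustration.

```python
def exit_code_views(status):
    """Return (low byte, hex of low byte, low byte minus 256).

    Assumption: the three numbers in the <message> line follow this
    pattern; it is inferred from the two logs, not from BOINC docs.
    """
    low = status & 0xFF  # a POSIX exit status keeps only the low 8 bits
    return low, hex(low), low - 256


print(exit_code_views(293))  # value passed to boinc_finish in the second log
print(exit_code_views(193))  # code reported for the SIGSEGV task
```

Under that assumption the two calls give (37, '0x25', -219) and (193, '0xc1', -63), which agrees with both logs.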

Adri
[Jan 31, 2023 7:21:37 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 981
Re: Work Available

Adri,

That second one is interesting, being a data corruption that the software could spot and identify in enough detail! Unfortunately, the same isn't true for SIGSEGV without a symbol table and the source code! :-)

I've just looked at my latest wingmen and I note another SIGSEGV (ARP1_0001793_137_0) and an Invalid (ARP1_0005836_138_1) from a machine with that name. The former is waiting for another wingman (mine is Pending Validation) and the latter validated with another wingman...

For what it's worth, a machine with that name has also been an MCM1 wingman of mine on five occasions and all of them were valid. It looks as if there's something in the ARP1 code that upsets this particular machine, doesn't it?

As we don't know anything about the hardware of individual users at WCG, it's not possible to tell whether the various other lone SIGSEGVs that I've seen come from one specific hardware set[1] (e.g., [if AMD] early Ryzen or Threadripper) or whether they're mostly "random". That tends to mean that even if it could be fixed, it won't be :-(

Cheers - Al.

[1] There have been instances in the past where a project application was more likely to fail on some AMD hardware than on Intel... I can't remember off-hand which application(s), though.

[Edited to note two more failures]
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Feb 1, 2023 1:57:21 AM]
[Feb 1, 2023 12:56:47 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 993
Re: Work Available

Grumpy Swede wrote: Finally, my "Oops" ARP1_0010948_136_3 task is finished and validated. My old i7-3630QM CPU isn't the fastest on Earth, but at least I finished this ARP faster than my wingman. :-D


This is the speed of my machine now. It is an upgrade from the machine that did a bunch of ARPs. It isn't so much about the speed as being reliable and getting it done, and you did that well. I know spending the money on the energy is hard at this time, Grumpy Swede, so I'm glad you are still participating.

I'm enjoying the odd ARP resend when it finds its way to me. I hope the project gets their equipment fixed soon, as I would like to have ARP regularly without manually managing downloads or queue length.
[Feb 1, 2023 2:55:58 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7696
Re: Work Available

ARP1_0022054_140_2 in progress.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Feb 1, 2023 6:41:05 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12435
Re: Work Available

I gather that the hold-up at Delft is that they needed more data storage than the Uni could allow, so they had to get their own storage. It would take time to obtain and install it, and to get approval before operation.

Mike
[Feb 1, 2023 9:35:21 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12435
Re: Work Available

Sgt. Joe's re-send indicates that we are about at the end of the road for resends except for maybe a few that go for a second re-send.

Mike
[Feb 1, 2023 9:38:34 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12435
Re: Work Available

We are now in round 3 of the recent releases.

Mike
[Feb 2, 2023 3:19:27 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 993
Re: Work Available

I'm on my last 2 ARP WUs. Not sure if I'll get any more resends. I'm going to give my machine a good clean and update once they are done. Looking forward to when we get more...

Yet again I hope they take this short (please let it be short) pause time to send out the extremes.
[Feb 2, 2023 3:45:51 PM]