Thread Status: Active
Total posts in this thread: 70
This topic has been viewed 15590 times and has 69 replies
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1842
Status: Offline
Lots of MIP1 WUs error out

Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches) error out pretty much right away with errors like this:

Result Log

Result Name: MIP1_00209074_0717_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)
</message>
<stderr_txt>
[2019- 7-20 1:52:16:] :: BOINC:: Initializing ... ok.
[2019- 7-20 2: 4:25:][2019- 7-20 2:40:24:] :: BOINC:: Initializing ... ok.
[2019- 7-20 2:40:25:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00209074.flags -out::file::silent result_silent.out -run:jran 1549787703 -nstruct 14 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: ERROR: Comments in an option file must begin with '#', options must begin with '-' the line:

is incorrectly formatted

</stderr_txt>
]]>
And then there are WUs that error out after more than an hour of runtime with this error:
Result Log

Result Name: MIP1_00208869_0254_0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2019- 7-19 14:22:59:] :: BOINC:: Initializing ... ok.
[2019- 7-19 14:22:59:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00208869.flags -out::file::silent result_silent.out -run:jran 1669075901 -nstruct 19 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001
Finished _0001 in 508.891 seconds.
Starting work on structure: _0002
Finished _0002 in 499.702 seconds.
Starting work on structure: _0003
Finished _0003 in 362.968 seconds.
Starting work on structure: _0004
Finished _0004 in 294.592 seconds.
Starting work on structure: _0005
Finished _0005 in 334.232 seconds.
Starting work on structure: _0006
Finished _0006 in 439.174 seconds.
Starting work on structure: _0007
Finished _0007 in 366.025 seconds.
Starting work on structure: _0008
Finished _0008 in 346.353 seconds.
Starting work on structure: _0009
Finished _0009 in 417.646 seconds.
Starting work on structure: _0010
Finished _0010 in 292.767 seconds.
Starting work on structure: _0011


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01CBE547 read attempt to address 0x04B403AC

Engaging BOINC Windows Runtime Debugger...
...
<lengthy debug output omitted>
...
The latter are much more of a concern for me, as they add up to a serious amount of wasted run time...
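The two exit codes in the logs above are the same kind of number shown two ways: BOINC prints the signed 32-bit value followed by its unsigned hex form. A minimal sketch of that conversion follows. The function name and the one-entry lookup table are made up for illustration; 0xC0000005 is the Windows NTSTATUS value for an access violation.

```python
# Hypothetical helper: turn the signed exit code BOINC prints into the
# unsigned 32-bit hex form shown next to it in the result log.
# Only the NTSTATUS value that appears in this thread is listed.
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def decode_exit_code(code: int) -> str:
    """Return the unsigned 32-bit hex form of a signed exit code."""
    unsigned = code & 0xFFFFFFFF  # two's-complement wrap to 32 bits
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(decode_exit_code(-1073741819))  # the long-running failures
print(decode_exit_code(-1))           # the immediate option-file failures
```

The second call only shows that -1 maps to 0xffffffff, a generic failure code rather than a specific Windows status.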

And yes, those same hosts return other WUs, from MIP1 or other WCG projects, just fine. It is also not restricted to a small number of hosts, but occurs pretty much across the whole range of hosts that I have running WCG (50 right now), and with various BOINC clients...
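For the first kind of failure, the option-file error quoted earlier states the rule being enforced: comments must begin with '#' and options with '-'. A rough sketch of a checker for a downloaded .flags file, based only on that message, might look like this. The function name and the sample text are hypothetical, and skipping blank lines is an assumption, not something the log confirms.

```python
from typing import List, Tuple


def find_bad_flag_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, repr of line) for lines that break the rule
    quoted in the error message: '#' for comments, '-' for options."""
    bad = []
    for number, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped:
            continue  # assumption: blank lines are harmless
        if stripped[0] not in "#-":
            # repr() makes stray or non-printing characters visible
            bad.append((number, repr(raw)))
    return bad


# Hypothetical file content, not taken from a real work unit.
sample = "# example flags file\n-nstruct 14\nnstruct 14\n"
print(find_bad_flag_lines(sample))
```

Since the offending line printed in the log looks empty, the repr() output is the interesting part: it would show whether the line holds control characters or other debris from a damaged download.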

Ralf
----------------------------------------

[Jul 21, 2019 6:36:00 PM]
Jean-David Beyer
Senior Cruncher
USA
Joined: Oct 2, 2007
Post Count: 334
Status: Offline
Re: Lots of MIP1 WUs error out

Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches) error out pretty much right away ...
and then there are WUs that error out after more than an hour of runtime ...


I have not noticed this. My most recent work units have all come in as valid. I am running on a RHEL6 Linux machine with a four-core Intel Xeon processor (64-bit).
----------------------------------------

[Jul 22, 2019 1:21:48 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7236
Status: Recently Active
Re: Lots of MIP1 WUs error out

Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches) error out pretty much right away with errors like this:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01CBE547 read attempt to address 0x04B403AC

I have also seen this error. It has occurred on my Q6600, which runs all of the other projects just fine, but about half of the MIPS units ended in an error. I only ran the MIPS units in the interval between Zika's pause and the start of SCC again. That machine only has 2 GB of memory. I think MIPS units make such heavy use of memory that there was not enough of it to avoid conflicts. I also had MCM running, and I think that if there were 2 or 3 MCM and 1 or 2 MIPS running, the units would finish OK. But if more than 3 or 4 MIPS were running at any one time, they would error. Once I saw the number of MIPS units which were not completing satisfactorily, I ceased running MIPS and just stuck with MCM until SCC resumed. So, my theory is that some type of memory problem is causing the errors for you. Good luck.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jul 22, 2019 2:06:45 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1842
Status: Offline
Re: Lots of MIP1 WUs error out

That machine only has 2 GB of memory. I think MIPS units make such strong use of memory that there was not enough memory to avoid conflicts.
These errors occur on 4 core/thread PCs with 8GB of RAM, so I think that RAM isn't really an issue here...

I am likely to run MIP1 only for a couple more days, until I get my 50y Diamond badge, and then run (besides the elusive HST) SCC and FAH2 from that point on...

Ralf
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by TPCBF at Jul 22, 2019 3:54:52 AM]
[Jul 22, 2019 3:53:17 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Lots of MIP1 WUs error out

RAM should not be an issue: if it were, it would be reported in the event log AND the task(s) would be stopped with a "waiting for memory" message until enough free RAM is available again. If it could be proven to be a memory availability problem, it is most likely a reportable BOINC bug for GitHub.
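One way to test that claim on an affected host is to search a saved copy of the event log for the message. The sketch below assumes the wording is exactly as quoted in this post ("waiting for memory"); the text a given client version actually prints may differ, and the sample lines are invented for illustration.

```python
from typing import List


def memory_wait_lines(log_text: str, needle: str = "waiting for memory") -> List[str]:
    """Return every log line containing the needle, ignoring case.
    The default needle is the phrase quoted in the post, not a
    verified BOINC message string."""
    wanted = needle.lower()
    return [line for line in log_text.splitlines() if wanted in line.lower()]


# Hypothetical event log excerpt.
sample_log = (
    "22-Jul-2019 12:00:00 [World Community Grid] Starting task EXAMPLE_TASK_0\n"
    "22-Jul-2019 12:05:00 [---] EXAMPLE_TASK_0: Waiting for memory\n"
)
for line in memory_wait_lines(sample_log):
    print(line)
```

If the search comes back empty around the time of the failures, that would fit the argument that RAM shortage is not what is killing these tasks.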
----------------------------------------
[Edit 1 times, last edit by Former Member at Jul 22, 2019 12:13:59 PM]
[Jul 22, 2019 12:13:27 PM]
Acid303
Cruncher
Germany
Joined: Nov 16, 2011
Post Count: 12
Status: Offline
Re: Lots of MIP1 WUs error out

Same problem here. Updating to BOINC 7.14.2 fixes the issue.
----------------------------------------

[Aug 27, 2019 9:35:32 AM]
BoincST
Cruncher
Joined: Feb 25, 2010
Post Count: 12
Status: Offline
Re: Lots of MIP1 WUs error out

Lately all WUs error out on my computers after rebooting.

Result Log

Result Name: MIP1_00228254_1531_0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2019- 9-26 16:37:33:] :: BOINC:: Initializing ... ok.
[2019- 9-26 16:37:33:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00228254.flags -out::file::silent result_silent.out -run:jran 2014355682 -nstruct 11 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01C7D380 read attempt to address 0x87FFFFFC


Engaging BOINC Windows Runtime Debugger...


The reason shown in the exception record is always the same.
----------------------------------------
[Edit 1 times, last edit by BoincST at Sep 27, 2019 7:28:01 AM]
[Sep 27, 2019 7:23:56 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7236
Status: Recently Active
Re: Lots of MIP1 WUs error out

First: Reboot
If that does not cure it, then
Second: Do a memory check
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 27, 2019 2:12:05 PM]
GridWatcher
Cruncher
Joined: Jul 9, 2007
Post Count: 6
Status: Offline
Re: Lots of MIP1 WUs error out

Hello everyone. I recently added a Win10 box (Q6600 w/ 8 GB RAM) into my mix. Not the best processor, but a workhorse that someone was decommissioning and was going to throw out. I set it to work on the non Covid and African weather projects as it was showing long run times on those projects.

Anyway, I've noticed that in the last week or two the box has been throwing quite a few error results. Not all WUs have errored out, but more than I am used to. I've made sure it's all patched up and free of any malware/stuff that might take up resources. So far, no issues. The errors seem to be similar to the errors in the last few posts. I was wondering if anyone has figured out what the issue was...

Result Name: MIP1_00302712_0688_0--
<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2020- 6-11 22: 2: 4:] :: BOINC:: Initializing ... ok.
[2020- 6-11 22: 2: 4:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00302712.flags -out::file::silent result_silent.out -run:jran 982596290 -nstruct 5 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001
Finished _0001 in 1209.13 seconds.
Starting work on structure: _0002
Finished _0002 in 1771.3 seconds.
Starting work on structure: _0003


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01D0E547 read attempt to address 0x07B043AC
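To compare the exception records posted in this thread side by side, the "Reason:" lines can be pulled apart with a regular expression. The sketch below uses the three lines exactly as posted above; the function name and the idea of comparing the low 16 bits of the faulting address are illustrative choices, not an established diagnostic.

```python
import re
from typing import Dict, Optional

REASON = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"read attempt to address (0x[0-9a-fA-F]+)"
)

# The three Reason lines as posted in this thread.
lines = [
    "Reason: Access Violation (0xc0000005) at address 0x01CBE547 read attempt to address 0x04B403AC",
    "Reason: Access Violation (0xc0000005) at address 0x01C7D380 read attempt to address 0x87FFFFFC",
    "Reason: Access Violation (0xc0000005) at address 0x01D0E547 read attempt to address 0x07B043AC",
]


def parse_reason(line: str) -> Optional[Dict[str, int]]:
    """Extract the status code, faulting address and target address."""
    match = REASON.search(line)
    if match is None:
        return None
    code, fault, target = (int(group, 16) for group in match.groups())
    return {"code": code, "fault": fault, "target": target}


for line in lines:
    rec = parse_reason(line)
    print(
        f"fault=0x{rec['fault']:08X} "
        f"low16=0x{rec['fault'] & 0xFFFF:04X} "
        f"target=0x{rec['target']:08X}"
    )
```

The first and third records share the same low 16 bits of the faulting address, which could mean the same instruction loaded at a different base address, but that is only a guess; the second record faults somewhere else entirely.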
[Jun 12, 2020 5:58:54 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Status: Offline
Re: Lots of MIP1 WUs error out

After a routine update of OS packages (Arch Linux) on 29 August, MIP1 -- and only MIP1 -- WUs are segfaulting on all my nodes.

The logs don't show much useful info:

Sep 01 16:09:13 node01 boinc[2870]: mv: cannot stat 'slots/5/result_silent.out': No such file or directory
Sep 01 16:09:13 node01 boinc[570]: 01-Sep-2020 16:09:13 [World Community Grid] Computation for task MIP1_00317797_0919_0 finished
Sep 01 16:09:13 node01 boinc[570]: 01-Sep-2020 16:09:13 [World Community Grid] Output file MIP1_00317797_0919_0_r2140855931_0 for task MIP1_00317797_0919_0 absent
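With six nodes failing the same way, it may help to collect the affected task names from each journal rather than reading them by eye. A small sketch, assuming the client keeps wording its message as in the lines above ("Output file ... for task ... absent"); the function name is made up for illustration.

```python
import re
from typing import List

ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")

# The journal lines quoted above.
journal = """\
Sep 01 16:09:13 node01 boinc[2870]: mv: cannot stat 'slots/5/result_silent.out': No such file or directory
Sep 01 16:09:13 node01 boinc[570]: 01-Sep-2020 16:09:13 [World Community Grid] Computation for task MIP1_00317797_0919_0 finished
Sep 01 16:09:13 node01 boinc[570]: 01-Sep-2020 16:09:13 [World Community Grid] Output file MIP1_00317797_0919_0_r2140855931_0 for task MIP1_00317797_0919_0 absent
"""


def tasks_with_absent_output(text: str) -> List[str]:
    """Return the sorted, de-duplicated task names whose output file
    was reported absent."""
    return sorted({match.group(2) for match in ABSENT.finditer(text)})


print(tasks_with_absent_output(journal))
```

Feeding it the journal text from each node would give one list of failed tasks per machine to compare.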

The coredumps look like this, and are identical (excepting WU name, PID, and other expected variance) on all machines:

$ sudo coredumpctl info
PID: 2763 (wcgrid_mip1_ros)
UID: 977 (boinc)
GID: 977 (boinc)
Signal: 11 (SEGV)
Timestamp: Tue 2020-09-01 16:09:12 MDT (19min ago)
Command Line: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00317797.flags -out::file::silent result_silent.out -run:jran 860996629 -nstruct 10 -out::level 100 -run::no_scorefile true
Executable: /var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu
Control Group: /system.slice/boinc-client.service
Unit: boinc-client.service
Slice: system.slice
Boot ID: 9bd0990d76f24403afbf64cc28c6d0d6
Machine ID: 97712c2452e24296991d6e0957c002af
Hostname: node01
Storage: /var/lib/systemd/coredump/core.wcgrid_mip1_ros.977.9bd0990d76f24403afbf64cc28c6d0d6.2763.1598998152000000000000.zst
Message: Process 2763 (wcgrid_mip1_ros) of user 977 dumped core.

Stack trace of thread 2763:
#0 0x0000000000000000 n/a (n/a + 0x0)

And that's also not terribly useful.

I'm not sure how to get more debugging info.

  • The processes are dying after tens of minutes of execution, so it doesn't appear to be a library issue.
  • Finding a slot with a live MIP1 process and tailing its stderr.txt file doesn't provide anything useful; nothing gets dumped there during the SEGV.
  • I know that SEGVs frequently point to memory issues, but IMO that's a poor fit for the data since this is a specific executable crashing, and the issue clearly started after a software update, at the same point in time, on six machines.

For now I have opted out of MIP work. I'll try again when there's a new kernel update -- which is just me grasping at straws. If anyone has any thoughts on ways to get better troubleshooting info I'm certainly game to try.
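Since the crashes began right after the 29 August package update, one place to start is listing exactly what was upgraded that day. The sketch below assumes the usual pacman log line format ("[timestamp] [ALPM] upgraded name (old -> new)") and the usual location /var/log/pacman.log; the sample lines, package names and versions are invented for illustration and say nothing about which package is actually at fault.

```python
import re
from typing import List, Tuple

# Assumed pacman log format: [timestamp] [ALPM] upgraded name (old -> new)
UPGRADED = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2})[T ][^\]]*\] \[ALPM\] upgraded (\S+) \((\S+) -> (\S+)\)"
)


def upgrades_on(log_text: str, day: str) -> List[Tuple[str, str, str]]:
    """Return (package, old version, new version) for upgrades logged
    on the given day, written as YYYY-MM-DD."""
    found = []
    for line in log_text.splitlines():
        match = UPGRADED.match(line)
        if match and match.group(1) == day:
            found.append((match.group(2), match.group(3), match.group(4)))
    return found


# Hypothetical log excerpt; on a real node, read /var/log/pacman.log instead.
sample = (
    "[2020-08-29T09:15:02-0600] [ALPM] upgraded examplepkg (1.0-1 -> 1.1-1)\n"
    "[2020-08-29T09:15:03-0600] [ALPM] installed otherpkg (2.0-1)\n"
    "[2020-08-30T10:00:00-0600] [ALPM] upgraded laterpkg (3.0-1 -> 3.0-2)\n"
)
print(upgrades_on(sample, "2020-08-29"))
```

Downgrading the listed packages one at a time on a single node, then re-enabling MIP1 there, would be a slower but more direct test than waiting for the next kernel update.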
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by mdxi at Sep 1, 2020 10:46:46 PM]
[Sep 1, 2020 10:46:01 PM]