World Community Grid Forums
Thread Status: Active Total posts in this thread: 70
Author |
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches), error out pretty much right away with errors like this:

Result Log
Result Name: MIP1_ 00209074_ 0717_ 0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)
</message>
<stderr_txt>
[2019- 7-20 1:52:16:] :: BOINC:: Initializing ... ok.
[2019- 7-20 2: 4:25:][2019- 7-20 2:40:24:] :: BOINC:: Initializing ... ok.
[2019- 7-20 2:40:25:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00209074.flags -out::file::silent result_silent.out -run:jran 1549787703 -nstruct 14 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: ERROR: Comments in an option file must begin with '#', options must begin with '-' the line: is incorrectly formatted
</stderr_txt>
]]>

... and then there are WUs that error out after more than an hour of runtime:

Result Log
Result Name: MIP1_ 00208869_ 0254_ 0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2019- 7-19 14:22:59:] :: BOINC:: Initializing ... ok.
[2019- 7-19 14:22:59:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00208869.flags -out::file::silent result_silent.out -run:jran 1669075901 -nstruct 19 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001
Finished _0001 in 508.891 seconds.
Starting work on structure: _0002
Finished _0002 in 499.702 seconds.
Starting work on structure: _0003
Finished _0003 in 362.968 seconds.
Starting work on structure: _0004
Finished _0004 in 294.592 seconds.
Starting work on structure: _0005
Finished _0005 in 334.232 seconds.
Starting work on structure: _0006
Finished _0006 in 439.174 seconds.
Starting work on structure: _0007
Finished _0007 in 366.025 seconds.
Starting work on structure: _0008
Finished _0008 in 346.353 seconds.
Starting work on structure: _0009
Finished _0009 in 417.646 seconds.
Starting work on structure: _0010
Finished _0010 in 292.767 seconds.
Starting work on structure: _0011
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01CBE547 read attempt to address 0x04B403AC
Engaging BOINC Windows Runtime Debugger...
... <lengthy debug output omitted> ...

The latter are much more of a concern for me, as they add up to a serious amount of wasted run time...

And yes, those same hosts return other WUs, from MIP1 or other WCG projects, just fine, and it is not restricted to a small number of hosts but occurs pretty much across the whole range of hosts that I have running WCG (50 right now), also with various BOINC clients...

Ralf |
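A note on reading the two exit codes in the logs above: the decimal and hex forms are the same 32-bit value, once signed and once unsigned. The sketch below shows that conversion. The helper names (`to_unsigned32`, `describe`, `KNOWN_STATUS`) are made up for illustration and are not part of BOINC; the only mapping included is 0xC0000005, which Windows defines as STATUS_ACCESS_VIOLATION.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return exit_code & 0xFFFFFFFF


# Illustrative lookup table (hypothetical name); only one well-known
# Windows NTSTATUS value is listed, everything else stays "unknown".
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(exit_code: int) -> str:
    """Format an exit code the way the result log shows it, plus a name if known."""
    value = to_unsigned32(exit_code)
    name = KNOWN_STATUS.get(value, "unknown")
    return f"{exit_code} -> 0x{value:08x} ({name})"


if __name__ == "__main__":
    print(describe(-1073741819))  # the late crash from the second log
    print(describe(-1))           # the early option-file failure
```

So the -1 from the first log is a generic failure code from the application (the option-file error), while -1073741819 is the access violation reported in the exception record of the second log.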
||
|
Jean-David Beyer
Senior Cruncher USA Joined: Oct 2, 2007 Post Count: 335 Status: Offline |
"Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches) error out pretty much right away ... and then there are WUs that error out after more than an hour of runtime ..."

I have not noticed this. My most recent work units have all come in as valid. I am running on a RHEL6 Linux machine with a four-core Intel Xeon processor (64-bit). |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7655 Status: Offline |
"Most recently, I see a lot of WUs, more or less randomly (have not checked against certain batches) error out pretty much right away with errors like this:
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01CBE547 read attempt to address 0x04B403AC"

I have also seen this error. It has occurred on my Q6600, which runs all of the other projects just fine, but about half of the MIPS units ended in an error. I only ran the MIPS units in the interval between Zika's pause and the start of SCC again. That machine only has 2 GB of memory. I think MIPS units make such heavy use of memory that there was not enough memory to avoid conflicts. I also had MCM running, and I think if there were 2 or 3 MCM running and 1 or 2 MIPS running, the units would finish OK. But if more than 3 or 4 MIPS were running at any one time, they would error. Once I saw how many MIPS units were not completing satisfactorily, I ceased running MIPS and just stuck with MCM until SCC resumed. So, my theory is that some type of memory problem is causing the errors for you. Good luck.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
"That machine only has 2 gb of memory. I think MIPS units make such strong use of memory that there was not enough memory to avoid conflicts."

These errors occur on 4 core/thread PCs with 8 GB of RAM, so I think that RAM isn't really an issue here...
I am likely to run MIP1 only for a couple more days, until I get my 50y Diamond badge, and then run (beside the elusive HST) SCC and FAH2 from that point on...
Ralf
[Edit 1 times, last edit by TPCBF at Jul 22, 2019 3:54:52 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
RAM should not be the issue: a memory shortage gets reported in the event log AND the affected task(s) are stopped with a "waiting for memory" message until enough free RAM is available again. If it could be proven to be a memory availability problem, it's most likely a BOINC bug worth reporting on GitHub.
[Edit 1 times, last edit by Former Member at Jul 22, 2019 12:13:59 PM] |
||
|
Acid303
Cruncher Germany Joined: Nov 16, 2011 Post Count: 12 Status: Offline |
Same problem here. Updating to BOINC 7.14.2 fixes the issue. |
||
|
BoincST
Cruncher Joined: Feb 25, 2010 Post Count: 12 Status: Offline |
Lately all WUs error out on my computers after rebooting.

Result Log
Result Name: MIP1_ 00228254_ 1531_ 0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2019- 9-26 16:37:33:] :: BOINC:: Initializing ... ok.
[2019- 9-26 16:37:33:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00228254.flags -out::file::silent result_silent.out -run:jran 2014355682 -nstruct 11 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01C7D380 read attempt to address 0x87FFFFFC
Engaging BOINC Windows Runtime Debugger...

The bold printed reason is always the same.
[Edit 1 times, last edit by BoincST at Sep 27, 2019 7:28:01 AM] |
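To check whether the reason really is the same across many failed results, the exception record can be pulled out of each saved stderr text and tallied. This is a minimal sketch, assuming the record has the shape seen in the logs in this thread. The function names are hypothetical, and the `write` alternative in the pattern is an assumption, since only `read` appears here.

```python
import re
from collections import Counter

# Matches lines such as:
#   Reason: Access Violation (0xc0000005) at address 0x01C7D380
#   read attempt to address 0x87FFFFFC
# The "write" alternative is an assumption; only "read" occurs in this thread.
EXCEPTION_RE = re.compile(
    r"Reason:\s*(?P<reason>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)\s*"
    r"at address (?P<pc>0x[0-9a-fA-F]+)"
    r"(?:\s+(?P<access>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+))?"
)


def parse_exception(stderr_text):
    """Return the exception record fields as a dict, or None if there is none."""
    match = EXCEPTION_RE.search(stderr_text)
    return match.groupdict() if match else None


def tally_reasons(stderr_texts):
    """Count how often each (reason, code) pair occurs across several logs."""
    counts = Counter()
    for text in stderr_texts:
        record = parse_exception(text)
        if record is not None:
            counts[(record["reason"], record["code"])] += 1
    return counts


if __name__ == "__main__":
    samples = [
        "Reason: Access Violation (0xc0000005) at address 0x01C7D380 "
        "read attempt to address 0x87FFFFFC",
        "Reason: Access Violation (0xc0000005) at address 0x01CBE547 "
        "read attempt to address 0x04B403AC",
    ]
    for (reason, code), count in tally_reasons(samples).items():
        print(f"{reason} {code}: {count}")
```

Keeping the faulting address and the target address as separate fields also makes it easy to see whether the crash location repeats from one result to the next.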
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7655 Status: Offline |
First: Reboot
If that does not cure it, then
Second: Do a memory check
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
GridWatcher
Cruncher Joined: Jul 9, 2007 Post Count: 6 Status: Offline |
Hello everyone. I recently added a Win10 box (Q6600 w/8 GB RAM) into my mix. Not the best processor, but a workhorse that someone was decommissioning and was going to throw out. I set it to work on projects other than the Covid and African weather ones, as it was showing long run times on those.
Anyway, I've noticed in the last week or two that the box has been throwing quite a few error results. Not all WUs have errored out, but more than I am used to. I've made sure it's all patched up and free of any malware/stuff that might take up resources. So far, no issues. The errors seem to be similar to the errors in the last few posts. I was wondering if anyone has figured out what the issue was...

Result Name: MIP1_ 00302712_ 0688_ 0--
<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
[2020- 6-11 22: 2: 4:] :: BOINC:: Initializing ... ok.
[2020- 6-11 22: 2: 4:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_windows_intelx86 -in::file::zip MIP1_databasev2.zip @./MIP1_00302712.flags -out::file::silent result_silent.out -run:jran 982596290 -nstruct 5 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0001
Finished _0001 in 1209.13 seconds.
Starting work on structure: _0002
Finished _0002 in 1771.3 seconds.
Starting work on structure: _0003
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01D0E547 read attempt to address 0x07B043AC |
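To put a number on how much run time a result like this burns before it crashes, the "Finished _NNNN in X seconds" lines can be summed. The sketch below is a rough estimate only: it counts the completed structures and ignores the time spent on the structure that was in progress when the exception hit. The function name is hypothetical.

```python
import re

# One match per completed structure, e.g. "Finished _0002 in 1771.3 seconds."
FINISHED_RE = re.compile(r"Finished\s+(_\d+)\s+in\s+(\d+(?:\.\d+)?)\s+seconds")


def completed_seconds(stderr_text):
    """Sum the reported run time of every structure that finished before the crash."""
    return sum(float(seconds) for _name, seconds in FINISHED_RE.findall(stderr_text))


if __name__ == "__main__":
    # Excerpt from the result log above.
    log = (
        "Starting work on structure: _0001 Finished _0001 in 1209.13 seconds. "
        "Starting work on structure: _0002 Finished _0002 in 1771.3 seconds. "
        "Starting work on structure: _0003 Unhandled Exception Detected..."
    )
    total = completed_seconds(log)
    print(f"{total:.2f} s completed before the crash ({total / 60:.1f} min)")
```

Run over a folder of saved result logs, the same function gives a per-host total, which makes it easier to compare machines or BOINC client versions.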
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline |
After a routine update of OS packages (Arch Linux) on 29 August, MIP1 -- and only MIP1 -- WUs are segfaulting on all my nodes.
The logs don't show much useful info:
The coredumps look like this, and are identical (excepting WU name, PID, and other expected variance) on all machines:
And that's also not terribly useful. I'm not sure how to get more debugging info.
For now I have opted out of MIP work. I'll try again when there's a new kernel update -- which is just me grasping at straws. If anyone has any thoughts on ways to get better troubleshooting info I'm certainly game to try.
[Edit 1 times, last edit by mdxi at Sep 1, 2020 10:46:46 PM] |
||
|