Thread Status: Active
Total posts in this thread: 21
Posts: 21   Pages: 3   [ 1 2 3 | Next Page ]
This topic has been viewed 3154 times and has 20 replies
Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline
200 results showing 'ERROR' strange for a machine that does not do errors!

Here is a list of the WUs that errored. Running on an Ubuntu Server 12.04 machine.
E224185_ 235_ I.64.C50H26N6O8.00395927.2.set1d06_ 3-- MegaCruncher Error 31/07/14 20:14:45 01/08/14 15:40:12 7.29 / 7.73 147.9 / 0.0
E223908_ 111_ K.23.C19FH13OSSi.01207739.2.set1d06_ 3-- MegaCruncher Error 31/07/14 20:14:45 01/08/14 15:40:12 7.09 / 7.73 147.9 / 0.0
E224149_ 410_ I.63.C47H24N6O9S.00071083.3.set1d06_ 5-- MegaCruncher Error 31/07/14 20:10:20 01/08/14 15:40:12 7.39 / 7.80 149.3 / 0.0
E224229_ 407_ I.65.C52H30N6O6S.00127412.1.set1d06_ 0-- MegaCruncher Error 31/07/14 19:09:04 01/08/14 15:40:12 0.56 / 0.65 12.4 / 0.0
E224229_ 329_ I.65.C52H30N6O6S.00037551.2.set1d06_ 0-- MegaCruncher Error 31/07/14 19:06:57 01/08/14 15:40:12 0.73 / 0.83 15.9 / 0.0
E224229_ 086_ I.65.C55H30N6O3S.00049249.0.set1d06_ 0-- MegaCruncher Error 31/07/14 19:04:47 01/08/14 15:40:12 0.85 / 0.91 17.4 / 0.0
E224228_ 451_ I.68.C50F4H22N6O8.00085443.4.set1d06_ 0-- MegaCruncher Error 31/07/14 19:02:37 01/08/14 15:40:12 0.99 / 1.09 20.8 / 0.0
E224228_ 121_ I.68.C55F6H22N6O.00414950.1.set1d06_ 0-- MegaCruncher Error 31/07/14 19:00:25 01/08/14 15:40:12 1.12 / 1.22 23.3 / 0.0
E224228_ 904_ I.65.C55H30N6O3S.00116383.3.set1d06_ 0-- MegaCruncher Error 31/07/14 18:58:14 01/08/14 15:40:12 1.18 / 1.24 23.7 / 0.0
E224228_ 684_ I.65.C50H30N8O6S.00210505.1.set1d06_ 0-- MegaCruncher Error 31/07/14 18:56:06 01/08/14 15:40:12 1.14 / 1.24 23.7 / 0.0
E224228_ 954_ I.66.C52F6H28N6O2.00299723.2.set1d06_ 0-- MegaCruncher Error 31/07/14 18:53:55 01/08/14 15:40:12 1.10 / 1.25 23.9 / 0.0
E224226_ 255_ I.68.C54F4H22N6O4.00055472.2.set1d06_ 1-- MegaCruncher Error 31/07/14 18:51:45 01/08/14 15:40:12 1.19 / 1.27 24.4 / 0.0
E224171_ 663_ I.62.C50H30N6O6.00218860.4.set1d06_ 5-- Huntress Error 31/07/14 18:51:05 31/07/14 21:41:58 0.87 / 0.99 22.5 / 0.0
E224228_ 243_ I.67.C54H24N8O4S.00352173.0.set1d06_ 0-- MegaCruncher Error 31/07/14 18:49:34 01/08/14 15:40:12 1.20 / 1.28 24.6 / 0.0
E224228_ 775_ I.66.C56H28N8O2.00260202.3.set1d06_ 1-- MegaCruncher Error 31/07/14 18:47:25 01/08/14 15:40:12 1.16 / 1.30 24.8 / 0.0


Result Log

Result Name: E224185_ 235_ I.64.C50H26N6O8.00395927.2.set1d06_ 3--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[21:16:15] Number of jobs = 16
[21:16:15] Starting job 0,CPU time has been restored to 0.000000.
[21:16:15] Starting new Job
[21:16:15] Qink name = fldman
[21:16:17] Qink name = gesman
[21:16:17] Qink name = scfman
[21:42:33] Qink name = anlman
[21:43:16] End of Job
[21:43:17] Finished Job #0
[21:43:17] Starting job 1,CPU time has been restored to 1505.040000.
[21:43:17] Starting new Job
[21:43:17] Qink name = fldman
[21:43:24] Qink name = gesman
[21:43:25] Qink name = scfman
[22:36:38] Qink name = anlman
[22:59:38] End of Job
[22:59:40] Finished Job #1
[22:59:40] Starting job 2,CPU time has been restored to 5795.024000.
[22:59:41] Starting new Job
[22:59:41] Qink name = fldman
[22:59:47] Qink name = gesman
[22:59:48] Qink name = scfman
[23:41:17] Qink name = anlman
[23:41:18] Qink name = drvman
[23:51:01] Qink name = optman
[23:51:04] Qink name = fldman
[23:51:04] Qink name = gesman
[23:51:13] Qink name = scfman
[00:58:22] Qink name = anlman
[00:58:22] Qink name = drvman
[01:07:49] Qink name = optman
[01:07:51] Qink name = fldman
[01:07:51] Qink name = gesman
[01:07:59] Qink name = scfman
[02:15:31] Qink name = anlman
[02:15:31] Qink name = drvman
[02:25:24] Qink name = optman
[02:25:28] Qink name = fldman
[02:25:28] Qink name = gesman
[02:25:39] Qink name = scfman
[03:33:02] Qink name = anlman
[03:33:06] Qink name = drvman
[03:43:19] Qink name = optman
[03:43:21] Qink name = fldman
[03:43:21] Qink name = gesman
[03:43:28] Qink name = scfman
[04:45:44] Qink name = anlman
[04:45:51] Qink name = drvman
[04:55:47] Qink name = optman
[04:55:51] Qink name = fldman
[04:55:51] Qink name = gesman
[04:56:00] Qink name = scfman

</stderr_txt>
]]>
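
For anyone reading a log like the one above, here is a minimal Python sketch that adds up the wall-clock time between timestamped lines for each Qink module. The function name `module_durations` is made up for illustration. It assumes every stage ends at the next timestamped line, and that a timestamp earlier than the previous one means the clock passed midnight (the log carries no dates). The final module before the crash has no following line, so it is not counted.

```python
import re
from collections import defaultdict

# Matches lines such as "[21:16:17] Qink name = scfman" or "[21:43:16] End of Job".
STAMP_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")
QINK_RE = re.compile(r"^Qink name = (\w+)$")

def module_durations(log_text):
    """Return {module name: total seconds} summed over the whole log."""
    events = []  # (seconds since midnight, module name or None)
    for line in log_text.splitlines():
        stamp = STAMP_RE.match(line.strip())
        if not stamp:
            continue  # untimestamped lines (e.g. "INFO: ...") are ignored
        hours, minutes, seconds, rest = stamp.groups()
        when = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        qink = QINK_RE.match(rest.strip())
        events.append((when, qink.group(1) if qink else None))

    totals = defaultdict(int)
    for (start, name), (end, _next_name) in zip(events, events[1:]):
        if name is not None:
            # Assumption: a negative gap means the clock rolled past midnight.
            totals[name] += (end - start) % 86400
    return dict(totals)

if __name__ == "__main__":
    sample = "\n".join([
        "[21:16:15] Qink name = fldman",
        "[21:16:17] Qink name = gesman",
        "[21:16:17] Qink name = scfman",
        "[21:42:33] Qink name = anlman",
        "[21:43:16] End of Job",
    ])
    for module, secs in module_durations(sample).items():
        print(module, secs)
```

Run against the full stderr text, this should make it easier to see that almost all of the time goes into the scfman stage, which is also where this result stopped.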
----------------------------------------
Mamajuanauk is the Name! Crunching is the Game!



----------------------------------------
[Edit 1 times, last edit by Mamajuanauk at Aug 1, 2014 4:10:28 PM]
[Aug 1, 2014 4:07:05 PM]
captainjack
Advanced Cruncher
Joined: Apr 14, 2008
Post Count: 144
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

Welcome to the club. The Signal 11 error has been around for a while on Linux systems. It happens when BOINC gets lost for a bit then thinks something is wrong with the system and starts cancelling jobs. Sometimes it happens when the CPU gets really busy with something else and BOINC doesn't get any attention. The recommended solution for that problem is to go into Device Settings and set the option for "Suspend work if CPU usage is above" to 35%.

I have also seen it happen when BOINC gets lost on the Internet trying to talk to the host service (WCG in this case). I used to see these when there were DHCP problems on my ISP. Not much we could do about that one.

The other thing you might try is to upgrade to the latest version of Ubuntu. I'm running 14.04 now and haven't seen a Signal 11 error in quite a while.

Hope that helps.
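
As a side note for reading the "process got signal 11" line in the result log: on Linux, signal number 11 is SIGSEGV (segmentation fault). The short Python sketch below just looks the number up with the standard `signal` module. The helper name `describe_signal` is made up for illustration, and the lookup says nothing about what caused the fault.

```python
import signal

def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return "unknown"

if __name__ == "__main__":
    print(describe_signal(11))
```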
[Aug 1, 2014 4:34:28 PM]
Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

Many thanks, Jack. The system only runs WCG, nothing else. I will try upgrading though...
----------------------------------------
Mamajuanauk is the Name! Crunching is the Game!



[Aug 1, 2014 4:56:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

Were you not there before: too many CEP2 tasks running and trying to start concurrently? Storage access is the bottleneck, and it is much less pronounced on Windows. Once again this highlights the need for staggered starting and resuming, a Trac development ticket raised by keithing reed over a year ago.
[Aug 1, 2014 5:18:29 PM]
Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

Yep, I was there some time ago; however, things have been running smoothly for ages. Strange that this problem repeats just when some new libraries have hit the grid...
----------------------------------------
Mamajuanauk is the Name! Crunching is the Game!



[Aug 1, 2014 6:37:23 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

To add, the tasks at this time are rather short, so the other weakness, the NIC/Wi-Fi/zipping load, comes into play as well. We are measuring a 2-hour average now, when last week we were doing close to 6 hours per task. There's that critical job #2, and the very taxing setups just before it, to top it off. The bell tolls, and no one at WCG is hearing it; it's that second monkey of 4 from the Far East, this one

[Aug 1, 2014 6:45:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

With regards to the length of time that the jobs run, the jobs basically scale with the square of the number of electrons (i.e. doubling the size will result in a job that takes 4 times as long). Yes, the initial jobs in this library are short, but they will get longer very quickly, and we are looking to place the majority of jobs we send out in a range which is 'grid friendly'. The reason for the fewer number of jobs per work unit is to allow us to crunch larger, more exciting molecules - which, after all, is the name of the game!
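
The square-law scaling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The function names and the example figures (a 2-hour job on a 100-electron molecule) are made up, and real run times will also depend on the host.

```python
def relative_runtime(electron_ratio, exponent=2.0):
    """Runtime multiplier when the electron count grows by electron_ratio.

    The exponent of 2 follows the "square of the number of electrons"
    rule of thumb quoted in the post.
    """
    return electron_ratio ** exponent

def estimate_hours(base_hours, base_electrons, new_electrons):
    """Scale a known runtime to a molecule with a different electron count."""
    return base_hours * relative_runtime(new_electrons / base_electrons)

if __name__ == "__main__":
    # Hypothetical figures: doubling the electron count of a 2-hour job.
    print(relative_runtime(2))
    print(estimate_hours(2.0, 100, 200))
```

With these made-up numbers, doubling the size gives a multiplier of 4, so a 2-hour job becomes roughly an 8-hour job.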

Your Harvard CEP Team
[Aug 2, 2014 1:14:49 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

We are aware of the issues and are working on a solution. Please monitor this thread for updates:
https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,37056

Thanks,
armstrdj
[Aug 2, 2014 5:41:50 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

I am working on setting all results that came back with this error for recheck. Please be patient, as this could take some time to clean up. I will post when we believe all work units have been rechecked with the fixed-up validator.

Thanks,
-Uplinger
[Aug 2, 2014 6:38:48 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: 200 results showing 'ERROR' strange for a machine that does not do errors!

We would like to give huge props to the great guys at IBM for sorting this out so quickly - you guys are amazing!
Your Harvard CEP Team
[Aug 2, 2014 9:16:23 PM]