World Community Grid Forums
Category: Completed Research
Forum: The Clean Energy Project - Phase 2 Forum
Thread: "Computation error" with linux kernel > 3.6 ?
Thread Status: Active Total posts in this thread: 29
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Someone posted a fix on the WCG forums [which I could not get to work myself on Ubuntu] that entailed forcing 127.0.0.1 traffic the roundabout way to stop those crashes [the hosts file was involved in that]. It's inherently a BOINC core client issue where the network part of BOINC goes berserk when it can't find the router, and it is mostly a Wi-Fi related issue. No issue with Ethernet.
|
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7219 Status: Offline |
Quote: Someone posted a fix on the WCG forums [which I could not get to work myself on Ubuntu] that entailed forcing 127.0.0.1 traffic the roundabout way to stop those crashes [the hosts file was involved in that]. It's inherently a BOINC core client issue where the network part of BOINC goes berserk when it can't find the router, and it is mostly a Wi-Fi related issue. No issue with Ethernet.
From my experience it is definitely a Wi-Fi issue. I have never had it on Ethernet-connected devices.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7219 Status: Offline |
Quote: Another thing that can cause the signal 11 error is if BOINC attempts to access the Internet and the connection is down. Any running WUs of certain kinds will fail when that happens.
I may have found the problem, but will need to test. I have 7 machines with 26 cores all going through a range extender to my wireless router (g). I did not have any trouble with this until I put the last machine on. I think I may have overloaded the connection from time to time; I may just be trying to cram too much information down too small a pipe. I could upgrade to wireless N, or I could figure out a way for the machines to all take turns, so to speak. For the time being I am doing the downloads for one machine, an 8 core, manually once or twice a day. So far so good. I will be in search of a more elegant solution.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
We're talking Linux/Ubuntu of course. Never have I seen this on Windows.
The workaround, if you think it's congestion load, is to set an offset connect interval. The Ubuntu instance gets 30 minutes, 1 hour before EOD [On v7 this automatically forces reporting, clearing Ready to Report (RtR) too **]. Set the other machines to scheduled networking for different time segments.
** As of the next version, above 7.0.39, the maximum RtR that can build up before clearing will be set to 64. This is with e.g. the short-task GPU crunchers in mind. Those who manage a machine running 12 concurrent on a 7950 could reach that within the hour. A smaller scheduler hit is the server-side benefit. |
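As a rough check on the "within the hour" remark, here is a minimal Python sketch of the arithmetic. It assumes tasks finish at a steady rate and that nothing is reported in between; the function names and the task length are hypothetical, and only the cap of 64 and the 12 concurrent tasks come from the post.

```python
def minutes_to_fill_rtr(rtr_cap: int, concurrent: int, task_minutes: float) -> float:
    """Minutes until rtr_cap finished tasks pile up, assuming a steady
    completion rate and no reporting in between (a simplification)."""
    return rtr_cap * task_minutes / concurrent


def max_task_minutes(rtr_cap: int, concurrent: int, window_minutes: float = 60.0) -> float:
    """Longest task runtime (minutes) at which the cap is still reached
    within window_minutes."""
    return window_minutes * concurrent / rtr_cap


# Values from the post: cap of 64 RtR, 12 tasks running concurrently.
limit = max_task_minutes(64, 12)
print(f"Cap is reached within the hour if tasks take at most {limit} minutes each")
print(f"Time to reach the cap at that length: {minutes_to_fill_rtr(64, 12, limit)} minutes")
```

Under these assumptions, 12 concurrent tasks reach 64 Ready to Report within an hour when each task runs for about 11 minutes or less.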
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7219 Status: Offline |
Quote: We're talking Linux/Ubuntu of course. Never have I seen this on Windows.
Yes, Linux Mint 12. Thanks.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7219 Status: Offline |
Quote: We're talking Linux/Ubuntu of course. Never have I seen this on Windows. The workaround, if you think it's congestion load, is to set an offset connect interval. The Ubuntu instance gets 30 minutes, 1 hour before EOD [On v7 this automatically forces reporting, clearing Ready to Report (RtR) too **]. Set the other machines to scheduled networking for different time segments. ** As of the next version, above 7.0.39, the maximum RtR that can build up before clearing will be set to 64. This is with e.g. the short-task GPU crunchers in mind. Those who manage a machine running 12 concurrent on a 7950 could reach that within the hour. A smaller scheduler hit is the server-side benefit.
I set the two 8 core machines to separate connection times, one from 00:01 to 11:00 and the other from 13:00 to 23:00. I gave each one a 1 day queue. So far this seems to have solved the problem, so I am guessing it was congestion related. It did not affect any of the other Linux systems, but they are all quite a bit slower. It would be a nice feature if it were possible to set more than one window per machine, but this works, so I should probably be satisfied.
Cheers
Sgt. Joe
*Minnesota Crunchers* [Edit 1 times, last edit by Sgt.Joe at Dec 4, 2012 7:47:56 PM] |
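The two connection windows described in the post above can be sanity-checked with a small Python sketch. This is only an illustration, not a BOINC feature: the helper names are hypothetical, and it assumes each window starts and ends on the same day (no wrap past midnight).

```python
from datetime import time


def to_minutes(t: time) -> int:
    """Minutes since midnight."""
    return t.hour * 60 + t.minute


def windows_overlap(a, b) -> bool:
    """True if two same-day (start, end) windows share any time."""
    a_start, a_end = to_minutes(a[0]), to_minutes(a[1])
    b_start, b_end = to_minutes(b[0]), to_minutes(b[1])
    return a_start < b_end and b_start < a_end


def gap_minutes(a, b) -> int:
    """Idle minutes between the earlier window's end and the later one's start."""
    first, second = sorted([a, b], key=lambda w: to_minutes(w[0]))
    return to_minutes(second[0]) - to_minutes(first[1])


# Windows from the post: one 8 core machine per window.
machine_a = (time(0, 1), time(11, 0))
machine_b = (time(13, 0), time(23, 0))

print("overlap:", windows_overlap(machine_a, machine_b))
print("gap in minutes:", gap_minutes(machine_a, machine_b))
```

With the times from the post the windows do not overlap and leave a two-hour gap between them, so the two machines never use the wireless link for scheduled networking at the same time.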
||
|
B2I
Senior Cruncher usa Joined: Jan 23, 2011 Post Count: 232 Status: Offline |
I had an issue like this on my Ubuntu 12.04 machine. I know this is not a pleasing cure, but when I mixed in a lot of different WUs so the computer was only running 2 or 3 CEP WUs at a time, all the "computational errors" stopped.
|
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7219 Status: Offline |
Quote: I had an issue like this on my Ubuntu 12.04 machine. I know this is not a pleasing cure, but when I mixed in a lot of different WUs so the computer was only running 2 or 3 CEP WUs at a time, all the "computational errors" stopped.
I have one of the octo machines doing 2 CEP2 units and the rest HCC, and the other one doing FAAH and HFCC. For me, I do not believe the issue was CEP2; the issue was overloading my wireless connection. It did not happen until I added the second octo machine to the mix. I am glad you were able to cure your problem. Evidently there are a number of things which may be the cause.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
litimetal
Cruncher Joined: Jan 9, 2011 Post Count: 5 Status: Offline |
I got a similar problem on Clean Water Project (I'm using Fedora).
https://secure.worldcommunitygrid.org/forums/...ead_thread,34533_lastpage |
||
|
|