World Community Grid Forums
Thread Status: Active Total posts in this thread: 49
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline
http://www.worldcommunitygrid.org/forums/wcg/...33012_lastpage,yes#373862
There is a known issue with CFSW workunits on Linux potentially becoming stuck. Please see the known issues post linked above for details. While we investigate, we are limiting the number of CFSW workunits sent to Linux computers to 1 per computer.
Thanks, armstrdj
(edit: Changed to resolved -Uplinger) [Edit 1 times, last edit by uplinger at May 10, 2012 8:28:04 PM]
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline
Please note that these work units will show as Running in the BOINC Manager. To check whether they are stalled, you will need to look at the system load or CPU usage. An easy way to do this is to open a terminal and type "top". This shows the active processes on the computer and their CPU usage. If your computer has 4 cores and 4 wcgrid_* processes are running at the top, then all is well. If fewer than 4 wcgrid_* processes are running but you expect to see more, then there is an issue. As armstrdj mentioned in the known issues post, you should be able to turn off Leave Application In Memory (LAIM), suspend the work units, and then resume them after about 1 minute.
Note that there are other ways to find stuck work units, but this should be the most common method across Linux distros. Thanks, -Uplinger
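The check described above (count the wcgrid_* processes that are actually using CPU in top, then compare with the number of cores) can be scripted. A minimal Python sketch, assuming top's default column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND); the function names, the 50% CPU threshold for "busy", and the `wcgrid_` prefix default are illustrative assumptions, not anything WCG or BOINC provides.

```python
import os
from typing import Iterable, Optional, Tuple


def count_busy_processes(top_lines: Iterable[str],
                         prefixes: Tuple[str, ...] = ("wcgrid_",),
                         min_cpu: float = 50.0) -> int:
    """Count process rows in `top` output whose COMMAND starts with one of
    `prefixes` and whose %CPU is at least `min_cpu` (assumed threshold)."""
    busy = 0
    for line in top_lines:
        fields = line.split()
        # Assumed default layout: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND.
        # Header and summary lines are skipped: process rows have 12+ fields and a numeric PID.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            cpu = float(fields[8])
        except ValueError:
            continue  # e.g. a locale that prints "99,7"
        if fields[11].startswith(prefixes) and cpu >= min_cpu:
            busy += 1
    return busy


def looks_stalled(top_text: str,
                  expected: Optional[int] = None,
                  prefixes: Tuple[str, ...] = ("wcgrid_",),
                  min_cpu: float = 50.0) -> Tuple[int, bool]:
    """Return (busy_count, stalled) where stalled means fewer busy science
    processes than you expect to see (default: one per logical CPU)."""
    if expected is None:
        expected = os.cpu_count() or 1
    busy = count_busy_processes(top_text.splitlines(), prefixes, min_cpu)
    return busy, busy < expected
```

One way to feed it is to save a single batch-mode screen with `top -b -n 1 > top.txt` and call `looks_stalled(open("top.txt").read(), expected=4)`. Set `expected` to the number of cores you let BOINC use, and pass `prefixes=("wcg",)` if other WCG apps on the machine use names such as `wcg_*`.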
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I haven't run into this issue (as yet), running either the beta or the production WUs, 4 at a time in all cases. I don't know if this will help, but my only 4-core machine is a Phenom II 3.0 GHz with 4 GB RAM running Fedora 16. Below is the top output (snipped):

      PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    23167 roger  39 19 61000  52m 1768 R 99.7  1.3  12:46.88 wcg_c4cw_lmps_6
    23172 roger  39 19 61000  52m 1764 R 99.7  1.3   9:53.84 wcg_c4cw_lmps_6
     2336 roger  39 19  236m 235m 1540 R 99.4  6.0 115:48.91 wcgrid_cfsw_bay
    23015 roger  39 19  236m 235m 1532 R 98.7  5.9  77:43.90 wcgrid_cfsw_bay

Added: Well, that wasn't much help... I ran out of C4SW WUs. [Edit 1 times, last edit by Former Member at Apr 19, 2012 6:31:46 PM]
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline
Starbase, I'm happy you have not experienced the issue. We are hoping the problem is not widespread, and we are currently working towards a fix.
Thanks, -Uplinger
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hello uplinger.
For a stuck CFSW WU under Linux, just before resuming, do we turn LAIM back on, or leave it turned off? Also:
1] Is it 1 WU per computer regardless of the number of (true) cores of that computer?
2] What is the rule for issuing fresh WUs: quality-based or time-based? That is, is 1 WU issued after the earlier WU was done and subsequently validated without any issue, or 1 WU per 24 hrs regardless of
3] Will machines that have crunched CFSW WUs without any issue thus far still be subject to the 1-WU rule?
Thanks ; [Edit 2 times, last edit by Former Member at Apr 19, 2012 7:18:26 PM]
XSmeagolX
Senior Cruncher Joined: Nov 12, 2009 Post Count: 444 Status: Offline
Is this the reason why no Linux WUs are being sent out at the moment?
Some members of my team are receiving "World Community Grid 19-04-2012 18:24 No tasks are available for Computing for Sustainable Water" on their Linux systems.
WCG-Team Captain of Team SETI.Germany
(official Partner of World Community Grid)
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline
I also had no issues here running all cores on three different Linux systems (4, 6, and 8 at a time), multiple times. Ubuntu 11.06 on all three: two AMD systems and one i7.
Distributed computing volunteer since September 27, 2000
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline
When I came home just now I discovered a stuck work unit on one of my Linux boxes. In the BOINC manager, there were two CFSW WUs appearing to be running concurrently (along with 2 WUs from other WCG sciences). One CFSW was fine. The other had incremented its elapsed time to 17 hrs but showed the last checkpoint at 3 hrs. Sure enough, top showed only 3 threads actually running.
I restarted the BOINC client. The elapsed time on the stuck WU went back down to 3 hrs, and now all 4 threads are running. I'll check back occasionally to make sure that the percent completed is incrementing. (This is a dual-core Atom, so it takes a long time for anything to happen...)
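The clue in this report is that the elapsed time (17 hrs) had run far ahead of the last checkpoint (3 hrs). A minimal Python sketch of that heuristic follows; the names `TaskTimes`, `possibly_stuck`, and `to_hours` and the 2-hour default gap are hypothetical choices. On a slow machine such as this Atom, the gap would need to be larger than the application's normal interval between checkpoints, or healthy tasks will be flagged.

```python
from dataclasses import dataclass


def to_hours(hms: str) -> float:
    """Convert an "hh:mm:ss" duration string into hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


@dataclass
class TaskTimes:
    name: str
    elapsed_hours: float     # elapsed time shown for the task in BOINC Manager
    checkpoint_hours: float  # time recorded at the task's last checkpoint


def possibly_stuck(task: TaskTimes, max_gap_hours: float = 2.0) -> bool:
    """True when elapsed time has run more than `max_gap_hours` past the
    last checkpoint. This is a hint, not proof, that the task is stuck."""
    return task.elapsed_hours - task.checkpoint_hours > max_gap_hours
```

With the numbers from this post, `possibly_stuck(TaskTimes("cfsw", 17.0, 3.0))` is true, while a task whose checkpoint trails its elapsed time by half an hour is not flagged.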
Jason1478963
Senior Cruncher United States Joined: Sep 18, 2005 Post Count: 295 Status: Offline
uplinger: additional info you requested
It appears to me that there are 5 stuck WUs, each getting this message in BoincView:

    Host              Project               Date                  Message
    Ramjet-OctiCore5  World Community Grid  4/19/2012 5:30:36 PM  Task cfsw_0010_00010215_0 exited with zero status but no 'finished' file
    Ramjet-OctiCore5  World Community Grid  4/19/2012 5:30:36 PM  If this happens repeatedly you may need to reset the project.

I have tried suspending all other work and running only 1 stuck WU at a time, but they all come up with that error message. I now have them suspended while my last 6 tasks run, and those seem to run fine. But I have 96 hours of runtime in the 5 stuck ones; I hate to abort them, but I see no other choice.

    ramjet@Ramjet-OctiCore5:~$ ps -ef | grep wcg
    boinc 5098 963 98 Apr18 ? 23:37:19 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
    boinc 5099 5098 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
    boinc 5100 5099 0 Apr18 ? 00:00:18 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
    boinc 12020 963 98 Apr18 ? 22:26:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
    boinc 12021 12020 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
    boinc 12022 12021 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
    boinc 12025 963 90 Apr18 ? 20:21:29 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
    boinc 12026 963 22 Apr18 ? 04:59:01 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
    boinc 12027 12025 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
    boinc 12028 963 4 Apr18 ? 01:04:23 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
    boinc 12029 12026 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
    boinc 12030 12027 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
    boinc 12031 963 0 Apr18 ? 00:12:12 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
    boinc 12032 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
    boinc 12033 12029 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
    boinc 12034 963 0 Apr18 ? 00:12:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
    boinc 12035 12028 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
    boinc 12036 12032 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
    boinc 12037 963 0 Apr18 ? 00:12:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
    boinc 12038 12034 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
    boinc 12039 12036 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
    boinc 12041 12035 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
    boinc 12043 12038 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
    boinc 12045 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
    boinc 12046 12031 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
    boinc 12047 12046 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
    boinc 12051 12045 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
    boinc 12052 12037 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
    boinc 12053 12052 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
    boinc 12054 12051 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
    ramjet 12415 12398 0 16:24 pts/0 00:00:00 grep --color=auto wcg
    ramjet@Ramjet-OctiCore5:~$

[Edit 1 times, last edit by Jason1478963 at Apr 20, 2012 12:49:43 AM]
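A listing like the ps output above is easier to read when it is grouped by task: each CFSW task shows up as several processes that share the same `-Q` query file. A small Python sketch, assuming the standard `ps -ef` column order (UID PID PPID C STIME TTY TIME CMD) and that the `-Q` argument identifies the task, as it appears to in the listing. The helper names are hypothetical, and the summary only tallies processes and CPU time per task; it does not by itself decide which tasks are stuck.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable


def cpu_seconds(time_field: str) -> int:
    """Convert a ps TIME value ("[dd-]hh:mm:ss") into seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    h, m, s = (int(x) for x in time_field.split(":"))
    return days * 86400 + h * 3600 + m * 60 + s


# Assumption: the query file passed with -Q identifies one CFSW task.
QUERY = re.compile(r"-Q\s+(\S+\.sql)")


def summarize_cfsw(ps_lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Group `ps -ef` rows for CFSW processes by their -Q query file and
    total the process count and CPU seconds for each group."""
    per_task: Dict[str, Dict[str, int]] = defaultdict(lambda: {"procs": 0, "cpu": 0})
    for line in ps_lines:
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or "wcgrid_cfsw" not in fields[7]:
            continue  # skips headers, the grep line itself and non-CFSW apps
        match = QUERY.search(fields[7])
        if match is None:
            continue
        entry = per_task[match.group(1)]
        entry["procs"] += 1
        entry["cpu"] += cpu_seconds(fields[6])
    return dict(per_task)
```

To use it on a live machine, pass `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout.splitlines()`, or paste the saved output and call `.splitlines()` on it.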
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1673 Status: Offline
Hi everybody,
I operate 3 Linux-based machines: Ubuntu 10.04 x64 LTS, BOINC 6.12.33. All the machines (AMD Phenom II x6, Athlon II x2) operate correctly and are set to CFSW only. Since I maintain a small buffer (0.5 day), the restriction causes a problem for me because the queue is running empty. On the other hand, HCMD2 has caused several crashes on the Linux machines over the last few weeks. I could select other projects (an option for a backup project would be very welcome), but since I am currently travelling, my ability to babysit my machines is really limited.
Cheers, Yves