Thread Status: Active
Total posts in this thread: 49
This topic has been viewed 73703 times and has 48 replies
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Known Issue with Linux stuck workunits [Resolved]

http://www.worldcommunitygrid.org/forums/wcg/...33012_lastpage,yes#373862

There is a known issue with CFSW workunits on Linux potentially becoming stuck. Please see the known issues post linked above for the details. While we investigate, we are limiting the number of CFSW workunits sent to Linux computers to 1 per computer.

Thanks,
armstrdj

(edit: Changed to resolved -Uplinger)
----------------------------------------
[Edit 1 times, last edit by uplinger at May 10, 2012 8:28:04 PM]
[Apr 19, 2012 3:08:34 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Known Issue with Linux stuck workunits

Please note that these work units will show as Running in the BOINC manager. To check whether they are stalled, you will need to look at the system load or usage. An easy way to do this is to open a terminal and type "top". This will show you the active processes on the computer and their CPU usage. If your computer has 4 cores and 4 wcgrid_* processes running at the top of the list, then all is well. If you have fewer than 4 wcgrid_* processes running but expect to see more, then there is an issue. As armstrdj mentioned in the known issues post, you should be able to turn off Leave Application In Memory (LAIM) and suspend the work units. Then, after about 1 minute, resume the work units.

Note that there are other ways to find out whether you have stuck work units; this one should work across most Linux distros.
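
For anyone who would rather script this check than eyeball it, here is a minimal Python sketch of the same idea. It is not an official WCG tool. It assumes you have captured the process table as text (for example from top in batch mode), that the rows use top's default 12-column layout, and that the science applications can be recognized by a "wcgrid_" or "wcg_" name prefix. The function names and the 50% CPU threshold are made up for illustration.

```python
def count_busy_tasks(top_output, prefixes=("wcgrid_", "wcg_"), min_cpu=50.0):
    """Count science-app rows in top output that are actually using CPU.

    Assumes top's default columns:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    The prefixes and the min_cpu threshold are illustrative assumptions.
    """
    busy = 0
    for line in top_output.splitlines():
        fields = line.split()
        # Skip the header, summary lines, and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            # Some locales print a decimal comma in the %CPU column.
            cpu = float(fields[8].replace(",", "."))
        except ValueError:
            continue
        if fields[11].startswith(prefixes) and cpu >= min_cpu:
            busy += 1
    return busy


def looks_stalled(top_output, expected):
    """Return True when fewer science apps are busy than you expect to see."""
    return count_busy_tasks(top_output) < expected
```

With 4 cores all assigned to WCG work, looks_stalled(text, expected=4) returning True would be the cue to try the suspend and resume steps described above. A single sample can mislead, so it is worth checking more than once before acting.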

Thanks,
-Uplinger
[Apr 19, 2012 3:34:00 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Known Issue with Linux stuck workunits

I haven't run into this issue (as of yet) running either the beta or the production WUs, 4 at a time in all cases. I don't know if this will help, but my only 4-core machine is a Phenom II 3.0 GHz with 4 GB RAM running Fedora 16. Below is the snipped "top" output.
  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
23167 roger  39 19 61000  52m 1768 R 99.7  1.3  12:46.88 wcg_c4cw_lmps_6
23172 roger  39 19 61000  52m 1764 R 99.7  1.3   9:53.84 wcg_c4cw_lmps_6
 2336 roger  39 19  236m 235m 1540 R 99.4  6.0 115:48.91 wcgrid_cfsw_bay
23015 roger  39 19  236m 235m 1532 R 98.7  5.9  77:43.90 wcgrid_cfsw_bay

Added: Well that wasn't much help... I ran out of C4SW WU's.
----------------------------------------
[Edit 1 times, last edit by Former Member at Apr 19, 2012 6:31:46 PM]
[Apr 19, 2012 6:27:34 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Known Issue with Linux stuck workunits

Starbase, I'm happy you have not experienced the issue. We are hoping the problem is not widespread and are currently working on a fix.

Thanks,
-Uplinger
[Apr 19, 2012 6:42:29 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Known Issue with Linux stuck workunits

Hello uplinger.

For a stuck CFSW WU under Linux, just before resuming, do we turn LAIM back on or leave it turned off?

Also:
1] Is it 1 WU per computer regardless of the number of (true) cores in that computer?
2] What is the rule for issuing a fresh WU: quality-based or time-based? That is, is 1 WU issued only after the earlier WU has completed and validated without any issue, or is it 1 WU per 24 hrs regardless of how the earlier WU went?
3] Will machines that have crunched CFSW WUs so far without any issue still be subject to the 1-WU rule?

Thanks
----------------------------------------
[Edit 2 times, last edit by Former Member at Apr 19, 2012 7:18:26 PM]
[Apr 19, 2012 6:49:21 PM]
XSmeagolX
Senior Cruncher
Joined: Nov 12, 2009
Post Count: 444
Status: Offline
Re: Known Issue with Linux stuck workunits

Is this the reason why no Linux WUs are being sent out at the moment?
Some members of my team are receiving "World Community Grid 19-04-2012 18:24 No tasks are available for Computing for Sustainable Water" on their Linux systems.
----------------------------------------
WCG-Team Captain of Team SETI.Germany

(official Partner of World Community Grid)

[Apr 19, 2012 6:57:34 PM]
KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline
Re: Known Issue with Linux stuck workunits

I also had no issue here running all cores on three different Linux systems, 4, 6, and 8 at a time multiple times. Ubuntu 11.06 on all three. Two AMD systems and one i7.
----------------------------------------

Distributed computing volunteer since September 27, 2000
[Apr 19, 2012 8:40:13 PM]
kateiacy
Veteran Cruncher
USA
Joined: Jan 23, 2010
Post Count: 1027
Status: Offline
Re: Known Issue with Linux stuck workunits

When I came home just now I discovered a stuck work unit on one of my Linux boxes. In the BOINC manager, there were two CFSW WUs appearing to be running concurrently (along with 2 WUs from other WCG sciences). One CFSW was fine. The other had incremented its elapsed time to 17 hrs but showed the last checkpoint at 3 hrs. Sure enough, top showed only 3 threads actually running.

I restarted the BOINC client. The elapsed time on the stuck WU went back down to 3 hrs, and now all 4 threads are running. I'll check back occasionally to make sure that percent completed is incrementing. (This is a dual-core Atom, so it takes a long time for anything to happen....)
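
The symptom described here, elapsed time still climbing while the last checkpoint stays put, can be expressed as a simple rule of thumb. The Python sketch below is only an illustration: the function name and the 2-hour gap threshold are assumptions, and a sensible threshold depends on how often a given application normally checkpoints on your hardware.

```python
def looks_stuck(elapsed_hours, last_checkpoint_hours, max_gap_hours=2.0):
    """Flag a task whose elapsed time has run far ahead of its last checkpoint.

    max_gap_hours is an illustrative assumption, not a WCG-recommended value.
    """
    return (elapsed_hours - last_checkpoint_hours) > max_gap_hours


# The case from this post: 17 hrs elapsed, last checkpoint at 3 hrs.
print(looks_stuck(17, 3))
```

A large gap is only a hint. Confirming with top that the process is using no CPU is still the more direct check.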
----------------------------------------

[Apr 19, 2012 8:48:54 PM]
Jason1478963
Senior Cruncher
United States
Joined: Sep 18, 2005
Post Count: 295
Status: Offline
Re: Known Issue with Linux stuck workunits

uplinger: additional info you requested

It appears to me like there are 5 stuck wu's, each getting this message in boincview:
Host Project Date Message
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM Task cfsw_0010_00010215_0 exited with zero status but no 'finished' file
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM If this happens repeatedly you may need to reset the project.

I have tried suspending all other work and running only 1 stuck WU at a time, but they all come up with that error message. I now have them suspended and am running my last 6 tasks, which seem to run fine. But I have 96 hours of runtime in the 5 stuck ones; I hate to abort them, but I see no other choice.


ramjet@Ramjet-OctiCore5:~$ ps -ef | grep wcg
boinc 5098 963 98 Apr18 ? 23:37:19 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5099 5098 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5100 5099 0 Apr18 ? 00:00:18 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 12020 963 98 Apr18 ? 22:26:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12021 12020 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12022 12021 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12025 963 90 Apr18 ? 20:21:29 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12026 963 22 Apr18 ? 04:59:01 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12027 12025 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12028 963 4 Apr18 ? 01:04:23 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12029 12026 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12030 12027 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12031 963 0 Apr18 ? 00:12:12 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12032 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12033 12029 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12034 963 0 Apr18 ? 00:12:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12035 12028 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12036 12032 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12037 963 0 Apr18 ? 00:12:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12038 12034 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12039 12036 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12041 12035 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12043 12038 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12045 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12046 12031 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12047 12046 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12051 12045 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12052 12037 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12053 12052 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12054 12051 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
ramjet 12415 12398 0 16:24 pts/0 00:00:00 grep --color=auto wcg
ramjet@Ramjet-OctiCore5:~$
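
A listing like this is easier to read when the CPU time is totalled per workunit. The Python sketch below does that for text pasted from ps -ef. It assumes the standard ps -ef columns (UID PID PPID C STIME TTY TIME CMD) and treats the file name after -Q as a workunit label, purely because that is how it appears in the listing above. The function names and the "wcgrid_cfsw" marker are illustrative choices.

```python
def parse_cpu_time(text):
    """Convert a ps TIME value ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_time_by_workunit(ps_output, marker="wcgrid_cfsw"):
    """Total CPU seconds per -Q argument across matching ps -ef rows."""
    totals = {}
    for line in ps_output.splitlines():
        fields = line.split()
        # Columns: UID PID PPID C STIME TTY TIME CMD [args...]
        if len(fields) < 8 or marker not in fields[7]:
            continue
        args = fields[8:]
        if "-Q" not in args:
            continue
        position = args.index("-Q")
        if position + 1 >= len(args):
            continue
        workunit = args[position + 1]
        totals[workunit] = totals.get(workunit, 0) + parse_cpu_time(fields[6])
    return totals
```

Workunits whose totals stay far below the others after a similar start time may be worth a closer look, although a low total on its own does not prove a task is stuck.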
----------------------------------------
[Edit 1 times, last edit by Jason1478963 at Apr 20, 2012 12:49:43 AM]
[Apr 19, 2012 9:38:49 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1673
Status: Offline
Re: Known Issue with Linux stuck workunits

Hi everybody,
I operate 3 Linux-based machines: Ubuntu 10.04 x64 LTS, BOINC 6.12.33.
All the machines (AMD Phenom II X6, Athlon II X2) operate correctly and are set to CFSW only.
Since I maintain a small buffer (0.5 day), the restriction causes a problem for me because the queue is running empty.
On the other hand, HCMD2 has caused several crashes on the Linux machines over the last few weeks.
I could select other projects (an option for a backup project would be very welcome), but since I am currently travelling, my ability to babysit my machines is really limited.
Cheers,
Yves
----------------------------------------
[Apr 19, 2012 10:38:26 PM]