World Community Grid - View Thread - "400 slots directories" work unit error

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: "400 slots directories" work unit error

Thread Status: Active
Total posts in this thread: 38

This topic has been viewed 3916 times and has 37 replies

Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
Status: Offline

Re: "400 slots directories" work unit error

Hmm, and what if somebody has a machine with more than 400 threads? ;)

----------------------------------------

[Mar 25, 2014 4:10:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: "400 slots directories" work unit error

As was posted earlier in this thread, the limit is ncpus * 100 slots, meaning that a device with 8 cores can have 800 slots without a problem. If a device has 400 threads, the possible slot count would be... well, use the calculator. ;)
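For anyone who would rather not reach for the calculator, here is a minimal sketch of that arithmetic. It only restates the ncpus * 100 rule as described in this thread; the function name `slot_limit` is made up for illustration and is not taken from the BOINC source.

```python
def slot_limit(ncpus: int, slots_per_cpu: int = 100) -> int:
    """Upper bound on slot directories under the ncpus * 100 rule quoted above."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return ncpus * slots_per_cpu


# A 4-CPU host, the 8-core example, and the hypothetical 400-thread machine.
for ncpus in (4, 8, 400):
    print(f"{ncpus} CPUs -> limit of {slot_limit(ncpus)} slot directories")
```

For a 4-CPU host this gives 400, which is the figure in the error message this thread is named after.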

[Mar 25, 2014 4:25:08 PM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline

Re: "400 slots directories" work unit error

Slots subdirectories are numbered 0 through 5, but contain many, many subdirectories (as shown in this iPhone screenshot of the Ranger text-based directory browser).

----------------------------------------

[Mar 26, 2014 2:23:42 AM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline


Re: "400 slots directories" work unit error

So yesterday I also attached rosetta@home due to these CEP2 shenanigans in the hopes of lessening the occurrence of this 400 slots error, but it has occurred again.

This time, it's telling me that the rosetta work units can't be started, which means that I've been posting information about potentially-innocent work units all along (my other rigs have no problem with Rosetta units).

Here are the CEP2 work units which were active at the time of the error message:

======== Tasks ========
1) -----------
   name: E220189_761_K.22.C16FH9N2SSeSi.00303272.1.set1d06_0
   WU name: E220189_761_K.22.C16FH9N2SSeSi.00303272.1.set1d06
   project URL: http://www.worldcommunitygrid.org/
   report deadline: Tue Apr  1 04:05:55 2014
   ready to report: no
   got server ack: no
   final CPU time: 0.000000
   state: downloaded
   scheduler state: preempted
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: SUSPENDED
   app version num: 640
   checkpoint CPU time: 10235.308969
   current CPU time: 11137.662200
   fraction done: 0.171877
   swap size: 371490816.000000
   working set size: 182804480.000000
   estimated CPU time remaining: 15115.875905
2) -----------
   name: E220224_281_K.22.C17FH13N2Si2.00249792.1.set1d06_0
   WU name: E220224_281_K.22.C17FH13N2Si2.00249792.1.set1d06
   project URL: http://www.worldcommunitygrid.org/
   report deadline: Wed Apr  2 03:36:09 2014
   ready to report: no
   got server ack: no
   final CPU time: 0.000000
   state: downloaded
   scheduler state: preempted
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: SUSPENDED
   app version num: 640
   checkpoint CPU time: 8304.195829
   current CPU time: 9078.102400
   fraction done: 0.140094
   swap size: 382177280.000000
   working set size: 153739264.000000
   estimated CPU time remaining: 16692.658237
3) -----------
   name: E220223_464_K.23.C19FH11N2S.00307929.3.set1d06_0
   WU name: E220223_464_K.23.C19FH11N2S.00307929.3.set1d06
   project URL: http://www.worldcommunitygrid.org/
   report deadline: Wed Apr  2 03:36:09 2014
   ready to report: no
   got server ack: no
   final CPU time: 0.000000
   state: downloaded
   scheduler state: preempted
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: SUSPENDED
   app version num: 640
   checkpoint CPU time: 6429.009351
   current CPU time: 7012.239300
   fraction done: 0.108214
   swap size: 385277952.000000
   working set size: 190152704.000000
   estimated CPU time remaining: 18458.219960

----------------------------------------

[Mar 26, 2014 11:49:05 AM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline


Re: "400 slots directories" work unit error

Also, oddly enough, this problem appears to repeat every day shortly after my network availability window opens (it is open from 3 a.m. to 5 a.m.).

----------------------------------------

[Mar 27, 2014 12:08:38 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: "400 slots directories" work unit error

To me it looks like something is making BOINC act on what it holds in memory rather than the true state on the disc, or vice versa: the directory structure information is not being updated fast enough. Networking takes a fair amount of CPU, although one would think a subsystem would deal with that. Set cc_config to allow only one upload or download thread at any time. Your 2 hour window would be enough for that. Also, concurrent uploading of CEP2 results to Harvard does not gain any time in the end; the overhead could even be making it slower. For instance, try these config options:

<max_file_xfers>2</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>
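As a rough sketch of where those two lines would go: in cc_config.xml they sit inside an `<options>` element under the `<cc_config>` root. The snippet below only builds that text with the values suggested above; the helper name `build_cc_config` is hypothetical, and where to save the result is an assumption (the BOINC data directory, which is /var/lib/boinc on the system discussed in this thread).

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_file_xfers: int = 2,
                    max_file_xfers_per_project: int = 1) -> str:
    """Return cc_config.xml text containing the two transfer limits."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_file_xfers").text = str(max_file_xfers)
    ET.SubElement(options, "max_file_xfers_per_project").text = str(
        max_file_xfers_per_project
    )
    # ET.indent only exists on Python 3.9+; the output is valid XML without it.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

After saving the file, the client has to re-read it, for example by restarting the BOINC service.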

I think you resolved the issue of not uploading/fetching/reporting until 30 minutes before the network window closes. That's a BOINC feature, by the way: from that point it does so immediately, right up to T minus zero. In the latest test agents, reporting is hardcoded to happen at least every 60 minutes, with no more waiting of up to 24 hours. This applies to both network and CPU scheduling.

[Mar 27, 2014 12:26:54 PM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline


Re: "400 slots directories" work unit error

lava,

Does your suggestion remain the same even though the project updates (both uploads and downloads) complete successfully every morning?

----------------------------------------

[Mar 27, 2014 2:29:53 PM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline


Re: "400 slots directories" work unit error

Congratulations for being the first with that error message. I suggest that you reboot and see if you can get it again. If so, please post the first 50 or so lines in your event log so that everybody can see what sort of system you have.

Lawrence,
Since this thread began, I have rebooted my system, and the error has recurred (my more recent posts) after this reboot.

After running (on Arch Linux)

sudo systemctl restart boinc.service

The following is displayed:

[user@system ~]$ boinccmd --get_messages
1: 27-Mar-2014 09:33:20 (low) [] cc_config.xml not found - using defaults
2: 27-Mar-2014 09:33:20 (low) [] Starting BOINC client version 7.2.42 for x86_64-pc-linux-gnu
3: 27-Mar-2014 09:33:20 (low) [] log flags: file_xfer, sched_ops, task
4: 27-Mar-2014 09:33:20 (low) [] Libraries: libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libssh2/1.4.3
5: 27-Mar-2014 09:33:20 (low) [] Data directory: /var/lib/boinc
6: 27-Mar-2014 09:33:20 (low) [] No usable GPUs found
7: 27-Mar-2014 09:33:20 (low) [] Host name: Archer
8: 27-Mar-2014 09:33:20 (low) [] Processor: 4 GenuineIntel Intel(R) Core(TM) i5-3470T CPU @ 2.90GHz [Family 6 Model 58 Stepping 9]
9: 27-Mar-2014 09:33:20 (low) [] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
10: 27-Mar-2014 09:33:20 (low) [] OS: Linux: 3.13.6-1-ARCH
11: 27-Mar-2014 09:33:20 (low) [] Memory: 11.64 GB physical, 1024.00 MB virtual
12: 27-Mar-2014 09:33:20 (low) [] Disk: 54.90 GB total, 9.42 GB free
13: 27-Mar-2014 09:33:20 (low) [] Local time is UTC -5 hours
14: 27-Mar-2014 09:33:20 (low) [rosetta@home] URL http://boinc.bakerlab.org/rosetta/; Computer ID 1751517; resource share 100
15: 27-Mar-2014 09:33:20 (low) [World Community Grid] URL http://www.worldcommunitygrid.org/; Computer ID 2757007; resource share 100
16: 27-Mar-2014 09:33:20 (low) [World Community Grid] General prefs: from World Community Grid (last modified 23-Feb-2014 17:04:20)
17: 27-Mar-2014 09:33:20 (low) [World Community Grid] Computer location: home
18: 27-Mar-2014 09:33:20 (low) [] General prefs: using separate prefs for home
19: 27-Mar-2014 09:33:20 (low) [] Preferences:
20: 27-Mar-2014 09:33:20 (low) [] max memory usage when active: 10727.47MB
21: 27-Mar-2014 09:33:20 (low) [] max memory usage when idle: 11919.41MB
22: 27-Mar-2014 09:33:20 (low) [] max disk usage: 11.97GB
23: 27-Mar-2014 09:33:20 (low) [] don't use GPU while active
24: 27-Mar-2014 09:33:20 (low) [] (to change preferences, visit a project web site or select Preferences in the Manager)
25: 27-Mar-2014 09:33:20 (low) [] Not using a proxy
26: 27-Mar-2014 09:33:24 (low) [] Running CPU benchmarks
27: 27-Mar-2014 09:33:24 (low) [] Suspending computation - CPU benchmarks in progress
28: 27-Mar-2014 09:33:24 (low) [] Suspending network activity - time of day
29: 27-Mar-2014 09:33:56 (low) [] Benchmark results:
30: 27-Mar-2014 09:33:56 (low) [] Number of CPUs: 4
31: 27-Mar-2014 09:33:56 (low) [] 3207 floating point MIPS (Whetstone) per CPU
32: 27-Mar-2014 09:33:56 (low) [] 12803 integer MIPS (Dhrystone) per CPU

Also, immediately after restarting the boinc client, the 400 slots directories errors resume:

33: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
34: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
35: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
36: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] Can't create task for yrssfrv2d3_3_fold_SAVE_ALL_OUT_155176_2526_0
37: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
38: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] Can't create task for foldit_997258_1018_fold_SAVE_ALL_OUT_155433_1132_1
39: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
40: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
41: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
42: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] Can't create task for yrssfrv2d3_3_fold_SAVE_ALL_OUT_155176_2526_0
43: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
44: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] Can't create task for foldit_997258_1018_fold_SAVE_ALL_OUT_155433_1132_1
45: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
46: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
47: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
48: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] Can't create task for yrssfrv2d3_3_fold_SAVE_ALL_OUT_155176_2526_0
49: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
50: 27-Mar-2014 09:35:58 (internal error) [rosetta@home] [error] Can't create task for foldit_997258_1018_fold_SAVE_ALL_OUT_155433_1132_1
51: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
52: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
53: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
54: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] Can't create task for yrssfrv2d3_3_fold_SAVE_ALL_OUT_155176_2526_0
55: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
56: 27-Mar-2014 09:36:58 (internal error) [rosetta@home] [error] Can't create task for foldit_997258_1018_fold_SAVE_ALL_OUT_155433_1132_1
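To see at a glance which tasks keep failing and how often, the message list can be tallied with a short script. This is a minimal sketch: the line layout in the regular expression is inferred from the output above rather than from any BOINC documentation, and the function name `failed_task_counts` is made up for illustration.

```python
import re
from collections import Counter

# Layout of one `boinccmd --get_messages` line, inferred from the output above:
#   <seq>: <date> <time> (<priority>) [<project>] <message text>
LINE_RE = re.compile(
    r"^\d+: \S+ \S+ \((?P<priority>[^)]*)\) \[(?P<project>[^\]]*)\] (?P<text>.*)$"
)
TASK_PREFIX = "[error] Can't create task for "


def failed_task_counts(log_text):
    """Count "Can't create task" errors per (project, task name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # wrapped or unrelated lines are skipped
        text = match.group("text")
        if text.startswith(TASK_PREFIX):
            task = text[len(TASK_PREFIX):].strip()
            counts[(match.group("project"), task)] += 1
    return counts


# Four lines copied from the log above.
sample = """\
33: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] exceeded limit of 400 slot directories
34: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
36: 27-Mar-2014 09:33:57 (internal error) [rosetta@home] [error] Can't create task for yrssfrv2d3_3_fold_SAVE_ALL_OUT_155176_2526_0
40: 27-Mar-2014 09:34:57 (internal error) [rosetta@home] [error] Can't create task for bc000060_fold_SAVE_ALL_OUT_155248_441_0
"""

for (project, task), n in failed_task_counts(sample).items():
    print(f"{project}: {task} failed to start {n} time(s)")
```

In the excerpt above, the same three Rosetta tasks are retried about once a minute, so each of them appears four times.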

----------------------------------------
[Edit 2 times, last edit by Mgruben at Mar 27, 2014 2:39:12 PM]

[Mar 27, 2014 2:35:07 PM]

Mgruben
Advanced Cruncher
Joined: May 26, 2013
Post Count: 94
Status: Offline


Re: "400 slots directories" work unit error

When you look in the /boinc/slots place now, does it show that many, i.e. slots/399 as the highest? If not, do the slots plus the sub-directories thereof add up to this number? As lawrenceharding commented, this had not been seen here before your report, so there's something special about your system. Is it caching the disc structures and not writing the updates to disc? Look at write-to-disc delays. If there's a cache-flush command in Linux, run that.
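The counting asked for above could be done with a sketch like the following. It assumes nothing about how BOINC itself counts slots; it simply reports the numbered top-level directories, the highest number among them, and how many nested directories they contain. The function name `summarize_slots` is hypothetical, and the demo runs on a throwaway directory rather than a real slots directory.

```python
import os
import tempfile


def summarize_slots(slots_dir):
    """Count top-level slot directories and the directories nested inside them."""
    with os.scandir(slots_dir) as entries:
        top = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    numbers = sorted(
        int(os.path.basename(p)) for p in top if os.path.basename(p).isdigit()
    )
    nested = 0
    for path in top:
        # os.walk yields the subdirectory names at every level below each slot.
        for _root, dirs, _files in os.walk(path):
            nested += len(dirs)
    return {
        "top_level": len(top),
        "highest_slot": numbers[-1] if numbers else None,
        "nested": nested,
        "total": len(top) + nested,
    }


# Demo on a temporary tree: slots 0, 1 and 5, with two nested levels under 5.
with tempfile.TemporaryDirectory() as demo:
    os.makedirs(os.path.join(demo, "0"))
    os.makedirs(os.path.join(demo, "1"))
    os.makedirs(os.path.join(demo, "5", "a", "b"))
    print(summarize_slots(demo))
```

On the system in this thread the call would be `summarize_slots("/var/lib/boinc/slots")`, probably run as a user that is allowed to read that directory.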

The "something special" may be that my /var/lib/boinc/slots directory resides on a 7.7GB RAMdisk allocation, though I'm personally not seeing how that would be relevant.

[user@system ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        55G   43G  9.5G  82% /
dev             5.9G     0  5.9G   0% /dev
run             5.9G  360K  5.9G   1% /run
tmpfs           5.9G   47M  5.8G   1% /dev/shm
tmpfs           5.9G     0  5.9G   0% /sys/fs/cgroup
tmpfs           5.9G     0  5.9G   0% /tmp
none            7.7G  2.9G  4.8G  38% /var/lib/boinc/slots

[user@system ~]$ cat /etc/fstab 
# 
# /etc/fstab: static file system information
#
# <file system> <dir>   <type>  <options>       <dump>  <pass>
# /dev/sda1
UUID=xxxnotrelevantxxx       /               ext4            defaults,noatime,discard        0 1
none                                     /var/lib/boinc/slots   tmpfs    nodev,nosuid,noexec,nodiratime,size=7783M 0 0

# /swapfile
/swapfile                                       none            swap            defaults                        0 0
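To double-check which fstab entry backs a given path, and with which options, a small parser is enough. This is a minimal sketch that works on fstab-style text only (it does not query the running kernel), the function name `mount_for` is made up for illustration, and the sample entries are the ones shown above.

```python
def mount_for(path, fstab_text):
    """Return (mount point, fs type, options) of the longest matching fstab entry."""
    best = None
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if not mount_point.startswith("/"):
            continue  # swap entries use "none" as their mount point
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type, options.split(","))
    return best


sample_fstab = """\
UUID=xxxnotrelevantxxx / ext4 defaults,noatime,discard 0 1
none /var/lib/boinc/slots tmpfs nodev,nosuid,noexec,nodiratime,size=7783M 0 0
/swapfile none swap defaults 0 0
"""

print(mount_for("/var/lib/boinc/slots/0", sample_fstab))
print(mount_for("/var/lib/boinc/projects", sample_fstab))
```

With these entries, anything under /var/lib/boinc/slots resolves to the tmpfs line, while the rest of the BOINC data directory resolves to the ext4 root filesystem.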

----------------------------------------
[Edit 2 times, last edit by Mgruben at Mar 27, 2014 2:45:58 PM]

[Mar 27, 2014 2:43:23 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: "400 slots directories" work unit error

gruby, yes, as to me it's evident that your Linux system in question is not able to keep up in some way. Setting the traps can eliminate possibilities, all the more because you have now added that it happens specifically during the networking window.

In a previous post I also mentioned write delays and cache flushing. Maybe your disk subsystem needs investigating. Is the controller doing fine, for instance? Otherwise, I suggest you take this riddle to the developers. There are a lot of very knowledgeable people on the alpha mail list.

[Mar 27, 2014 2:46:27 PM]
