World Community Grid Forums
Thread Status: Active Total posts in this thread: 57
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Now that you mention it...
7/9/2012 3:47:38 PM | World Community Grid | Computation for task SN2S_2X8L_1000226_0515_0 finished
7/9/2012 3:47:38 PM | World Community Grid | Starting task SN2S_2X8L_1000228_0607_0 using sn2s version 620 in slot 7
7/9/2012 3:47:41 PM | World Community Grid | Started upload of SN2S_2X8L_1000226_0515_0_0
7/9/2012 3:48:03 PM | World Community Grid | Temporarily failed upload of SN2S_2X8L_1000226_0515_0_0: connect() failed
7/9/2012 3:48:03 PM | World Community Grid | Backing off 3 min 0 sec on upload of SN2S_2X8L_1000226_0515_0_0
7/9/2012 3:48:06 PM | | Project communication failed: attempting access to reference site
7/9/2012 3:48:08 PM | | Internet access OK - project servers may be temporarily down.
7/9/2012 3:50:27 PM | World Community Grid | Computation for task SN2S_2X8L_1000225_0264_1 finished
7/9/2012 3:50:27 PM | World Community Grid | Starting task SN2S_2X8L_1000229_0340_0 using sn2s version 620 in slot 10
7/9/2012 3:50:29 PM | World Community Grid | Started upload of SN2S_2X8L_1000225_0264_1_0
7/9/2012 3:55:04 PM | World Community Grid | [error] Error reported by file upload server: Maintenance underway: file uploads are temporarily disabled.
7/9/2012 3:55:04 PM | World Community Grid | Temporarily failed upload of SN2S_2X8L_1000225_0264_1_0: transient upload error
7/9/2012 3:55:04 PM | World Community Grid | Backing off 3 min 26 sec on upload of SN2S_2X8L_1000225_0264_1_0
7/9/2012 3:55:04 PM | World Community Grid | Started upload of SN2S_2X8L_1000226_0515_0_0
7/9/2012 3:55:05 PM | World Community Grid | [error] Error reported by file upload server: Maintenance underway: file uploads are temporarily disabled.
7/9/2012 3:55:05 PM | World Community Grid | Temporarily failed upload of SN2S_2X8L_1000226_0515_0_0: transient upload error
7/9/2012 3:55:05 PM | World Community Grid | Backing off 6 min 16 sec on upload of SN2S_2X8L_1000226_0515_0_0
Also on a different computer...
Mon 09 Jul 2012 03:54:30 PM EDT World Community Grid Started download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.job
Mon 09 Jul 2012 03:54:53 PM EDT Project communication failed: attempting access to reference site
Mon 09 Jul 2012 03:54:53 PM EDT World Community Grid Temporarily failed download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.job: connect() failed
Mon 09 Jul 2012 03:54:53 PM EDT World Community Grid Backing off 1 min 0 sec on download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.job
Mon 09 Jul 2012 03:54:53 PM EDT World Community Grid Started download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.zip
Mon 09 Jul 2012 03:54:54 PM EDT Internet access OK - project servers may be temporarily down.
Mon 09 Jul 2012 03:55:15 PM EDT Project communication failed: attempting access to reference site
Mon 09 Jul 2012 03:55:15 PM EDT World Community Grid Temporarily failed download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.zip: connect() failed
Mon 09 Jul 2012 03:55:15 PM EDT World Community Grid Backing off 1 min 0 sec on download of SN2S_2X8L_1000234_0274_SN2S_2X8L_1000234_0274.zip
Mon 09 Jul 2012 03:55:17 PM EDT Internet access OK - project servers may be temporarily down.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jul 9, 2012 7:58:44 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi knreed.
I've not seen any more problems here, thanks. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My last commo error:
7/8/2012 7:37:21 PM World Community Grid Temporarily failed upload of E208569_806_C.29.C25H14N2OS.02254685.0.set1d06_1_4: HTTP error |
||
|
Bugg
Senior Cruncher USA Joined: Nov 19, 2006 Post Count: 271 Status: Offline |
Well, I just got this earlier today while I was at work:
7/9/2012 2:52:27 PM | World Community Grid | Sending scheduler request: To fetch work.
7/9/2012 2:52:27 PM | World Community Grid | Requesting new tasks for CPU
7/9/2012 2:52:28 PM | World Community Grid | Temporarily failed upload of c4cw_target06_075850701_0_0: HTTP error
7/9/2012 2:52:28 PM | World Community Grid | Backing off 13 min 42 sec on upload of c4cw_target06_075850701_0_0
7/9/2012 2:52:43 PM | | Project communication failed: attempting access to reference site
7/9/2012 2:52:45 PM | | Internet access OK - project servers may be temporarily down.
7/9/2012 2:52:49 PM | World Community Grid | Scheduler request failed: Couldn't connect to server
7/9/2012 2:52:52 PM | | Project communication failed: attempting access to reference site
7/9/2012 2:52:53 PM | | Internet access OK - project servers may be temporarily down.
7/9/2012 2:54:00 PM | World Community Grid | Sending scheduler request: To fetch work.
7/9/2012 2:54:00 PM | World Community Grid | Requesting new tasks for CPU
7/9/2012 2:54:22 PM | World Community Grid | Scheduler request failed: Couldn't connect to server
7/9/2012 2:54:37 PM | | Project communication failed: attempting access to reference site
7/9/2012 2:54:38 PM | | Internet access OK - project servers may be temporarily down.
7/9/2012 2:55:19 PM | World Community Grid | Computation for task c4cw_target06_075840594_0 finished
7/9/2012 2:55:19 PM | World Community Grid | Starting task c4cw_target06_075854098_0 using c4cw version 641
7/9/2012 2:55:21 PM | World Community Grid | Started upload of c4cw_target06_075840594_0_0
7/9/2012 2:55:24 PM | World Community Grid | [error] Error reported by file upload server: Maintenance underway: file uploads are temporarily disabled.
7/9/2012 2:55:24 PM | World Community Grid | Temporarily failed upload of c4cw_target06_075840594_0_0: transient upload error
7/9/2012 2:55:24 PM | World Community Grid | Backing off 10 min 30 sec on upload of c4cw_target06_075840594_0_0
Sorry knreed, but it seems that while it's pretty much gone, it's still there, just only very intermittently now. :)
----------------------------------------
i5-12600K (3.7GHz), 32GB DDR5, Win11 64bit Home |
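For anyone who wants to tally these failures rather than eyeball them, here is a minimal Python sketch. It assumes the pipe-delimited `timestamp | project | message` layout of the log above; the other layout posted earlier in the thread (weekday first, no pipes) is not handled. The names `parse_line` and `summarize` are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Message patterns copied from the wording of the log lines in this thread.
FAILED = re.compile(r"Temporarily failed (upload|download) of (\S+): (.+)")
BACKOFF = re.compile(r"Backing off (\d+) min (\d+) sec on (upload|download) of (\S+)")


def parse_line(line):
    """Split one 'timestamp | project | message' line; return None if it does not fit."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    stamp, project, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    except ValueError:
        return None
    return when, project, message


def summarize(lines):
    """Collect transfer failures and the latest backoff (in seconds) per file."""
    failures = []          # (timestamp, direction, file name, reason)
    backoff_seconds = {}   # file name -> most recent backoff in seconds
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _project, message = parsed
        failed = FAILED.search(message)
        if failed:
            failures.append((when, failed.group(1), failed.group(2), failed.group(3)))
            continue
        backoff = BACKOFF.search(message)
        if backoff:
            seconds = int(backoff.group(1)) * 60 + int(backoff.group(2))
            backoff_seconds[backoff.group(4)] = seconds
    return failures, backoff_seconds


# Three lines taken verbatim from the log above.
SAMPLE = [
    "7/9/2012 2:52:28 PM | World Community Grid | Temporarily failed upload of c4cw_target06_075850701_0_0: HTTP error",
    "7/9/2012 2:52:28 PM | World Community Grid | Backing off 13 min 42 sec on upload of c4cw_target06_075850701_0_0",
    "7/9/2012 2:52:43 PM | | Project communication failed: attempting access to reference site",
]

failures, backoffs = summarize(SAMPLE)
for when, direction, name, reason in failures:
    print(when.isoformat(), direction, name, reason)
for name, seconds in backoffs.items():
    print(name, "backing off", seconds, "seconds")
```

Lines that do not match the assumed layout are skipped rather than raising, so the same function can be pointed at a whole pasted log.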
||
|
motech
Cruncher Joined: Mar 30, 2007 Post Count: 23 Status: Offline |
Shortly after the last post in this thread we made our final change that should resolve this issue. We have seen significantly greater stability since then - those of you who were seeing this issue, can you confirm that you have not experienced it in the last 3 days?
Things are back to normal here. Thanks! |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
Actually I pretty much jinxed us. Within 20 minutes of me posting, we saw the issue re-appear. For those who are technically inclined:
The file system that supports the files for download (input files) and the files uploaded (result files) is built on GPFS and connected to a SAN. We recently expanded the file system from four nodes to five (the final step in our most recent round of growth work). We mistakenly decided to go with five quorum nodes so that the cluster would remain intact if we lost up to two servers. We failed to anticipate the additional load that having five quorum nodes would put on the File Manager and Cluster Manager. As a result, some existing issues were exposed.

What specifically happens is that the load on the File Manager and Cluster Manager becomes high enough that they cannot keep up with the incoming requests. Since the cluster had a tendency to place these onto nodes that were file upload/download servers (i.e. the busiest), the backup further degraded the performance of the server and caused the managers to fall further behind. Not the way you want performance to degrade during high load.....

The issue that was exposed was that there are some problems with MSI interrupts that contributed to the degraded performance, especially in regard to high network communication. Last week we added the pci=nomsi option to the boot command line for the servers to eliminate this issue. Additionally, we installed the latest fixpack for GPFS, which included a fix for an issue that we believed to be causing some delayed responses. Also, since the cluster was placing the File Manager and Cluster Manager onto the busiest node in the cluster, we manually forced them onto a quieter node (and set that node to automatically take over the manager roles when it is rebooted). Following that, we achieved stability.

Unfortunately, some maintenance work was performed in the environment on Sunday morning. This caused the File Manager and Cluster Manager to be moved, and the cluster again selected the busiest server for them. Eventually, we hit a situation where things degraded again. We have manually forced the managers back to the quieter node, but we are now going to change things to eliminate the possibility of the cluster moving the managers back to the busier nodes. This work will be completed over the next 24 hours. Specifically, we are switching to using 2 quorum nodes with a tiebreaker disk. This will reduce the overhead in managing the cluster and make sure that the managers stay on nodes that are quieter.

The good news in all of this is that as we were working through these problems, a lot of good work was done on optimizing the performance of the cluster and things are MUCH faster now. This next change should resolve the degrading performance issue and put us in good shape for a long time to come.
----------------------------------------
[Edit 1 times, last edit by knreed at Jul 10, 2012 1:54:18 PM] |
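The quorum trade-off described in this post comes down to simple majority arithmetic. Here is a rough Python sketch, assuming plain majority-based node quorum; it does not model GPFS internals or the tiebreaker disk itself, and the function names are hypothetical.

```python
def majority_quorum(quorum_nodes: int) -> int:
    """Smallest number of quorum nodes that forms a strict majority."""
    if quorum_nodes < 1:
        raise ValueError("need at least one quorum node")
    return quorum_nodes // 2 + 1


def tolerated_failures(quorum_nodes: int) -> int:
    """How many quorum nodes can be lost while a strict majority survives."""
    return quorum_nodes - majority_quorum(quorum_nodes)


# Compare a few cluster sizes under the plain-majority assumption.
for nodes in (2, 3, 5):
    print(
        f"{nodes} quorum nodes: majority is {majority_quorum(nodes)}, "
        f"can lose {tolerated_failures(nodes)}"
    )
```

Under this assumption five quorum nodes can lose two and still hold a majority, which matches the reasoning given for the five-node layout. Two quorum nodes on their own could not lose any, which is presumably why the two-node layout is paired with a tiebreaker disk.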
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
A few minutes ago, I had a synch between the WCG servers and my machines for CFSW. That synch went flawlessly and fast. All lights are flashing green and go here on my side. It is indeed a ...
Good day ![]() ; |
||
|