World Community Grid Forums
Thread Status: Active Total posts in this thread: 52
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I also had a bunch of errors from several machines. It seems to me that it happened when the current in-progress WU finished and the cached WUs were the old version.
One machine (my slowest) finished a longer running HPF2 WU at 11:14 that validated OK. 3 cached HCC WUs errored out immediately with this error:

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>wcg_hcc1_img_6.08_windows_intelx86</file_name>
<error_code>-120</error_code>
<error_message>signature verification failed</error_message>
</file_xfer_error>
</message>
]]>

Cache was filled with new WUs (4 HCC), but I don't have access to that machine atm, and there are no wingman results yet to tell about the app version.

My faster machine (2 core) reported 3 WUs (1 HCC, 2 HPF2) at 03:19 which are valid (1 HPF2 and HCC) resp. PV (1 HPF2). All these old versions. At the same time the machine received 1 HCC 6.40 unit. At 06:18 another HCC with old version was reported that is now in PV. At the same time 18 WUs (3 C4CW, 8 HPF2, 7 HCC) were dumped with errors. Cache was refilled at that time with 11 HPF2 and 3 HCC (all version 6.40). From the fresh downloads 2 HCC and 2 HPF2 are already in PV (= finished without error).

One other machine (also a slow one) is still crunching on a long running HPF2 WU, and still has 1 HPF2, 1 HCC and 1 C4CW with old versions in cache. I will watch what happens when the current WU finishes. I expect that the cached units are then dumped and the cache refilled.

All machines (except the first one listed above) are running Boinc 6.10.58 on Windows 32-bit (XP for the first 2 listed above, Win7 for the last one).

HTH to understand what the reason is for this.

Greetings
Thorsten |
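For anyone comparing result logs across machines, here is a minimal sketch of pulling the interesting fields out of an error log like the one quoted above. The helper names (`tag`, `summarize`) are made up for illustration and are not part of BOINC. The log is not a single well-formed XML document, so the sketch uses regular expressions instead of an XML parser.

```python
import re

# Sample text copied from the result log quoted in this post.
RESULT_LOG = """<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>wcg_hcc1_img_6.08_windows_intelx86</file_name>
  <error_code>-120</error_code>
  <error_message>signature verification failed</error_message>
</file_xfer_error>
</message>
]]>"""


def tag(text, name):
    """Return the text inside the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def summarize(log):
    """Collect the fields that identify a failed application download."""
    code = tag(log, "error_code")
    return {
        "client": tag(log, "core_client_version"),
        "file": tag(log, "file_name"),
        "code": int(code) if code is not None else None,
        "message": tag(log, "error_message"),
    }


summary = summarize(RESULT_LOG)
print(summary)
```

Running the same function over the logs of several failed results makes it easy to see whether they all name an old application file and the same error code.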
||
|
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 769 Status: Offline |
All appears OK now; not sure why my 3 Linux 64 and other Win PCs did not get hit (yet). Spoke too soon - another 2 Windows PCs have errored. Caches re-filled. Linux OK so far...
Paul.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I also lost several WUs on 5 different machines. 3 of them only run DDDT2 and the other 2 only run HCMD2. Most WUs ended in error with 0 CPU Time, but some of them finished, for example:
CMD2_ 1544-1JWY_ A.clustersOccur-3BRW_ B.clustersOccur_ 38_ 91610_ 92351_ 0-- 615 Error 2/27/11 23:07:58 3/1/11 16:36:50 9.63 48.9 / 0.0 Log:
|
||
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18665 Status: Offline |
I just reset the max_results_day setting for users who likely experienced this issue. You should be able to force an update and rebuild your cache. We apologize that this has impacted some users. We did not see this in the test cases we ran prior to performing this change. The client should have completed the old workunits with the old application versions without disruption (the client checks the signature when the application is downloaded).

At this point the most critical thing that we need to know is if there is anyone who experienced this issue who is not able to start running the new 6.40 versions as they are downloaded. You should be able to use the 'update project' button to fetch new work. Please try this now if you were limited after the workunits crashed. Then please post if you experienced the issue and then let us know if
A) You cannot run the 6.40 versions
B) You are able to run the 6.40 versions correctly
thanks - and we apologize for the issues.

Kevin, had this hit me overnight as well. The signature verification error was the first thing I saw in Boinc's messages. Then the messages about files missing and everything after that got an immediate computation error.

Since this started about 2AM on my machine, I suspect that, while you could have replaced the key file late in the day yesterday, it probably wasn't at 1AM or 2AM in the morning. I'm wondering if that just happened to be the first time my machine asked to download/upload files and happened to get the new key file then. Is this in a separate file by itself or in another file with other data? Pure speculation on my part is that it's in a file that doesn't have a unique-per-version-name and it got overlaid at that time. Anything after that would fail on the signature verification until you had gone through your pre-6.40 cache.

When I found this this morning, I could only download one WU per core at first. Once those initial WUs completed and were returned, I was able to refill my cache.
Just some thoughts that hopefully help. |
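To make the key-file speculation above concrete, here is a toy sketch of why a file signed under an old key stops verifying once the verifier holds a new public key. It assumes an RSA-style signing scheme and uses textbook RSA with small, well-known Mersenne primes and no padding. All names and the sample file contents are hypothetical, and this is not how the BOINC client actually implements its check.

```python
import hashlib

E = 65537  # public exponent shared by both toy keys


def make_key(p, q):
    """Build a textbook-RSA key from two primes: returns (n, d)."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # private exponent (needs Python 3.8+)
    return n, d


def digest(data, n):
    """Hash the file contents and reduce the hash into the key's range."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(data, key):
    """Sign with the private exponent d."""
    n, d = key
    return pow(digest(data, n), d, n)


def verify(data, signature, n):
    """Check a signature using only the public modulus n and exponent E."""
    return pow(signature, E, n) == digest(data, n)


# Toy "old" and "new" project keys (hypothetical, far too small for real use).
old_key = make_key(2**61 - 1, 2**89 - 1)
new_key = make_key(2**107 - 1, 2**127 - 1)

# An application file that was signed before the key change.
old_app = b"contents of an application file signed before the change"
old_sig = sign(old_app, old_key)

ok_before = verify(old_app, old_sig, old_key[0])  # verifier still has old key
ok_after = verify(old_app, old_sig, new_key[0])   # verifier got the new key
print(ok_before, ok_after)
```

If the client really does re-check cached application files against whatever public key it currently holds, this would match the pattern described in the thread: old-version work fails immediately, while freshly downloaded 6.40 work runs normally.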
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
... did you have anything unusual set up on your machines other than a large cache?

Just to note that it happens with any non-zero cache, it's just less obvious. In the short cache scenario, the older cached WU fails and is replaced with the newer cached 6.40 WU in less than a second, so I'm guessing most users with a short cache haven't even noticed. e.g.

01-Mar-2011 20:41:41 [World Community Grid] Starting CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1
01-Mar-2011 20:41:41 [World Community Grid] [error] Signature verification failed for wcg_hcmd2_maxdo_6.15_i686-pc-linux-gnu
01-Mar-2011 20:41:41 [World Community Grid] Starting CMD2_1549-2QZU_A.clustersOccur-2Q7N_A.clustersOccur_15_64357_65107_64618_65107_0
01-Mar-2011 20:41:41 [World Community Grid] Starting task CMD2_1549-2QZU_A.clustersOccur-2Q7N_A.clustersOccur_15_64357_65107_64618_65107_0 using hcmd2 version 640
01-Mar-2011 20:41:42 [World Community Grid] Computation for task CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1 finished
01-Mar-2011 20:41:42 [World Community Grid] Output file CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1_0 for task CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1 absent

If you check how many WUs have dumped with 0 elapsed time and error -120 or -163, I guess you'll find huge numbers. |
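If you want to check your own message log for this, a rough sketch along these lines counts the signature failures per application file. The log text is the excerpt from this post, and the pattern assumes the message wording shown there, so other client versions may phrase it differently.

```python
import re
from collections import Counter

# Excerpt of the client message log quoted in this post.
LOG = """\
01-Mar-2011 20:41:41 [World Community Grid] Starting CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1
01-Mar-2011 20:41:41 [World Community Grid] [error] Signature verification failed for wcg_hcmd2_maxdo_6.15_i686-pc-linux-gnu
01-Mar-2011 20:41:41 [World Community Grid] Starting task CMD2_1549-2QZU_A.clustersOccur-2Q7N_A.clustersOccur_15_64357_65107_64618_65107_0 using hcmd2 version 640
01-Mar-2011 20:41:42 [World Community Grid] Computation for task CMD2_1551-2QZU_A.clustersOccur-2RC4_A.clustersOccur_7_1 finished
"""

# Assumed message format, taken from the excerpt above.
FAILURE = re.compile(r"\[error\] Signature verification failed for (\S+)")


def count_signature_failures(log_text):
    """Count signature failures per application file name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


failures = count_signature_failures(LOG)
for file_name, count in failures.items():
    print(file_name, count)
```

On a machine with a short cache this may be the only visible trace, since the failed task is replaced almost immediately.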
||
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
I haven't seen this on any of my machines. I've got 8 running and they all have different OS, processors, cache settings, and speeds.
Yes, I've been seeing a lot of repair units, but I would expect more if this were a universal problem.
----------------------------------------
Distributed computing volunteer since September 27, 2000 |
||
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline |
I agree with KWSN - A Shrubbery. I have lots of machines running and none of them has this problem. Is this problem only with DDDT2? I ask because I am not running that at this time. Is this going to be a continuing problem until all of the old versions get cleared out? Or is it more of a time thing? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I run BOINC 6.10.58 x64 across all my machines.
Machine #4 errored out its queue early this morning (16 hours ago). This one was crunching HFCC. This problem seems to be completely random and independent of hardware or BOINC version. I have 2 machines left in the "farm" that didn't have any problems (yet). Both are crunching HFCC and they have the exact same hardware, BOINC client, OS and internet connection as the one described above. The machines described in my initial post have different processors but everything else is the same. My 2 home computers seem to be unaffected crunching DDDT2 with 6.10.58 and Win 7 x64. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I run BOINC 6.10.56 on all my machines.
I just noticed my WUs for one machine were all terminating with Error and it's not restricted to DDDT2. For example, from Results Status, is the following:

| Result Name | Device Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
| --- | --- | --- | --- | --- | --- | --- |
| ts01_ b137_ pr02a0_ 1-- | DAVETHOMPSON-PC | Error | 2/27/11 02:21:06 | 3/2/11 03:15:53 | 0.00 | 0.0 / 0.0 |
| ts01_ b129_ pr45b1_ 1-- | DAVETHOMPSON-PC | Error | 2/26/11 23:15:27 | 3/2/11 03:15:53 | 0.00 | 0.0 / 0.0 |
| ts01_ b124_ pca009_ 0-- | DAVETHOMPSON-PC | Error | 2/26/11 21:55:59 | 3/2/11 03:15:53 | 0.00 | 0.0 / 0.0 |

This is on four pages of the Results Status for a total of 60 WUs in Error. All of the WUs are for DDDT2 except one which is for CEP2. One of the first Error returns is the following:

Result Log
Result Name: ts01_ b096_ pr45b1_ 0--
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>wcg_dddt2_charmm_6.17_windows_intelx86</file_name>
<error_code>-120</error_code>
<error_message>signature verification error</error_message>
</file_xfer_error>
</message>
]]>

This does not appear for most of the other Errors. I exited BOINC on that one machine and restarted; things seem to be running normally.
----------------------------------------
[Edit 1 times, last edit by Former Member at Mar 2, 2011 6:58:29 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I suggest editing the OP title to remove the DDDT2 part, since this is proving not to be science-specific, and having an admin move this thread to the BOINC Support forum, where it is likely to be seen by more members. One member was correlating this with running Betas recently.
thanks. |
||
|