World Community Grid Forums
Thread Status: Active Thread Type: Sticky Thread Total posts in this thread: 427
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 983 Status: Offline
Next potential issue: Keep an eye on the OPNG credit. It seems to have dropped down to OPN1 levels. OPNG points were usually 500-600 (BOINC points), but the one I just crunched only claimed 105.6 BOINC points. I'll wait and see what my wingman claims, and what the granted credit ends up being.

Example from my very slow GeForce GTX 660M GPU, on the Windows 8.1 computer: https://www.worldcommunitygrid.org/contribution/workunit/152869783

Edit: Sure, that task only had 14 "jobs", but 105.6 is really low for an OPNG task.

So far I've only had one OPNG task (which was requested at about 03:22 UTC today but didn't finish downloading until about 07:45 UTC...) and it looks about right regarding credit.

It was OPNG_0155739_00367_0, which has a receptor/target I've not seen before (7los_A-001--TYR268_inert). It had 14 jobs with ligands varying from 22 to 30 atoms in size and having 8, 9 or 10 branches -- those numbers of branches make for longer-running jobs within a WU.

It ran on a GTX 1660 Ti and consumed 136 seconds CPU, 168 seconds elapsed. It didn't get a wingman (because of adaptive replication). My system asked for 58.6 credit (clients will under-claim for OPNG!) and it got assigned 1007.1 credit (which seems reasonably generous based on past figures!...)

By the way, unless there's been some major change, OPNG credit is computed by WCG based on the estimated difficulty of the jobs in a WU and how many iterations(?) the GPU took to resolve each docking set.

Cheers - Al.

P.S. file transfer issues are still a pain... I have some at seven hours and counting, but that's for elsewhere...
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 303 Status: Offline
Representative Event Log entries for problem I posted earlier:
8/19/2022 6:50:38 AM | World Community Grid | Started download of 869c4f9f2adffc5257712744e0a3d549.zip
8/19/2022 6:50:38 AM | World Community Grid | [file_xfer] URL: https://download.worldcommunitygrid.org/boinc...ffc5257712744e0a3d549.zip
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: Hostname download.worldcommunitygrid.org was found in DNS cache
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: Trying 199.241.167.118:443...
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: Connected to download.worldcommunitygrid.org (199.241.167.118) port 443 (#250)
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: schannel: disabled automatic use of client certificate
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: ALPN: offers http/1.1
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Info: ALPN: server accepted http/1.1
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: GET /boinc/download/239/869c4f9f2adffc5257712744e0a3d549.zip HTTP/1.1
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: Host: download.worldcommunitygrid.org
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.20.2)
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: Accept: */*
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: Accept-Encoding: deflate, gzip
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: Accept-Language: en_US
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server:
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <nbytes>9132.000000</nbytes>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <max_nbytes>0.000000</max_nbytes>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <status>0</status>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <persistent_file_xfer>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <num_retries>8</num_retries>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <first_request_time>1660894123.333477</first_request_time>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <next_request_time>0.000000</next_request_time>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <time_so_far>28.681662</time_so_far>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <last_bytes_xferred>107.000000</last_bytes_xferred>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <is_upload>0</is_upload>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: </persistent_file_xfer>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <file_xfer>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <bytes_xferred>0.000000</bytes_xferred>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <file_offset>0.000000</file_offset>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <xfer_speed>0.000000</xfer_speed>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: <url>https://download.worldcommunitygrid.org/boinc...4e0a3d549.zip</url>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: </file_xfer>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: </file_transfer>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: </file_transfers>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server: </boinc_gui_rpc_reply>
8/19/2022 6:50:38 AM | World Community Grid | [http] [ID#223] Sent header to server:
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Info: Mark bundle as not supporting multiuse
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Info: HTTP 1.0, assume close after body
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: HTTP/1.0 503 Service Unavailable
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: Cache-Control: no-cache
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: Connection: close
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: Content-Type: text/html
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server:
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: <html><body><h1>503 Service Unavailable</h1>
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: No server is available to handle this request.
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Received header from server: </body></html>
8/19/2022 6:50:40 AM | | [http_xfer] [ID#223] HTTP: wrote 107 bytes
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Info: schannel: server closed the connection
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Info: Closing connection 250
8/19/2022 6:50:40 AM | World Community Grid | [http] [ID#223] Info: schannel: shutting down SSL/TLS connection with download.worldcommunitygrid.org port 443
8/19/2022 6:50:41 AM | World Community Grid | [file_xfer] http op done; retval -184 (transient HTTP error)
8/19/2022 6:50:41 AM | World Community Grid | [file_xfer] file transfer status -184 (transient HTTP error)
8/19/2022 6:50:41 AM | World Community Grid | Temporarily failed download of 869c4f9f2adffc5257712744e0a3d549.zip: transient HTTP error
8/19/2022 6:50:41 AM | World Community Grid | Backing off 05:45:47 on download of 869c4f9f2adffc5257712744e0a3d549.zip

-Bruce
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2171 Status: Offline
Next potential issue: Keep an eye on the OPNG credit. It seems to have dropped down to OPN1 levels. OPNG points were usually 500-600 (BOINC points), but the one I just crunched only claimed 105.6 BOINC points. I'll wait and see what my wingman claims, and what the granted credit ends up being.

So far I've only had one OPNG task [..] and it looks about right regarding credit. It ran on a GTX 1660 Ti [..]. My system asked for 58.6 credit (clients will under-claim for OPNG!) and it got assigned 1007.1 credit [..]

Yep, sounds about right, Al. Looking through my logfile, I see 95 OPNG-tasks have been validated. After selecting three tasks by hand, this is what comes up:

$ wcgstats -w -rrr -= OPNG_0154244_00575_1

*** What you see here is that my first task claims 58.6 credits and is granted 1001.5 credits, the second task on its own claims 4.5 credits and gets 860.9, while the third one claims 1.3 credits and gets 995.1, also without a wingman. ***

So this seems to be quite a normal pattern: claiming a low amount, being granted a fair amount.

Adri

[Edit 4 times, last edit by adriverhoef at Aug 19, 2022 11:56:43 AM]
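To put numbers on the "claim low, get granted a fair amount" pattern, here is a minimal sketch that computes the granted-to-claimed ratio for the three tasks quoted above. The labels and the `grant_ratio` helper are illustrative names only; the credit figures are the ones from the post, and nothing here reflects how WCG actually computes OPNG credit.

```python
# Claimed and granted credit for the three OPNG tasks quoted in the post above.
# The labels are illustrative; the post does not tie each figure to a task name.
tasks = [
    ("first", 58.6, 1001.5),
    ("second", 4.5, 860.9),
    ("third", 1.3, 995.1),
]


def grant_ratio(claimed, granted):
    """Return how many times larger the granted credit is than the claim."""
    return granted / claimed


ratios = {label: grant_ratio(claimed, granted) for label, claimed, granted in tasks}

for label, ratio in ratios.items():
    print(f"{label}: granted/claimed = {ratio:.1f}")
```

The ratios differ by more than an order of magnitude between tasks while the granted credit stays in a narrow band, which fits the observation that the client's claim says little about what an OPNG task will be granted.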
||
|
KAMasud
Cruncher Joined: Nov 18, 2006 Post Count: 20 Status: Offline
8/19/2022 3:27:54 PM | GPUGRID | Project requested delay of 31 seconds
8/19/2022 3:38:57 PM | World Community Grid | Started download of wcgrid_arp1_wrf_7.32_windows_x86_64
8/19/2022 3:38:57 PM | World Community Grid | Started download of arp1_image02_7.32.tga
8/19/2022 3:39:01 PM | World Community Grid | Temporarily failed download of wcgrid_arp1_wrf_7.32_windows_x86_64: transient HTTP error
8/19/2022 3:39:01 PM | World Community Grid | Backing off 00:33:52 on download of wcgrid_arp1_wrf_7.32_windows_x86_64
8/19/2022 3:39:01 PM | World Community Grid | Temporarily failed download of arp1_image02_7.32.tga: transient HTTP error
8/19/2022 3:39:01 PM | World Community Grid | Backing off 00:51:14 on download of arp1_image02_7.32.tga
8/19/2022 4:12:32 PM | World Community Grid | Started download of arp1_image04_7.32.tga
8/19/2022 4:12:32 PM | World Community Grid | Started download of arp1_image07_7.32.tga
8/19/2022 4:12:36 PM | World Community Grid | Temporarily failed download of arp1_image04_7.32.tga: transient HTTP error
8/19/2022 4:12:36 PM | World Community Grid | Backing off 00:05:09 on download of arp1_image04_7.32.tga
8/19/2022 4:12:36 PM | World Community Grid | Temporarily failed download of arp1_image07_7.32.tga: transient HTTP error
8/19/2022 4:12:36 PM | World Community Grid | Backing off 00:07:27 on download of arp1_image07_7.32.tga
8/19/2022 4:12:36 PM | World Community Grid | Started download of ARP1_0024378_130_ARP1_0024378_130.input
8/19/2022 4:12:36 PM | World Community Grid | Started download of 2567cbbbeeafd709a0fc44bf311f236e.
8/19/2022 4:12:40 PM | World Community Grid | Finished download of ARP1_0024378_130_ARP1_0024378_130.input
8/19/2022 4:12:40 PM | World Community Grid | Started download of d6e8192344528f5bbcc62875e2125a48.
8/19/2022 4:12:43 PM | World Community Grid | Temporarily failed download of d6e8192344528f5bbcc62875e2125a48.: transient HTTP error
8/19/2022 4:12:43 PM | World Community Grid | Backing off 00:07:43 on download of d6e8192344528f5bbcc62875e2125a48.
8/19/2022 4:12:43 PM | World Community Grid | Started download of dcc3bd5ae0338bea45b2099aa785b82d.
8/19/2022 4:12:47 PM | World Community Grid | Temporarily failed download of dcc3bd5ae0338bea45b2099aa785b82d.: transient HTTP error
8/19/2022 4:12:47 PM | World Community Grid | Backing off 00:06:50 on download of dcc3bd5ae0338bea45b2099aa785b82d.
8/19/2022 4:12:47 PM | World Community Grid | Started download of 461ee780af6f207e4925d239791b21f0.
8/19/2022 4:12:51 PM | World Community Grid | Temporarily failed download of 461ee780af6f207e4925d239791b21f0.: transient HTTP error
8/19/2022 4:12:51 PM | World Community Grid | Backing off 00:07:42 on download of 461ee780af6f207e4925d239791b21f0.
8/19/2022 4:12:51 PM | World Community Grid | Started download of 8b67dd136d17b8acef5182a390c9f2fc.7z
8/19/2022 4:12:55 PM | World Community Grid | Temporarily failed download of 8b67dd136d17b8acef5182a390c9f2fc.7z: transient HTTP error
8/19/2022 4:12:55 PM | World Community Grid | Backing off 00:06:34 on download of 8b67dd136d17b8acef5182a390c9f2fc.7z
8/19/2022 4:12:55 PM | World Community Grid | Started download of opn1.AD_reactive_v.0.99.dat
8/19/2022 4:12:59 PM | World Community Grid | Temporarily failed download of opn1.AD_reactive_v.0.99.dat: transient HTTP error
8/19/2022 4:12:59 PM | World Community Grid | Backing off 00:04:59 on download of opn1.AD_reactive_v.0.99.dat
8/19/2022 4:14:22 PM | World Community Grid | Finished download of 2567cbbbeeafd709a0fc44bf311f236e.
8/19/2022 4:14:22 PM | World Community Grid | Started download of 2102bdfdb3a6276b2a6592c20e80193d.7z
8/19/2022 4:14:22 PM | World Community Grid | Started download of c9f8115b297f493662823636a05f3075.7z
8/19/2022 4:14:25 PM | World Community Grid | Temporarily failed download of 2102bdfdb3a6276b2a6592c20e80193d.7z: transient HTTP error
8/19/2022 4:14:25 PM | World Community Grid | Backing off 00:05:34 on download of 2102bdfdb3a6276b2a6592c20e80193d.7z
8/19/2022 4:14:26 PM | World Community Grid | Started download of wcgrid_arp1_wrf_7.32_windows_x86_64
8/19/2022 4:14:26 PM | World Community Grid | Temporarily failed download of c9f8115b297f493662823636a05f3075.7z: transient HTTP error
8/19/2022 4:14:26 PM | World Community Grid | Backing off 00:05:11 on download of c9f8115b297f493662823636a05f3075.7z
8/19/2022 4:14:27 PM | World Community Grid | Started download of arp1_image08_7.32.tga
8/19/2022 4:14:30 PM | World Community Grid | Temporarily failed download of wcgrid_arp1_wrf_7.32_windows_x86_64: transient HTTP error
8/19/2022 4:14:30 PM | World Community Grid | Backing off 01:53:46 on download of wcgrid_arp1_wrf_7.32_windows_x86_64
8/19/2022 4:14:31 PM | World Community Grid | Temporarily failed download of arp1_image08_7.32.tga: transient HTTP error
8/19/2022 4:14:31 PM | World Community Grid | Backing off 00:04:36 on download of arp1_image08_7.32.tga
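For anyone who wants to tally event log excerpts like the one above rather than read them entry by entry, here is a minimal sketch. It assumes one entry per line in the "date time | project | message" layout shown in this thread, and it only recognises the two message wordings that appear in the posted log; `summarize` and the regex names are illustrative and not part of BOINC.

```python
import re
from collections import Counter

# Message wordings copied from the event log excerpts in this thread.
FAILED = re.compile(r"Temporarily failed download of (.+): (.+)$")
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (.+)$")


def summarize(log_lines):
    """Return (failure count per file, most recent back-off in seconds per file)."""
    failures = Counter()
    backoff_seconds = {}
    for line in log_lines:
        # Assumed layout: "date time | project | message".
        parts = line.strip().split(" | ", 2)
        if len(parts) != 3:
            continue  # skip anything that does not match the assumed layout
        message = parts[2]
        failed = FAILED.match(message)
        if failed:
            failures[failed.group(1)] += 1
            continue
        backed = BACKOFF.match(message)
        if backed:
            hours, minutes, seconds = (int(backed.group(i)) for i in (1, 2, 3))
            backoff_seconds[backed.group(4)] = hours * 3600 + minutes * 60 + seconds
    return failures, backoff_seconds


# A few entries taken from the log above.
sample = [
    "8/19/2022 3:39:01 PM | World Community Grid | Temporarily failed download of wcgrid_arp1_wrf_7.32_windows_x86_64: transient HTTP error",
    "8/19/2022 3:39:01 PM | World Community Grid | Backing off 00:33:52 on download of wcgrid_arp1_wrf_7.32_windows_x86_64",
    "8/19/2022 4:12:40 PM | World Community Grid | Finished download of ARP1_0024378_130_ARP1_0024378_130.input",
    "8/19/2022 4:14:30 PM | World Community Grid | Temporarily failed download of wcgrid_arp1_wrf_7.32_windows_x86_64: transient HTTP error",
    "8/19/2022 4:14:30 PM | World Community Grid | Backing off 01:53:46 on download of wcgrid_arp1_wrf_7.32_windows_x86_64",
]

failures, backoff_seconds = summarize(sample)
for name, count in failures.items():
    print(f"{name}: {count} failures, latest back-off {backoff_seconds.get(name)} s")
```

Keeping only the latest back-off per file is a deliberate choice: it makes it easy to see the retry interval climbing for a file that keeps failing, as it does for the ARP1 executable here.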
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 277 Status: Offline
The pattern I'm seeing is that a client will connect, then, if lucky, it might successfully download 1-4 files before further downloads time out. The size of the files matters little; it will happily hang on a sub-1K file.
|
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 983 Status: Offline
Regarding file download issues - I see others are trying to determine a possible cause, so I'll just comment on effects :-)
At least when there was only one big burst of jobs a day it was possible to encourage transfers fairly quickly (if one was aware of the issue and willing to poke the client transfer mechanism...) as the system would soon become more responsive; however, now that there seems to be a more constant stream of work (and, presumably, more of it) long delays in downloading present new issues.

On one of my larger systems, it had been 7 hours since it last managed to request work, and I have only just persuaded the last two files to download, unlocking two tasks (which has resulted in showing up another issue!); on another, there are still four tasks waiting for one file each after 4 hours (not quite so bad, but...)

One issue is that whilst transfers are marked as stalled no further work can be requested, and by the time these tasks have finally downloaded most of the work I was sent will have been completed (apart from ARP1!), so it will end up asking for more tasks all at once and it'll promptly go into "can't download" mode again. Not good, especially if the system is unattended at the time, as the retry interval will go up and up...

Another issue is that if all the tasks for a given project finish before there are any replacement tasks, it will probably clear out some of the data files that are shared between tasks - this is most obvious with MCM1, where the file mcm1.dataset-sarc1.txt is currently being re-downloaded (all 102MB of it!) because I'd returned all the existing MCM1 tasks before the download backlog had cleared.

I suspect that if I could get my usual balance of tasks cached the problem would [mostly] disappear as requests would be for top-up rather than for a more or less complete refill. And I suspect there are other users in the same situation...

Ralf (TPCBF) is right in his observation earlier in this thread that files that would fit in a single packet [if it weren't for header stuff and other such overheads!] tend to be the real problem files; I have found that anything that claims to have fetched a small portion before backing off is more likely to fail again on a retry, so that also includes some files up to several hundred kilobytes; OPN1/OPNG gets hit quite hard by this, as does MCM1...

One trick I've found that sometimes works is that if I see one of those files failing then if I immediately hit retry on that specific file it downloads! As I've got a convenient backlog of short files, I've just tested that hypothesis on them all; a couple of them needed prodding more than once, but I've cleared them all out. Note that the technique is nowhere near as effective for dealing with the slightly larger (but not megabytes) files...

However, it's not really server-friendly to keep mashing the Retry button, so a fix is obviously preferable!

Cheers - Al.

P.S. excessively large work caches do not constitute a work-around; too many tasks are ending up with "No Reply" retries already...

[Edited to correct the comment about OPNG receptor files...]

[Edit 1 times, last edit by alanb1951 at Aug 19, 2022 1:20:15 PM]
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2209 Status: Offline
<snip to keep the msg shorter>
What you see here, is that my first task claims 58.6 credits and is granted 1001.5 credits, the second task on its own claims 4.5 credits and gets 860.9, while the third one claims 1.3 credits and gets 995.1, also without a wingman. *** So this seems to be quite a normal pattern. Claiming a low amount, being granted a fair amount. Adri |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7697 Status: Offline
All of the files on my Windows 7 machine managed to finish downloading overnight. It appeared it was mostly the MCM small files which seemed to hang the most. However, on my Linux machine all of the 11 MCM work units ended with an error:
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>mcm1.dataset-sarc1.txt</file_name>
<error_code>-200 (wrong size)</error_code>
</file_xfer_error>

There was one OPN1 unit on the Linux machine which finally downloaded and is currently running.

Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1957 Status: Offline
Ralf (TPCBF) is right in his observation earlier in this thread that files that would fit in a single packet [if it weren't for header stuff and other such overheads!] tend to be the real problem files; I have found that anything that claims to have fetched a small portion before backing off is more likely to fail again on a retry, so that also includes some files up to several hundred kilobytes; OPN1/OPNG gets hit quite hard by this, as does MCM1...

Well, for one, I don't think you can just kill that 102MB mcm1.dataset***.txt file, as it might be used for more than one WU.

But when I got a couple of hundred WUs overnight on at least a dozen different hosts, I could see this again. Larger files, like the above-mentioned dataset file, downloaded without an apparent hitch at up to 2MB/sec. But over and over again, those <1K files get stuck at 107 bytes and may (or may not) download if you torture the [Retry Now] button.

Also, a lot of people posted about 5xx HTTP errors, like 503 Service Unavailable and the like. Those are not directly "bandwidth" related issues, but server hardware/OS related issues, where the server (cluster) just can't keep up with the requests.

Ralf

PS: I have been programming for 46 years, doing networking for 35+ years and sys admin type work for 25+ years, so I think I know at least a little bit about the technical side of all this stuff..
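A quick check on the 107-byte figure: the event log posted earlier in this thread shows the body of the 503 response as three lines, followed by "HTTP: wrote 107 bytes". The sketch below adds up the size of that body, assuming each of the three lines ends in a single newline character (the log does not show line endings, so that part is an assumption).

```python
# Body lines of the 503 response, copied from the event log earlier in the thread.
body_lines = [
    "<html><body><h1>503 Service Unavailable</h1>",
    "No server is available to handle this request.",
    "</body></html>",
]

# Assumption: each line is terminated by a single "\n"; the log does not show this.
body = "".join(line + "\n" for line in body_lines)
body_size = len(body.encode("ascii"))

print(body_size)
```

Under that assumption the total comes to 107 bytes, the same number the client reports writing, which suggests the transfers that "get stuck at 107 bytes" may have received the server's error page rather than the first 107 bytes of the requested file. That would fit the point above that these are server-side failures rather than bandwidth problems.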
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7697 Status: Offline
Well, for one, I don't think you can just kill that 102MB mcm1.dataset***.txt file, as it might be used for more than one WU.

I think that is absolutely correct. If you have a continuous stream of MCM, this file only downloads once, whereas if you run out of MCM work it will download again when the MCM work resumes.

Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|