World Community Grid Forums
Thread Status: Active Total posts in this thread: 79
|
Author |
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
(... in order to ensure high performance over the next several years ...) Statements like this mean a lot to me; they give assurance of IBM's commitment to our efforts! Well said, Kevin. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Here's the latest update of the tasks-per-day chart, the daily version, visualizing the last 6 months: http://bit.ly/WCGTPD . Something set this off, and it's hard to see from the outside what it was or why.
Things ran quite nicely for a while, and then a science came back out of intermittent [adding not 100 results/day... gradual ramp-up selected] and the whole performance went pear-shaped. It took me right back to the Alien nation search project info, where they install the latest and greatest BOINC server software versions first, to give any changes exposure in a production environment [do they really?], AND where they posted a few months ago about increasing the task size fourfold because the schedulers could no longer handle the speed of the clients, with too many timeouts [yes, they actually posted that]. I would be worried, but it's not mine to worry about. |
||
|
jonnieb-uk
Ace Cruncher England Joined: Nov 30, 2011 Post Count: 6105 Status: Offline |
"Statements like this mean a lot to me"
IMHO statements like this mean a lot to anyone interested in the future direction of WCG, and are of interest to the WCG (forum) community at large. Whilst it may be a response to the original topic, it's surely important enough to deserve its own thread? Or maybe all posts by WCG staff could be duplicated: once in the original thread and once in a separate read-only WCG staff thread? Or a WCG blog? |
||
|
dango
Senior Cruncher Joined: Jul 27, 2009 Post Count: 307 Status: Offline |
We have started things up again. Unfortunately we didn't get as much performance increase as we hoped. We have ordered additional memory for the servers which we expect to get installed in the next 2-3 weeks.
-- knreed [Jan 9, 2013 8:45:26 PM]

I have reserved memory hardware within arm's reach as a contingency, just in case my machines need it. I have a dozen of those memory sticks. The specs of that memory are amazing: 4 TBytes per stick, DDR version 10, octo-channel, 1.2 terahertz, 1 mWatt power consumption. Just give me a call if you need some...
Seriously now, what prospects are we looking at for ramping things back up to normal operational speed system-wide -- 2 to 3 weeks?
; ; andzgridPost#799 ;

I've tweaked a number of things and the backlog for hcc1 validation is now only 47,000 workunits and getting smaller. The processes that delete files on the server after archiving the good results and the processes that delete records from the database are ~1,000,000 records behind but also catching up after the changes just made.
The database itself is about 158GB and it is running on 64GB of RAM. We will be increasing the database server to 128GB of RAM. The storage for the database server is on a shared SAN device with 15k RPM disks in a RAID 5 config. We will be looking things over in detail over the next few months to determine if we should get our own dedicated SAN device and put the databases on SSD drives in order to ensure high performance over the next several years. We will also be looking at replacing the db servers with more powerful servers that can have up to 256GB (or more) of RAM. However, these types of bigger hardware purchases take much more time to review, approve and install (i.e. many months).

knreed, WCG should order IBM Power p770 servers for the databases.
I'm the admin of 2x p770 (1 TB RAM) running a 700 GB Oracle DB (on AIX 6.1) with 11 CPU cores, and they are running very well. Or, on the other hand, do some tuning of the GPFS configuration.
[Edit 1 times, last edit by dango at Jan 10, 2013 8:16:32 PM] |
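As a rough aside on the database figures quoted in the post above (a 158GB database on 64GB of RAM, with 128GB planned and 256GB under consideration), here is a minimal sketch of how much of the database could sit in memory at each size. It assumes, purely for illustration, that all server RAM is available for caching the database, which overstates things since the OS and the database engine need memory too. The function name `ram_coverage` is hypothetical.

```python
def ram_coverage(db_gb: float, ram_gb: float) -> float:
    """Fraction of the database that could be held in RAM (capped at 1.0).

    Assumption: every byte of RAM is usable as database cache, so this
    is an optimistic upper bound rather than a real cache hit rate.
    """
    if db_gb <= 0:
        raise ValueError("database size must be positive")
    return min(1.0, ram_gb / db_gb)


DB_SIZE_GB = 158  # database size quoted in the thread

for ram_gb in (64, 128, 256):  # current, planned, and possible future RAM
    share = ram_coverage(DB_SIZE_GB, ram_gb)
    print(f"{ram_gb} GB RAM could hold up to {share:.0%} of a {DB_SIZE_GB} GB database")
```

On these assumptions, the current configuration can keep well under half of the database in memory, which would fit with the heavy reliance on the shared SAN disks described in the post.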
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks for the response, knreed. It was a reassuring read for me.
About the down/uploading with the WCG servers: minutes ago I had a fast and trouble-free sync with the WCG servers, like my other sync sessions over the past 18 hours or so. I no longer see the BOINC message "Project is temporarily shut down for maintenance" that I saw a number of times before.
----------------------------------------
; ; andgridPost#802 ;
[Edit 1 times, last edit by Former Member at Jan 10, 2013 11:00:55 PM] |
||
|
shanen
Cruncher Joined: Nov 17, 2004 Post Count: 27 Status: Offline |
The server has been unusually strained the last few days, but the glitch that bugs me most these days is when it sends me a bunch of work units with impossibly short deadlines. It was doing a lot of that last weekend... All of this stuff about deadlines seems pretty pointless from the donor's side. Sorry, but there's NO way they can predict my plans for when I'll be using a computer and how long it will be running.
Right now my proximate problem is with a new computer that is going to finish its only work unit while I'm at lunch, but for some reason it is refusing to download any new work units at all. That's a relatively unusual glitch and doesn't make any sense, since another computer I'm using seems to be having no problem getting fresh work units. (Neither of them is getting any credit for the completed work, however, which is more of a "normal glitch" state.) |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Those are client problems, not server problems. BOINC has a serious issue, and always has had, with highly variable "on" time [BOINC learns slowly]. I suggest trying 7.0.42 [rather stable for CPU crunchers and most GPU crunchers alike]. It is coded to fetch a new task minutes before the current one finishes, if you allow it to, even when the buffering is set to zero. I have yet to see it fail to fetch work in time, and I have been running this release while on wings for several weeks.
The only way to stop receiving short-deadline tasks is to actually increase your cache, so that no result is returned within 48 hours [and even then it takes the server a while to understand and kick a host out of the "reliable returner" state]. If a system is on for, say, 4 hours a day and a task lasts 10 hours, then that condition is met. The exceptions are Beta and HCC, which can have shortest deadlines of 1.6 and 2.8 days respectively. Deselecting these 2 ensures that the shortest deadline you'd ever see is 4 days (repair jobs get 40% of the original 10-day deadline for all other sciences). A web device profile button to opt out of the shorties has been requested and does have merit, at least to me, since it would avoid having to resort to configuration contortions. It's as simple as "never regard this host as a [quick] returner", as in "never assume work will come back within 48 hours", which is one of the 2 main criteria for receiving short-deadline tasks. I scanned the logs on 3 hosts from last midnight. Not a single server bounce was recorded. Maybe I was lucky... touch wood. |
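The arithmetic in the post above can be sketched in a few lines. This is only an illustration of the numbers quoted there (48-hour return window, 10-day original deadline, 40% for repair jobs). The function names are made up for this example, and the simple model (a host crunches one task only while it is switched on) is an assumption, not how the WCG scheduler actually computes anything.

```python
QUICK_RETURN_WINDOW_HOURS = 48  # threshold quoted in the post


def turnaround_hours(task_hours: float, on_hours_per_day: float) -> float:
    """Wall-clock hours until a task comes back.

    Assumption: the host only crunches while it is on, and it is on
    for the same number of hours every day.
    """
    if not 0 < on_hours_per_day <= 24:
        raise ValueError("on_hours_per_day must be in (0, 24]")
    return task_hours / on_hours_per_day * 24


def looks_like_quick_returner(task_hours: float, on_hours_per_day: float) -> bool:
    """True if results would come back inside the 48-hour window."""
    return turnaround_hours(task_hours, on_hours_per_day) <= QUICK_RETURN_WINDOW_HOURS


def repair_deadline_days(original_deadline_days: float, fraction: float = 0.4) -> float:
    """Deadline of a repair job as a fraction of the original deadline."""
    return original_deadline_days * fraction


# The example from the post: a 10-hour task on a host that is on 4 hours a day.
print(turnaround_hours(10, 4))           # 60.0 hours, i.e. 2.5 days
print(looks_like_quick_returner(10, 4))  # False: slower than the 48-hour window
print(repair_deadline_days(10))          # 4.0 days
```

A host that runs around the clock would return the same 10-hour task well inside 48 hours, which is why such hosts keep being picked for the short-deadline work.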
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
P.S. Regarding the servers' "flaky" comment elsewhere: over the previous 24 hours, validation throughput at WCG [catching up on the backlog] set a record, exceeding 2 million results for the first time.
01/10/2013 475:135:18:19:27 1,108,497,733 2,021,243 |
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
Hmm... It seems we are having "Project is temporarily shut down for maintenance" issues again...
[EDIT] Validators are starting to back up as well...
----------------------------------------
CJSL
Crunching for a better future...
----------------------------------------
[Edit 1 times, last edit by cjslman at Jan 11, 2013 11:06:30 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Even single-rep WUs like malaria are about 2.5 hours behind now. It's been climbing all day. An overall validator problem, I guess.
I also have a lot of cancer WUs for which both reps have completed.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 11, 2013 11:01:59 PM] |
||
|
|