World Community Grid - View Thread - Server maintenance??

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Server maintenance??


Thread Status: Active
Total posts in this thread: 79


This topic has been viewed 5332 times and has 78 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

... in order to ensure high performance over the next several years ...

Statements like this mean a lot to me; they give assurance of IBM's commitment to our efforts! Well said, Kevin biggrin

[Jan 10, 2013 5:08:29 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

Here's the latest tasks-per-day chart update, the daily version, for a visualization of the last 6 months: http://bit.ly/WCGTPD . Something set this off, and from the outside it's hard to see what did or why.

Things ran quite nicely for a while, and then a science came back out of intermittent [adding not 100 results/day... gradual ramp-up selected] and the whole performance went pear-shaped. It popped me right back to the Alien nation search project info, where they install the latest and greatest BOINC server software versions first to give any changes production-environment exposure [do they really?] AND where they posted a few months ago that they would size up the tasks fourfold because the schedulers could no longer handle the speed of the clients: too many timeouts [yes, they actually posted that]. I would be worried, but it's not mine to worry about.

[Jan 10, 2013 5:30:22 PM]

jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding - Phase 2

180 day badge for Help Fight Childhood Cancer

180 day badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

180 day badge for Computing for Sustainable Water

2 year badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project

180 day badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: Server maintenance??

Statements like this means a lot to me

IMHO statements like this mean a lot to anyone interested in the future direction of WCG, and are of interest to the WCG (forum) community at large. Whilst it may be a response to the original topic, it's surely important enough to deserve its own thread? Or maybe all posts by WCG staff could be duplicated: in the original thread and in a separate read-only WCG staff thread? Or a WCG blog?

----------------------------------------

To Join follow this link: Join the UK Team All Welcome! UK Team thread

[Jan 10, 2013 5:39:25 PM]

dango
Senior Cruncher
Joined: Jul 27, 2009
Post Count: 307
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for The Clean Energy Project

2 year badge for Help Fight Childhood Cancer

2 year badge for Help Cure Muscular Dystrophy - Phase 2

180 day badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

2 year badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: Server maintenance??

We have started things up again. Unfortunately we didn't get as much performance increase as we hoped. We have ordered additional memory for the servers which we expect to get installed in the next 2-3 weeks. -- knreed [Jan 9, 2013 8:45:26 PM]

I have reserved memory-hardware within my arm's reach as a contingency, just in case my machines need it. I have a dozen of those memory-sticks. The specs of that memory are amazing: 4-TBytes per stick, DDR-version-10, octo-channel, 1.2-Terahertz, 1-mWatt power-consumption. Just give me a call if you need some... laughing

Seriously now, what prospects are we looking at for ramping up things back to normal operational-speed system-wide -- 2 to 3 weeks? thinking

;
; andzgridPost#799
;

I've tweaked a number of things and the backlog for hcc1 validation is now only 47,000 workunits and getting smaller. The processes that delete files on the server after archiving the good results and the processes that delete records from the database are ~1,000,000 records behind but also catching up after the changes just made. The database itself is about 158GB and it is running on 64GB of RAM. We will be increasing the database server to 128GB of RAM.

The storage for the database server is on a shared SAN device with 15k RPM disks in a RAID 5 config. We will be looking things over in detail over the next few months to determine if we should get our own dedicated SAN device and put the databases on SSD drives in order to ensure high performance over the next several years. We will also be looking at replacing the db servers with more powerful servers that can have up to 256GB (or more) of RAM. However, these types of bigger hardware purchases take much more time to review, approve and install (i.e. many months).
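
To put the quoted memory figures in perspective, here is a rough sketch of how much of a 158GB database could sit in RAM at each of the sizes mentioned. The function name is made up for illustration, and the result is only an upper bound: it assumes all RAM is available for caching, which ignores the operating system and the database engine's own overhead.

```python
def max_fraction_in_ram(db_size_gb: float, ram_gb: float) -> float:
    """Upper bound on the share of the database that could be held in RAM.

    Hypothetical helper: assumes every byte of RAM is usable as cache,
    which a real database server will not achieve.
    """
    return min(1.0, ram_gb / db_size_gb)


DB_SIZE_GB = 158  # database size quoted in the post

# RAM sizes quoted in the post: current, planned, and possible future servers
for ram_gb in (64, 128, 256):
    share = max_fraction_in_ram(DB_SIZE_GB, ram_gb)
    print(f"{ram_gb:>3} GB RAM: at most {share:.0%} of the database in memory")
```

On these numbers the planned upgrade would move the ceiling from well under half of the database to most of it, and only the 256GB class of server could hold all of it.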

knreed, WCG should order an IBM Power p770 for the databases. I'm the admin of two of these p770s (1TB RAM) with a 700GB Oracle DB (running on AIX 6.1) and 11 CPU cores, and they run very well. On the other hand, you could do some tuning of the GPFS configuration cool

----------------------------------------
[Edit 1 times, last edit by dango at Jan 10, 2013 8:16:32 PM]

[Jan 10, 2013 8:15:52 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

Thanks for the response, knreed. It was a reassuring read for me.

About the down/up-loading with the WCG-servers: minutes ago I had a fast and trouble-free sync, like my other sync-sessions over the past 18hrs or so. I no longer see the BOINC-message "Project is temporarily shut down for maintenance" that I saw a number of times before.
;
; andgridPost#802
;

----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 10, 2013 11:00:55 PM]

[Jan 10, 2013 10:58:00 PM]

shanen
Cruncher
Joined: Nov 17, 2004
Post Count: 27
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

90 day badge for Help Cure Muscular Dystrophy

45 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

14 day badge for Influenza Antiviral Drug Search

45 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Clean Water

90 day badge for Drug Search for Leishmaniasis

45 day badge for GO Fight Against Malaria

10 year badge for Mapping Cancer Markers

1 year badge for Microbiome Immunity Project


Re: Server maintenance??

The server has been unusually strained the last few days, but the glitch that bugs me most these days is when it sends me a bunch of work units with impossibly short deadlines. It was doing a lot of that last weekend... All of this stuff about deadlines seems pretty pointless from the donor's side. Sorry, but there's NO way they can predict my plans for when I'll be using a computer and how long it will be running.

Right now my proximate problem is with a new computer that is going to finish its only work unit while I'm at lunch, but for some reason it is refusing to download any new work units at all. That's a relatively unusual glitch and doesn't make any sense, since another computer I'm using seems to be having no problem getting fresh work units. (Neither of them is getting any credit for the completed work, however, which is more of a "normal glitch" state.)

[Jan 11, 2013 2:47:39 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

Those are client problems, not server problems. BOINC has, and always has had, a serious issue with highly variable "on" time [BOINC learns slowly]. I suggest trying 7.0.42 [rather stable for CPU crunchers and most GPU crunchers alike]. It is coded to fetch a new task minutes before the current one finishes, if you allow it to, even when the buffering is set to zero. I've yet to see it fail to fetch work in time, and I've been running this release while on wings for several weeks.

The only way to stop receiving short-deadline tasks is to actually increase your cache so that no result returns within 48 hours [and even then it takes the server a while longer to understand and kick a host out of the "reliable returner" state]. If a system is on for, say, 4 hours a day and a task lasts 10 hours, then that condition is met. The exceptions are Beta and HCC, which can have shortest deadlines of 1.6 and 2.8 days respectively. Deselecting these 2 ensures that the shortest deadline you'd ever see is 4 days (repair jobs get 40% of the original 10-day deadline for all other sciences). A web device-profile button to opt out of the shorties has been requested and does have merit, at least to me, since it avoids resorting to configuration contortions. It's as simple as "never regard this host as a [quick] returner", as in "never assume work will come back within 48 hours", which is one of the 2 main criteria for receiving short-deadline tasks.
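
For anyone who wants to check the arithmetic above, here is a minimal sketch. The 48-hour threshold, the 40% repair factor and the 10-day deadline are the figures from this post; the function names and the simple "hours on per day" model are assumptions for illustration only and say nothing about how the WCG scheduler is actually implemented.

```python
REPAIR_FRACTION = 0.40    # from the post: repair jobs get 40% of the original deadline
QUICK_RETURN_HOURS = 48   # from the post: work expected back within 48 hours


def repair_deadline_days(original_deadline_days: float) -> float:
    """Deadline of a repair job, given the deadline of the original task."""
    return original_deadline_days * REPAIR_FRACTION


def turnaround_hours(task_hours: float, on_hours_per_day: float) -> float:
    """Wall-clock hours to finish a task on a host that only runs part of each day.

    Simplified model (an assumption): the host crunches a fixed number of
    hours every day and works on nothing else.
    """
    return task_hours / on_hours_per_day * 24.0


def returns_within_threshold(task_hours: float, on_hours_per_day: float) -> bool:
    """True if the result would come back inside the 48-hour window."""
    return turnaround_hours(task_hours, on_hours_per_day) <= QUICK_RETURN_HOURS


print(repair_deadline_days(10))          # shortest deadline for the other sciences, in days
print(turnaround_hours(10, 4))           # the 4 hours/day, 10-hour task example, in hours
print(returns_within_threshold(10, 4))   # whether that host returns inside 48 hours
```

In the 4 hours/day example, a 10-hour task needs 2.5 days of wall-clock time, which is longer than 48 hours, so that host would not meet the quick-return criterion.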

Scanned logs on 3 hosts from last midnight. Not a single server bounce was recorded. Maybe I was lucky... touch wood.

[Jan 11, 2013 10:22:50 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

P.S. Regarding the "flaky" servers comment elsewhere: the previous 24 hours saw record validation throughput at WCG [to catch up on the backlog], over 2 million results for the first time.

Date | Total Run Time (y:d:h:m:s) | Points Generated | Results Returned
01/10/2013 | 475:135:18:19:27 | 1,108,497,733 | 2,021,243

dancing
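
As a rough illustration of what that daily figure means for the validators, the sketch below splits the stats row and converts the last number into an average per-second rate. The field names are assumptions (the row itself carries no labels); only the last field, the roughly 2 million results, is confirmed by the text of this post.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def parse_stats_row(row: str) -> dict:
    """Split one whitespace-separated daily stats row into named fields.

    The field names are hypothetical labels, not taken from the forum post.
    """
    date, run_time, points, results = row.split()
    return {
        "date": date,
        "run_time": run_time,
        "points": int(points.replace(",", "")),
        "results": int(results.replace(",", "")),
    }


def average_per_second(results_per_day: int) -> float:
    """Average number of results handled per second over a 24-hour day."""
    return results_per_day / SECONDS_PER_DAY


row = parse_stats_row("01/10/2013 475:135:18:19:27 1,108,497,733 2,021,243")
print(f"{average_per_second(row['results']):.1f} results per second on average")
```

This is an average over the whole day; the instantaneous rate while catching up on a backlog would be higher at some times and lower at others.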

[Jan 11, 2013 10:30:30 AM]

cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Project Badges:

90 day badge for Human Proteome Folding - Phase 2

90 day badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Sustainable Water

5 year badge for Outsmart Ebola Together

5 year badge for Microbiome Immunity Project

90 day badge for Africa Rainfall Project


Re: Server maintenance??

hhhmm... Seems we are having "Project is temporarily shut down for maintenance" issues again... sad

[EDIT] Validators are starting to back up also...

CJSL

Crunching for a better future...

----------------------------------------

I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team

----------------------------------------
[Edit 1 times, last edit by cjslman at Jan 11, 2013 11:06:30 PM]

[Jan 11, 2013 8:41:08 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server maintenance??

Even single-rep WUs like malaria are about 2.5 hours behind now. It's been climbing all day. An overall validator problem, I guess.
I also have a lot of cancer WUs where both reps have completed.

----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 11, 2013 11:01:59 PM]

[Jan 11, 2013 11:01:22 PM]
