World Community Grid Forums
Thread Status: Active | Total posts in this thread: 143
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
So the data that you access when you upload and download files sits on a clustered file system. The maintenance window yesterday was scheduled to install the latest kernel on the servers. We completed all the servers associated with our databases, load balancing and website with no issue. We updated the first server associated with this file system with no issue.

However, after rebooting the second server, it marked its disks as 'unrecovered'. The clustered file system has a mechanism for recovering and restoring normal operations, but a second issue is causing that process to run at a much slower pace. We are talking to third-level support for the clustered file system software to find out if there is a faster way to run the recovery utility. We do not expect any loss of data, but the utility is extremely careful, which makes it very slow to run.
[Edit 1 times, last edit by knreed at Jul 19, 2017 11:26:08 AM]
duanebong
Advanced Cruncher | Singapore | Joined: Apr 25, 2009 | Post Count: 134 | Status: Offline
"and there is no setting to control it in BOINC's preferences"

"There is, I think: if you tick the 'Show advanced...' box, you get to set the space or percent."

"You can set the max used storage space, but it doesn't download extra WUs. If that's what this setting is for, it doesn't work."

Same for me. In the advanced settings we can control the minimum remaining disk space BOINC will leave free on the phone, and also the maximum amount of space BOINC is allowed to consume. But there is no way to control how many days of WUs Android phones keep in their buffer. Generally the phone crunches 1 WU and then downloads 1 or 2 extra WUs in reserve. It adjusts for the number of cores you've allowed BOINC to use, so if you have an 8-core phone and allow 2 cores to be used for crunching, the phone will have a total of 4-6 WUs downloaded (working on 2 WUs + another 2-4 in reserve). It would be useful to be able to control this more - not just to cover server outages, but sometimes you could be on the road with no Wi-Fi access. It would be nice to pre-load the buffer before leaving the house.
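For what it's worth, the desktop BOINC client does expose these buffer depths through a local preference override file. A minimal sketch of a `global_prefs_override.xml`, assuming a stock client (the day values are examples, and the Android client may not honour this file):

```xml
<!-- global_prefs_override.xml, placed in the BOINC data directory (example values) -->
<global_preferences>
   <work_buf_min_days>1.0</work_buf_min_days>               <!-- keep at least ~1 day of work queued -->
   <work_buf_additional_days>2.0</work_buf_additional_days> <!-- fetch up to ~2 extra days on top -->
</global_preferences>
```

Restarting the client (or telling it to re-read its config files) applies the override, and the scheduler then requests enough work to fill the combined buffer.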
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
"So the data that you access when you upload and download files sits on a clustered file system. The maintenance window yesterday was scheduled to install the latest kernel on the servers. We completed all the servers associated with our databases, load balancing and website with no issue. We updated the first server associated with this file system with no issue. However, after rebooting the second server, it marked its disks as 'unrecovered'. The cluster file system has a mechanism for recovering and restoring normal operations, but there was a second issue that is causing that process to run at a much slower pace. We are working on talking to 3rd layer support for the clustered file system software to find out if there is a faster way that we can run the recovery utility. We do not expect any loss of data, but the utility is extremely careful which makes it very slow in running."

I used KSplice for a long time until Oracle made it proprietary, but Ubuntu now has reboot-less kernel updating too.
[Edit 1 times, last edit by SekeRob* at Jul 19, 2017 11:43:15 AM]
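For the Ubuntu route mentioned above, Canonical's Livepatch service applies kernel security fixes without a reboot. A sketch of the usual setup on a supported Ubuntu release with snapd; the token is a placeholder you obtain from your Ubuntu account:

```shell
# Install the Livepatch client (distributed as a snap)
sudo snap install canonical-livepatch

# Enable the service with your account token (placeholder shown)
sudo canonical-livepatch enable <YOUR-TOKEN>

# Check which live kernel patches are currently applied
canonical-livepatch status --verbose
```

Livepatch only covers security fixes between point releases; a full kernel version upgrade, like the one described in this thread, still needs the reboot.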
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
"and there is no setting to control it in BOINC's preferences"

"There is, I think: if you tick the 'Show advanced...' box, you get to set the space or percent."

"You can set the max used storage space, but it doesn't download extra WUs. If that's what this setting is for, it doesn't work. My unit had one active WU (only using one core) and it has been idle since it completed. Max storage space is set to 90% and there's plenty of free space, but it isn't being used for more WUs. I wonder if this is a bug?"

The space in which BOINC is installed by default is limited. I think they're working towards a new release that removes some overdone controls. /OT
mmonnin
Advanced Cruncher | Joined: Jul 20, 2016 | Post Count: 148 | Status: Offline
"Also, why do maintenance in the MIDDLE of the week? Unless it was an absolute emergency, maintenance should wait until WEEKENDS."

"Deferring planned system maintenance to the weekend makes perfect sense for an enterprise that runs at most five days a week with reduced usage during the weekend; doing so minimizes the impact by affecting only the small number of weekend workers. With an operation that runs 24/7 with users around the globe, it makes no sense to defer planned work to any specific day. When the work comes up on the schedule and the manpower to do it is available, it makes sense to do it during the staff's regular work day, because there is no time of reduced use."

This project may run 24/7, but that doesn't mean the admins work 24/7. The people actually doing the upgrade, I'm guessing, do not have full-staff 24/7 coverage. Tuesday is the typical upgrade day of the week - as an example, Microsoft releases patches on Tuesdays. Monday is a day to clean up anything from the weekend, when most do not work. Patch on Tuesday so you have 3-4 days to fix, roll back or verify the patch before the next weekend.
Dark Angel
Veteran Cruncher | Australia | Joined: Nov 11, 2005 | Post Count: 721 | Status: Offline
aaaand it's still broke
Currently being moderated under false pretences
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Ouch, all my Valid workunits uploaded at or after 14:21:56 UTC 18 July have turned Invalid. More tidying up to do, techs!
gb009761
Master Cruncher | Scotland | Joined: Apr 6, 2005 | Post Count: 2982 | Status: Offline
"Ouch, all my Valid workunits uploaded at or after 14:21:56 UTC 18 July have turned Invalid. More tidying up to do, techs!"

Yes, same here - so much for the statement that there'll be no loss of data! Bang goes my machines' reliability status 😫
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
I don't like to be profane in a forum.

As an IT tech with 30+ years behind me, I've been wary of the dash to cloud technology. There's a lot to be said for spreading the risk of unavailability across multiple vendors, let alone multiple data centres, and if something is absolutely mission critical, as in this case, then having it in a hybrid environment, so that overall control and therefore availability is ensured, surely makes sense.

I appreciate that cluster replication is limited to wire speeds unless it is within a single virtualised host, but there's no noticeable loss of performance when my ZFS is resilvering after replacing a failed drive, and I'd expect the same of a well-implemented clustered filesystem. In this scenario I'd not expect failure of a single server to degrade performance, let alone to be fatal. I'd expect it to be more like an air crash investigation: very seldom is a whole fleet grounded after a single catastrophe, and the investigative emphasis is on preventing a recurrence and developing contingencies, not a dash to get airborne again.

Meanwhile I have discovered how to recover my Rosetta@home profile from the BOINC client config files, so I no longer have 40 idle cores!
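The ZFS drive-replacement flow mentioned above can be sketched as follows; the pool name `tank` and the device names are example values for illustration:

```shell
# Swap the failed disk for the new one; ZFS begins resilvering automatically
zpool replace tank da2 da5

# Watch progress - the 'scan:' line reports percent done, speed, and time remaining,
# while the pool stays online and serving reads/writes throughout
zpool status tank
```

Resilvering only copies live data rather than the whole device, which is part of why the performance impact on a lightly loaded pool is barely noticeable.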
kkwok
Cruncher | Joined: Nov 23, 2004 | Post Count: 5 | Status: Offline
Care to share how to add another project under BOINC client?