World Community Grid - View Thread - Hardware Recovery Update (old)

World Community Grid Forums

Category: Official Messages

Forum: News

Thread: Hardware Recovery Update (old)

Thread Status: Closed
Total posts in this thread: 196

This topic has been viewed 1728497 times and has 195 replies

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7670
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

200 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

100 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Redundancy is not a word that I like to use. I prefer Backup.
Years ago, I worked for an Insurance Company that had very heavy mainframe use and all data was very heavily backed up. They were OK with the few weeks it would take to replace hardware but, when they discovered that it would take 6 months to replace the air conditioning, they built an extra suite as backup.

Mike

Years ago I also worked as a mainframe operator. We always ran 2 sets of backups, one to keep on site and one to keep offsite. Each client got their own set of backup tapes, so if there was corruption affecting only one client, there would be a specific backup for them.
The stack of backup tapes would be about 4 ft (1.2 meters) high. Everything was on 9 track tapes. (We did have one 7 track drive for a specific customer for input, but never output anything on that drive.) The tapes for remote storage would get loaded in a vehicle and driven to a storage facility about 1 mile (1.6 km) away. They would then bring back the oldest set of backups to be re-entered into the pool of available tapes.
During the course of my employment there we never had to use any of the backups. In my opinion the one thing we were missing was a periodic check of the integrity of the tapes. When they were written, it was always with the verify option on, so maybe the powers that be figured it was not necessary to actually test any of the tapes brought back from storage.
Occasionally you would get a write error when producing a backup, and the solution was to just junk that tape, as I guess they were cheap enough.
So much for a trip down memory lane.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Mar 15, 2023 8:36:40 PM]

binventive
Cruncher
Joined: May 3, 2007
Post Count: 13
Status: Offline
Project Badges:

14 day badge for Nutritious Rice for the World

2 year badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for The Clean Energy Project - Phase 2

14 day badge for Drug Search for Leishmaniasis

14 day badge for GO Fight Against Malaria

50 year badge for Mapping Cancer Markers

14 day badge for Uncovering Genome Mysteries

180 day badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2

5 year badge for Microbiome Immunity Project

20 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Has any thought been given by Krembil to improving the reliability, stability, and overall production-readiness of the various components of the application and its associated hardware and network components? It seems that there is little redundancy overall (although I was pleased to hear that the drive(s) were configured via RAID, hopefully at least RAID level 1 (mirroring)). Also, when an issue occurs late in the week, apparently nobody is available to investigate or resolve it over the weekend. Overall, it feels like a configuration where errors and downtime are resolved without much sense of urgency. The instability and downtime lead to frustration and a lack of faith among at least some of us who donate our computer resources to supporting World Community Grid.

I am glad that there are alternative medical/biology-related distributed computing options available (e.g., Folding@Home and various BOINC-related projects such as Rosetta, TN-Grid, Denis@Home, and SiDock@Home).

----------------------------------------

[Mar 15, 2023 9:51:14 PM]

bcavnaugh
Cruncher
USA
Joined: Nov 8, 2013
Post Count: 13
Status: Offline
Project Badges:

180 day badge for The Clean Energy Project - Phase 2

20 year badge for Mapping Cancer Markers

1 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

20 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project

10 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

I hate to ask, but why is your system not on RAID 5, which would tolerate a disk failure?
At the very least, the main OS should be on RAID 1 and the database on RAID 5 or better.

----------------------------------------
[Edit 1 times, last edit by bcavnaugh at Mar 15, 2023 9:55:08 PM]

[Mar 15, 2023 9:53:36 PM]

Yavanius
Senior Cruncher
Antarctica
Joined: Jan 21, 2015
Post Count: 191
Status: Offline
Project Badges:

14 day badge for The Clean Energy Project - Phase 2

10 year badge for Mapping Cancer Markers

90 day badge for Outsmart Ebola Together

14 day badge for FightAIDS@Home - Phase 2

14 day badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project

45 day badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

I've been translating all the WCG timing into Reverse Scotty and it seems to track.

"How long will it take, WCG?"
"It should only be a one-day fix, but for you Cap'n, I can do it in three weeks!"

At this point, a temporal distortion taking us back a couple of weeks would come in handy.

Go back 4 weeks and plant the idea to do a test of backing up the project to a spare unit. So 2 weeks to work out the bugs and then BAM, the main server goes down...well, that's mightily convenient we were doing a test run of the backup system...

Then we'll just grouch instead of bringing pitchforks and torches.

[Mar 15, 2023 10:46:06 PM]

Yavanius
Senior Cruncher
Antarctica
Joined: Jan 21, 2015
Post Count: 191
Status: Offline
Project Badges:


Re: Hardware Recovery Update

Double post. Forum software doesn't check if you back up to the editor.

----------------------------------------
[Edit 1 times, last edit by Yavanius at Mar 15, 2023 10:49:28 PM]

[Mar 15, 2023 10:47:41 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Project Badges:

5 year badge for The Clean Energy Project - Phase 2

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

50 year badge for Outsmart Ebola Together

50 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project


Re: Hardware Recovery Update

I hate to ask, but why is your system not on RAID 5, which would tolerate a disk failure?
At the very least, the main OS should be on RAID 1 and the database on RAID 5 or better.

Because it's not a RAID problem, but a problem with accessing the controller in the first place. And one problem with RAID 5 is that if you have a degraded array, with today's drive sizes it takes forever to get back in sync.

While WCG was still under IBM, there was at some point a change to ZFS as the underlying file system. That didn't go quite smoothly (though it was only a couple of days' delay, IIRC), but it seems that the system currently deployed at WCG is running on a more "traditional" file system.

Ralf

----------------------------------------

[Mar 15, 2023 11:06:43 PM]

Robokapp
Senior Cruncher
Joined: Feb 6, 2012
Post Count: 249
Status: Offline
Project Badges:

180 day badge for Help Cure Muscular Dystrophy - Phase 2

180 day badge for Computing for Sustainable Water

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for Africa Rainfall Project


Re: Hardware Recovery Update

it seems that the system currently deployed at WCG is running

[Mar 15, 2023 11:45:52 PM]

Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 87
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

1 year badge for Help Fight Childhood Cancer

1 year badge for Help Cure Muscular Dystrophy - Phase 2

180 day badge for Computing for Clean Water

1 year badge for GO Fight Against Malaria

45 day badge for Computing for Sustainable Water

14 day badge for Outsmart Ebola Together

1 year badge for Microbiome Immunity Project

45 day badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Dear all - I really do appreciate your patience and your thoughts and suggestions. Indeed, many of them we have thought about and some of them we were able to implement. However, as posted before - we do not have the budget IBM was able to put into WCG - we are working on improving funding as we go.

Unfortunately, it was not "just" a disk failure - that would be fine - and it was not "only" the RAID controller failure - the bus failed -- so the whole storage unit failed.

We are grateful for your contribution of computing resources. Some of you may also be able to help WCG by introducing us to foundations or companies that may be interested in becoming a funding or contributing partner. We have tried and continue trying to establish some form of partnership with Dell, Lenovo, HPE, ... nVidia, AMD, Intel ... AWS, Google, -- that may help us (backend of WCG) but also projects (e.g., ARP) solve equipment challenges (and insufficient government funding for science). Cold calls do not get us far -- but opening the door may help tremendously.

Redundancy for safety and performance would be marvelous. When we were running the HCC project - funded by NIH - we had 4 backups at 3 geographically different locations on three different media. Call us paranoid - but there are many stories of individual media that looked great at the beginning, only to become obsolete very quickly.

Brief update from today:
Database filesystems are up and the data are backed up (tape). Databases are up on the temp-replacement DSS7000 storage system. Science remains down. We are going to create a copy of the science file system while we try to repair it. The RAID controllers seemed to not know about failed disks after restart or swap, and would rebuild RAIDs back to their original configuration.

Thank you
Igor

P.S. And as a reminder - please note, as we have described multiple times, the owner of WCG is UHN, as a legal entity - they do not support WCG in any form. My group is a research lab within Krembil Research Institute, UHN, and we run WCG like any other research project. We have to find funds for its operation. Note that WCG servers are not at UHN either, as most of my lab servers are also at Sharcnet and Scinet. Thus, it is not appropriate to blame Krembil or UHN - as it would be as accurate as blaming Toronto or Canada.

Our mission remains: Accelerating science by creating a supercomputer empowered by a global community of volunteers. THANK YOU.

[Mar 15, 2023 11:47:53 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Project Badges:


Re: Hardware Recovery Update

Thus, it is not appropriate to blame Krembil or UHN - as it would be as accurate as blaming Toronto or Canada.

For that to be the case, Krembil is sure putting its name out there quite a bit. So if they actually do NOT support WCG, maybe cutting down on the use of their name, logo, and ridiculous slogan might be a good idea.

Ralf

----------------------------------------

[Mar 16, 2023 12:10:43 AM]

Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 87
Status: Offline
Project Badges:


Re: Hardware Recovery Update

Well - there are optics and politics, and there is reality.

I try to minimize it - but I get "reminded" that UHN needs to be mentioned.

Foremost - WCG is a volunteer platform, transforming open science through global collaboration, citizen engagement and youth outreach. My goal is to push that more (even if we cannot remove the other; at least not now).

[Mar 16, 2023 12:31:16 AM]
