Thread Status: Closed
Total posts in this thread: 196
Posts: 196   Pages: 20
This topic has been viewed 1728497 times and has 195 replies
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7670
Re: Hardware Recovery Update

Redundancy is not a word that I like to use. I prefer Backup.
Years ago, I worked for an Insurance Company that had very heavy mainframe use and all data was very heavily backed up. They were OK with the few weeks it would take to replace hardware but, when they discovered that it would take 6 months to replace the air conditioning, they built an extra suite as backup.

Mike

Years ago I also worked as a mainframe operator. We always ran 2 sets of backups, one to keep on site and one to keep offsite. Each client got their own set of backup tapes run, so in case there was only corruption on one client there would be a specific backup for them.
The stack of backup tapes would be about 4 ft (1.2 meters) high. Everything was on 9-track tapes. (We did have one 7-track drive for a specific customer for input, but never output anything on that drive.) The tapes for remote storage would get loaded in a vehicle and driven to a storage facility about 1 mile (1.6 km) away. They would then bring back the oldest set of backups to be re-entered into the pool of available tapes.
During the course of my employment there we never had to use any of the backups. In my opinion the one thing we were missing was a periodic check of the integrity of the tapes. When they were written, it was always with the verify option on, so maybe the powers that be figured it was not necessary to actually test any of the tapes brought back from storage.
Occasionally you would get a write error when producing a backup, and the solution was to just junk that tape, as I guess they were cheap enough.
So much for a trip down memory lane.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Mar 15, 2023 8:36:40 PM]
binventive
Cruncher
Joined: May 3, 2007
Post Count: 13
Re: Hardware Recovery Update

Has any thought been given by Krembil to improving the reliability, stability, and overall production-readiness of the various components of the application and its associated hardware and network components? It seems that there is little redundancy overall (although I was pleased to hear that the drive(s) were configured via RAID, hopefully at least RAID level 1 (mirroring)). Also, when an issue occurs late in the week, apparently nobody is available to investigate or resolve it over the weekend. It feels like a configuration overall where errors and downtime are resolved without necessarily a sense of urgency. The instability and downtime lead to frustration and a lack of faith amongst at least some of us who donate our computer resources to supporting World Community Grid.

I am glad that there are alternative medical/biology-related distributed computing options available (e.g., Folding@Home and various BOINC-related projects such as Rosetta, TN-Grid, Denis@Home, and SiDock@Home).
[Mar 15, 2023 9:51:14 PM]
bcavnaugh
Cruncher
USA
Joined: Nov 8, 2013
Post Count: 13
Re: Hardware Recovery Update

I hate to ask, but why is your system not on RAID 5, which would tolerate a disk failure?
At least the main OS should be on RAID 1, and the database on RAID 5 or better.
----------------------------------------
[Edit 1 times, last edit by bcavnaugh at Mar 15, 2023 9:55:08 PM]
[Mar 15, 2023 9:53:36 PM]
Yavanius
Senior Cruncher
Antarctica
Joined: Jan 21, 2015
Post Count: 191
Re: Hardware Recovery Update


I've been translating all the WCG timing into Reverse Scotty and it seems to track.

"How long will it take, WCG?"
"It should only be a one-day fix, but for you Cap'n, I can do it in three weeks!"

At this point, a temporal distortion taking us back a couple of weeks would come in handy.


Go back 4 weeks and plant the idea to do a test of backing up the project to a spare unit. So 2 weeks to work out the bugs and then BAM, the main server goes down...well, that's mightily convenient we were doing a test run of the backup system...

Then we'll just grouch instead of bringing pitchforks and torches.
[Mar 15, 2023 10:46:06 PM]
Yavanius
Senior Cruncher
Antarctica
Joined: Jan 21, 2015
Post Count: 191
Re: Hardware Recovery Update

Double post. Forum software doesn't check if you back up to the editor.
----------------------------------------
[Edit 1 times, last edit by Yavanius at Mar 15, 2023 10:49:28 PM]
[Mar 15, 2023 10:47:41 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Re: Hardware Recovery Update

I hate to ask, but why is your system not on RAID 5, which would tolerate a disk failure?
At least the main OS should be on RAID 1, and the database on RAID 5 or better.
Because it's not a RAID problem, but a problem with accessing the controller in the first place. And one problem with RAID 5 is that, if you have a degraded array, with today's drive sizes it takes forever to get in sync again.
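
For a rough sense of scale, here is a back-of-the-envelope sketch of the resync time. The drive size and rebuild rate are illustrative assumptions, not figures for WCG's actual hardware, and the function name is made up for this example. It assumes the rebuild has to write the full capacity of the replacement drive at a constant sustained rate; an array that is still serving requests would typically be slower.

```python
def rebuild_hours(drive_tb: float, rate_mb_s: float) -> float:
    """Estimate hours to resync one replacement drive.

    drive_tb  -- drive capacity in decimal terabytes (assumed value)
    rate_mb_s -- sustained rebuild rate in MB/s (assumed value)
    """
    total_bytes = drive_tb * 1e12          # 1 TB = 10^12 bytes
    seconds = total_bytes / (rate_mb_s * 1e6)
    return seconds / 3600


# Hypothetical 18 TB drive rebuilding at 150 MB/s
print(f"{rebuild_hours(18, 150):.1f} hours")  # 33.3 hours
```

So even under ideal conditions a single large drive is well over a day of rebuild time, during which a RAID 5 array has no remaining redundancy.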

Back under IBM, there was at some point a change to ZFS as the underlying file system. That didn't go quite smoothly (though with only a couple of days' delay, IIRC), but it seems that the system currently deployed at WCG is running on a more "traditional" file system.

Ralf
[Mar 15, 2023 11:06:43 PM]
Robokapp
Senior Cruncher
Joined: Feb 6, 2012
Post Count: 249
Re: Hardware Recovery Update

it seems that the system currently deployed at WCG is running


wink
[Mar 15, 2023 11:45:52 PM]
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 87
Re: Hardware Recovery Update

Dear all - I really do appreciate your patience and your thoughts and suggestions. Indeed, many of them we have thought about and some of them we were able to implement. However, as posted before - we do not have the budget IBM was able to put into WCG - we are working on improving funding as we go.

Unfortunately, it was not "just" a disk failure - that would be fine - and it was not "only" the RAID controller failure - the bus failed -- so the whole storage unit failed.


We are grateful for your contribution of computing resources. Some of you may also be able to help WCG by introducing us to foundations or companies that may be interested in becoming a funding or contributing partner. We have tried and continue trying to form partnerships with Dell, Lenovo, HPE, ... nVidia, AMD, Intel ... AWS, Google, -- that may help us (backend of WCG) but also projects (e.g., ARP) in solving equipment challenges (and insufficient government funding for science). Cold calls do not get us far -- but opening the door may help tremendously.


Redundancy for safety and performance would be marvelous. When we were running the HCC project - funded by NIH - we had 4 backups at 3 geographically different locations on three different media. Call us paranoid - but there are many stories about how individual media looked great at the beginning, only to become obsolete very quickly.


Brief update from today:
Database filesystems are up and the data are backed up (tape). Databases are up on the temp-replacement DSS7000 storage system. Science remains down. We are going to create a copy of the science file system while we try to repair it. The RAID controllers seemed to not know about failed disks after restart or swap, and would rebuild RAIDs back to their original configuration.



Thank you
Igor

P.S. And as a reminder - please note, as we described multiple times, the owner of WCG is UHN, as a legal entity - they do not support WCG in any form. My group is a research lab within Krembil Research Institute, UHN, and we run WCG, as any other research project. We have to find funds for its operation. Note that WCG servers are not at UHN either, as most of my lab servers are also at Sharcnet and Scinet. Thus, it is not appropriate to blame Krembil or UHN - as it would be as accurate as blaming Toronto or Canada.

Our mission remains: Accelerating science by creating a supercomputer empowered by a global community of volunteers. THANK YOU.
[Mar 15, 2023 11:47:53 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Re: Hardware Recovery Update

Thus, it is not appropriate to blame Krembil or UHN - as it would be as accurate as blaming Toronto or Canada.
For that to be the case, Krembil is sure putting its name out there quite a bit. So if they actually do NOT support WCG, maybe cutting down on the use of their name, logo and their ridiculous slogan might be a good idea.

Ralf
[Mar 16, 2023 12:10:43 AM]
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 87
Re: Hardware Recovery Update

Well - there are optics and politics and there is reality.

I try to minimize it - but I get "reminded" that UHN needs to be mentioned.

Foremost - WCG is a volunteer platform, transforming open science through global collaboration, citizen engagement and youth outreach. My goal is to push that more (even if we cannot remove the other; at least not now).
[Mar 16, 2023 12:31:16 AM]