World Community Grid - View Thread - Best response to local computer disaster is????

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: Best response to local computer disaster is????

Thread Status: Active
Total posts in this thread: 6

This topic has been viewed 375 times and has 5 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Best response to local computer disaster is????

I lost a disk in a striped SSD RAID set - had to restore from image to get back up. Since that box was running 7 active tasks and had the equivalent number "waiting to start", they all went "back in time".

So I "Reset Project".

a) Proper behaviour? It returned me a bunch of results in "detached" status...perhaps I should have allowed the restored-from-image tasks to complete and just taken the "Too lates" (or whatever)?

b) Anything else I should have done (besides apologize profusely to wingmen)?

[Sep 15, 2011 12:29:49 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Best response to local computer disaster is????

Hello ibsteve2u,
The answer is a. You told the server you would not be running the obsolete work units, dumped them without wasting computer time, and got started on new work units. This is exactly the sort of situation that 'Reset Project' is supposed to deal with.

Please award yourself a gold star!

:D

Lawrence

[Sep 15, 2011 10:10:10 PM]

kffitzgerald
Senior Cruncher
USA
Joined: Jan 29, 2011
Post Count: 222
Status: Offline

Re: Best response to local computer disaster is????

If you are going to use RAID, it would be better to use a striped set with parity (RAID 5). Granted, it uses an additional drive, but in your case all you would have had to do is replace the dead drive, with no restore required, and no data would have been lost or delayed.
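To illustrate why a parity stripe survives a single dead drive, here is a minimal Python sketch of XOR parity. It is a toy model only: the block contents and the helper name `xor_blocks` are made up for illustration, and real RAID 5 controllers rotate parity across drives and work on much larger stripes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per data drive in a stripe.
data = [b"BOINC", b"tasks", b"saved"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the second drive: rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing one, which is why replacing the dead drive is enough.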

[Sep 16, 2011 10:41:10 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Best response to local computer disaster is????

Lawrence

Thanks! That is reassuring, especially as I have been on both the giving and receiving end of somebody pushing "a button" here and it causing a "crying" way over there.

[Sep 16, 2011 11:05:00 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Best response to local computer disaster is????

IMO, it should not have been an issue...I was somewhat surprised when the Microsoft backup software in Windows 7 Ultimate x64 told me that I could not skip restore of the 4-disk RAID 10 data set where BOINC/WCG is running if I wanted to restore a system image to the 2-disk RAID 0 system disk where the O/S resides.

In hindsight, I presume Microsoft defines a "system image" to include programs and/or the pagefile and/or temporary/scratch directories, all of which I had either split off or pushed off entirely onto the data set to save space and reduce I/Os on the SSD stripe set.

I conclude that it was rather rude of Intel to limit the number of SATA ports on the ICH10 to six when it is apparently obvious to anyone that you need at least seven.

[Sep 16, 2011 11:19:40 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Best response to local computer disaster is????

Hi,

Each client-server connection increments a counter ** on every successful handshake, on both sides, and the server compares its value with the client's. When you restored, you went back to a state the servers did not agree with. Even if you had not reset the project and had run those jobs to the end, things would not have been in sync on the first reconnect, very probably resulting in wasted time, i.e. you did well to just move on.

The moment the *detached* occurred, the task will have been reassigned in rush mode, i.e. the wingman will be seeing a new partner reporting within 48 hours, most often quicker and maybe even sooner, depending on how big your lost buffer was ;>)

** This is the counter value: <rpc_seqno>14814</rpc_seqno>
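As a rough illustration of the counter check described above, here is a small Python sketch. It is not the actual BOINC or WCG server code: the comparison rule, the function names, and the restored client value are assumptions made for the example. Only the `<rpc_seqno>` element and its value come from the post.

```python
import xml.etree.ElementTree as ET

def read_rpc_seqno(fragment):
    """Parse an <rpc_seqno> element like the one quoted in the post."""
    return int(ET.fromstring(fragment).text)

def client_state_is_stale(client_seqno, server_seqno):
    """Assumed rule: a client counter behind the server's implies a rollback."""
    return client_seqno < server_seqno

# Value the server last recorded (taken from the fragment in the post).
server_seqno = read_rpc_seqno("<rpc_seqno>14814</rpc_seqno>")

# Hypothetical value found in a client restored from an older image.
restored_client_seqno = 14790

stale = client_state_is_stale(restored_client_seqno, server_seqno)
print(stale)
```

Under this assumed rule, a client restored from an image reports a counter lower than the one the server remembers, so the mismatch shows up on the first reconnect.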

--//--

----------------------------------------
[Edit 1 times, last edit by Former Member at Sep 16, 2011 11:29:12 AM]

[Sep 16, 2011 11:27:58 AM]
