World Community Grid Forums
Thread Status: Active Total posts in this thread: 203
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
This post contains all of the updates of the hardware recovery process we have been experiencing since the beginning of March. Please keep all discussion of the hardware recovery process in this thread. Thank you for your support, patience and understanding.
----------------------------------------
WCG team

March 31, 2023 update

Data transfer update & maintenance check

Earlier in the week, we ran into HDD failures while transferring the data from recovery storage to the new storage system. The issue has since been resolved and we have resumed transferring data, expecting to finish it by 5pm today.

At 4pm, we will be conducting a brief maintenance on the website and forums to transfer the DB2 filesystems to our new storage system, which will result in restricted access to the website for up to 30 minutes. If all goes well, this could be the final step towards the full storage system upgrade.

We evaluated the possibility of starting download of processed WUs, while not sending new WUs out. It was determined that the risk of complications that might result from doing this with incomplete information available to our scheduler and BOINC or any other unforeseen issues is too high. We have extended the deadlines for workunits that were processed and await upload to WCG.

While we wait for the data transfer to finish, we are working on resolving other long standing issues such as device recognition.

March 27, 2023 update

Data transfer to new storage system

We have started transferring all data from the recovery storage unit to our new storage system on Friday. Based on the current rate of transfer, we expect to have all data transferred/verified later this week. We will then download all processed WUs, after which we can resume sending work units to volunteers. We plan to start with MCM and OPN/OPNG; followed by ARP and then the new SCC work units.

In the meantime, we have confirmed that our daily database backups for BOINC and for the website/forums are working. The databases have been recovered and transferred to the new, faster storage already. Incremental backup to tape archive has been implemented on the new storage.

March 20, 2023 update

New hardware

While we prepare the new and improved hardware to host our databases and parallel filesystems, we have been using a temporary system provided to us by the data center. All data is confirmed intact and there has been no data loss as we continue to recover. The recovery system is a stand-in for the storage server that failed, selected for hardware compatibility to recover the data. We will not be continuing with the recovery system indefinitely, and it will be discontinued only once the new storage system has been fully installed and synced with the recovery system for a smooth handoff.

BOINC database is UP

The BOINC database is now up and running, joining the website/forums database which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit management stack, to the state it was in at the time of the hardware failure. Deadlines have been extended and valid results computed during this pause will be credited when we resume.

Website crashes

During the hardware recovery process the website has been intermittently crashing. Looking into the cause we identified bugs that only present themselves in such cases as the BOINC database being offline, and other resources unavailable as we recover the system. The website will now remain available to users in these cases or restart automatically after crashing.

In the meantime, we have posted research updates from the ARP and MCM teams. We are planning on sharing more updates soon.

Initial Hardware Recovery Update

[Edit 2 times, last edit by Cyclops at Mar 31, 2023 7:00:44 PM]
||
|
Bryn Mawr
Senior Cruncher Joined: Dec 26, 2018 Post Count: 344 Status: Offline |
Thanks for the update, looking forward to getting back to work :-)
|
||
|
Freewill
Cruncher United States Joined: Mar 28, 2006 Post Count: 39 Status: Offline |
When will upload/download of workunits be restarted?
|
||
|
Jake1402
Senior Cruncher USA Joined: Dec 30, 2005 Post Count: 181 Status: Offline |
"When will upload/download of workunits be restarted?"

From Cyclops above: "The BOINC database is now up and running, joining the website/forums database which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit management stack, to the state it was in at the time of the hardware failure. Deadlines have been extended and valid results computed during this pause will be credited when we resume."

Based on past performance their estimates aren't very accurate.
Join the Chicago-IL-USA team!
2 AMD FX 8320/AMD R9 270X/Win 10 2 AMD FX 8320/AMD RX 560/Linux Mint 20.3 (both computers DOA) Intel Pentium G240/Win 10 |
||
|
Kirel2
Advanced Cruncher United States Joined: Sep 24, 2014 Post Count: 99 Status: Offline |
I presume they will let us know when uploads/downloads are working again, so we just have to be patient.
||
|
xuejc1988
Cruncher Joined: Oct 28, 2008 Post Count: 1 Status: Offline |
With a new operator, various engineering problems are incredibly frequent. It's a shame.
|
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2154 Status: Offline |
"Deadlines have been extended"

Yes they have, but the extended deadlines will expire soon.
I already have extended deadlines that expire on March 24, 26, and 27. So, unless you are sure that BOINC uploads and reporting will be available before those dates, you'd better extend the deadlines again. |
||
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 2982 Status: Offline |
Sounds like slow but steady progress. At times, it's best not to rush these things, as rushing tends to lead to steps being skipped or messed up, forcing rework. Keep up the 'good fight'.
----------------------------------------
It always amuses me when someone says "new and improved", as it can't be both; it's GOT to be one or the other. If it's 'new', it can't have been in existence before, and likewise, if it's 'improved', then it can't be new (i.e., something's got to have existed before). BUT, I know what you mean. |
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 946 Status: Offline |
Thank you for the update!
|
||
|