World Community Grid Forums
Thread Status: Active Total posts in this thread: 74
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7697 Status: Offline |
My take on all these "outages", and problems: The IBM setup of WCG, is way too complicated for this small and inexperienced Krembil/Jurisica team. They really should take WCG down, and install a "simple" vanilla BOINC system, without all the bells and whistles, which they obviously are not capable of handling. Then they could come back, and I'm sure WCG would work better. Sure, the original BOINC system, doesn't have the "fancy" webpages, with all stuff on it, but they would probably be able to handle it better than the IBM relatively complicated setup.

I am not a big fan of the outages, but not all of them are Krembil's fault. At least one has been the data center. I do agree they probably bit off more than they can chew, but at least they are giving it a try. The alternative was probably to just shut down the project. I would suspect the search is continuing for partner(s) to bolster their workforce and expertise and reliability.

Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1957 Status: Offline |
I am not a big fan of the outages, but not all of them are Krembil's fault. At least one has been the data center. I do agree they probably bit off more than they can chew, but at least they are giving it a try. The alternative was probably to just shut down the project. I would suspect the search is continuing for partner(s) to bolster their workforce and expertise and reliability. Cheers

Sorry Sarge, but here is already where the basic problem starts. In his one and only reply after the hardware crash in Feb/March, Dr. Jurisica stated that Krembil isn't involved in WCG AT ALL! Despite plastering their name all over the place. Apparently, it is UHN which has signed on the bottom line, and which is the entity dealing with any donations as well. I have yet to see him come back and provide some more (honest!) details about Krembil's involvement.

Yes, stuff can happen, nobody is contesting that. But unfortunately, far too many fancy stories that don't make any sense have been brought up over the last 14-15 months. The last two, the supposed "cluster of 260 Macs" and the "DHCP client failure" (even if this is a typo and was supposed to be "DHCP server"), just don't make any sense. How can Marist College still participate without any noticeable interference while a mere 260 hosts cause the system to run out of WUs? And the "data center outage", for which I can't find a single hint on any of the IT-related sites and blogs?

And how can they seriously expect to find anyone willing to put up money for the project if they are so lackluster and dishonest in their communication? If they are communicating in the first place! This is something that costs very little to nothing. And timely, honest communication has been a BIG problem right from the (re)start.

Ralf

PS: @Tigerlily Again, PLEASE, stop this moderation nonsense, it didn't make any sense months ago, it makes even less sense now. |
||
|
phillipspencer
Advanced Cruncher France Joined: Apr 9, 2015 Post Count: 71 Status: Offline |
My take on all these "outages", and problems: The IBM setup of WCG, is way too complicated for this small and inexperienced Krembil/Jurisica team. They really should take WCG down, and install a "simple" vanilla BOINC system, without all the bells and whistles, which they obviously are not capable of handling. Then they could come back, and I'm sure WCG would work better. Sure, the original BOINC system, doesn't have the "fancy" webpages, with all stuff on it, but they would probably be able to handle it better than the IBM relatively complicated setup.

I agree. Unfortunately, there is now a lot of sunk cost in the current situation, the complexity and resource requirements of which were underestimated originally. Ideally someone (preferably neutral but knowledgeable) should undertake a comparative assessment of whether it is better to stop and start afresh (simple vanilla) or continue trying to improve the current situation. However, I doubt the resources or desire are there to do that. Also, maybe the answer would be different depending on which project you look at. |
||
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 609 Status: Offline |
Does no one think IBM is responsible (a bit to a lot) for much of this? OK! So they wanted out of WCG. We don't need to know the reason.
So they deliver this working system to Krembil but in parts ("Here's a brand new Rolls. All you need to do is assemble it"). They should have delivered a turn-key system. IBM have the knowledge and resources to install this non-vanilla system and make it work anywhere in the world. Toronto is not third world. Maybe Krembil is a bit out of their league but it is easier to learn to drive with a working car than having to assemble it first. And I'm an IBM fan since the 1401. Krembil should have taken this as is and not have wasted effort with a brand new front end on top of assembling a complex machine. You don't make a bunch of changes at once and then try to figure out which one is the problem when something goes wrong. |
||
|
as1981
Cruncher Joined: Dec 3, 2006 Post Count: 49 Status: Offline |
My take on all these "outages", and problems: ...but at least they are giving it a try. The alternative was probably to just shut down the project.

(Firstly to avoid any confusion - I have only quoted part of a post.) I agree with this. My thoughts are as follows:

System stability - Yes it's not ideal but this is not the only project that doesn't have tasks available 24/7. I have several projects configured in BOINC and none of them have any tasks at the moment. It doesn't mean they aren't viable projects. When ownership changes, things can change, just like in many other types of organisation. It doesn't mean it's the new owner's fault.

Communication - Firstly I think in some instances this has improved. I haven't done a proper analysis but I think we are starting to see more explanation of why issues are happening and what's been done. One suggestion I would make here is that, if possible, it might be useful to provide further information on when the next update is likely to be. To give an example: we were told that data centre support is weekdays only. That's useful information because now we know that if there is a data centre issue late on a Friday then there won't be any updates over the weekend. I think I'm correct in saying that we don't usually receive updates on a Monday. If that's going to be consistent and something that can be shared then perhaps it would be useful to do that. The reasons why it's not possible to update on a Monday don't necessarily need to be shared, just the fact that it won't happen.

I know it's not always possible to be precise about when things will happen and when things will get fixed (I have some experience from my own job) but I think a bit more information on when updates are likely to be available would be useful if it's possible to do that.

[Edit 3 times, last edit by as1981 at Aug 7, 2023 5:17:50 PM] |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12436 Status: Offline |
Logistically, if there is a problem, the fix is not likely to be the same day. So, if a problem occurs between Friday and Monday, the earliest we can expect a solution is Tuesday.
Mike |
||
|
as1981
Cruncher Joined: Dec 3, 2006 Post Count: 49 Status: Offline |
That is true.
I wasn't particularly meaning posts that advise an issue has been fixed. I was thinking about posts that advise that they know about a problem and are investigating. I don't remember seeing any of those posts on a Monday when we have had weekend issues but I could be wrong.

[Edit 1 times, last edit by as1981 at Aug 7, 2023 5:46:05 PM] |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1957 Status: Offline |
That is true.

After 15 months now with the same trot, I don't think there's going to be a change in WCG's gait.

I wasn't particularly meaning posts that advise an issue has been fixed. I was thinking about posts that advise that they know about a problem and are investigating. I don't remember seeing any of those posts on a Monday when we have had weekend issues but I could be wrong.

As Paul Newman used to say "What we've got here is a failure to communicate"...

Ralf |
||
|
Robokapp
Senior Cruncher Joined: Feb 6, 2012 Post Count: 249 Status: Offline |
We knew about the change in Sept 2021. It's August 2023. Blaming IBM is academic - at this point it should be running fine with its new owner. How long does it take to 'learn the ropes'?
|
||
|
thunder7
Senior Cruncher Netherlands Joined: Mar 6, 2013 Post Count: 232 Status: Offline |
So does anybody know where exactly that irksome 1000 job limit is in the source? My larger machine just can't download enough to stay active over the (all too frequent) bumps in the road here. The last 55 jobs being crunched by 88 CPUs all show a report deadline of August the 11th, 23:45, so it should be possible to download more and still return them on time. I feel I'm at a disadvantage with one big machine compared to many smaller ones, each downloading 1000 jobs.
There are MAX_WU_RESULTS (which is at 100?), SELECT_LIMIT, QUERY_LIMIT, MAX_JOBS, WF_MAX_RUNNABLE_JOBS, to name a few. |
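To put rough numbers on the point above, here is a minimal sketch of how long a fixed per-host job cap lasts on machines of different sizes. The 1000-job cap and the 88 CPUs come from the post; the 2-hour average job runtime is purely a hypothetical figure for illustration, and the function names are made up here, not taken from the BOINC source.

```python
import math


def hours_of_work(job_cap: int, threads: int, avg_job_hours: float) -> float:
    """Hours a full cache keeps all threads busy, assuming one job per thread."""
    if threads <= 0 or avg_job_hours <= 0:
        raise ValueError("threads and avg_job_hours must be positive")
    return job_cap * avg_job_hours / threads


def jobs_needed(threads: int, avg_job_hours: float, outage_hours: float) -> int:
    """Jobs required to keep all threads busy through an outage of the given length."""
    if threads <= 0 or avg_job_hours <= 0:
        raise ValueError("threads and avg_job_hours must be positive")
    return math.ceil(threads * outage_hours / avg_job_hours)


# Hypothetical average runtime per job; NOT a figure from the thread.
AVG_JOB_HOURS = 2.0

big = hours_of_work(1000, 88, AVG_JOB_HOURS)    # about 22.7 hours of buffer
small = hours_of_work(1000, 4, AVG_JOB_HOURS)   # 500 hours of buffer
weekend = jobs_needed(88, AVG_JOB_HOURS, 48)    # jobs to ride out a 48-hour gap
```

Under that assumed runtime, the same cap gives an 88-thread host under a day of buffer while a 4-thread host gets about three weeks, which is the imbalance described in the post. Actual runtimes vary by project, so the figures are only indicative.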
||
|