World Community Grid - View Thread - 2023-07-31 Update (MCM1 issue resolved)

World Community Grid Forums

Category: Official Messages

Forum: News

Thread Status: Active
Total posts in this thread: 74

This topic has been viewed 579779 times and has 73 replies

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7697
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

200 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

100 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: 2023-07-31 Update (MCM1 issue resolved)

My take on all these "outages", and problems:

The IBM setup of WCG is way too complicated for this small and inexperienced Krembil/Jurisica team. They really should take WCG down and install a "simple" vanilla BOINC system, without all the bells and whistles, which they obviously are not capable of handling.
Then they could come back, and I'm sure WCG would work better. Sure, the original BOINC system doesn't have the "fancy" webpages with all the stuff on them, but they would probably be able to handle it better than IBM's relatively complicated setup.

I am not a big fan of the outages, but not all of them are Krembil's fault. At least one has been the data center. I do agree they probably bit off more than they can chew, but at least they are giving it a try. The alternative was probably to just shut down the project. I would suspect the search is continuing for partner(s) to bolster their workforce, expertise, and reliability.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Aug 7, 2023 12:39:44 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1957
Status: Offline
Project Badges:

5 year badge for The Clean Energy Project - Phase 2

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

50 year badge for Outsmart Ebola Together

20 year badge for FightAIDS@Home - Phase 2

50 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project


Re: 2023-07-31 Update (MCM1 issue resolved)

Sorry Sarge, but here is already where the basic problem starts.

In his one and only reply after the hardware crash in Feb/March, Dr. Jurisica stated that Krembil isn't involved in WCG AT ALL! Despite plastering their name all over the place. Apparently, it is UHN which has signed on the bottom line and is the entity dealing with any donations as well. I have yet to see him come back and provide some more (honest!) details about Krembil's involvement.

Yes, stuff can happen, nobody is contesting that. But unfortunately, far too many fancy stories that don't make any sense have been brought up over the last 14-15 months. The last two, the supposed "cluster of 260 Macs" and the "DHCP client failure" (even if this is a typo and was supposed to be "DHCP server"), just don't make any sense. How can Marist College still participate without any noticeable interference while a mere 260 hosts cause the system to run out of WUs? And the "data center outage", for which I can't find a single hint on any of the IT-related sites and blogs?

And how can they seriously expect to find anyone willing to put up money for the project if they are so lackluster and dishonest in their communication? If they are communicating in the first place! This is something that costs very little to nothing. And timely, honest communication has been a BIG problem right from the (re)start.

Ralf

PS: @Tigerlily Again, PLEASE, stop this moderation nonsense, it didn't make any sense months ago, it makes even less sense now.

----------------------------------------

[Aug 7, 2023 2:23:42 PM]

phillipspencer
Advanced Cruncher
France
Joined: Apr 9, 2015
Post Count: 71
Status: Offline
Project Badges:

10 year badge for Mapping Cancer Markers

14 day badge for Uncovering Genome Mysteries

45 day badge for Outsmart Ebola Together

1 year badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

5 year badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: 2023-07-31 Update (MCM1 issue resolved)

My take on all these "outages", and problems:

The IBM setup of WCG is way too complicated for this small and inexperienced Krembil/Jurisica team. They really should take WCG down and install a "simple" vanilla BOINC system, without all the bells and whistles, which they obviously are not capable of handling.

Then they could come back, and I'm sure WCG would work better. Sure, the original BOINC system doesn't have the "fancy" webpages with all the stuff on them, but they would probably be able to handle it better than IBM's relatively complicated setup.

I agree. Unfortunately, there is now a lot of sunk cost in the current situation, the complexity and resource requirements of which were originally underestimated. Ideally someone (preferably neutral but knowledgeable) should undertake a comparative assessment of whether it is better to stop and start afresh (simple vanilla) or continue trying to improve the current situation. However, I doubt the resources or desire are there to do that.
Also, maybe the answer would be different depending on which project you look at.

[Aug 7, 2023 3:06:24 PM]

BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 609
Status: Offline
Project Badges:

100 year badge for Mapping Cancer Markers

2 year badge for Microbiome Immunity Project

10 year badge for OpenPandemics - COVID-19


Re: 2023-07-31 Update (MCM1 issue resolved)

Does no one think IBM is responsible (a bit to a lot) for much of this? OK! So they wanted out of WCG. We don't need to know the reason.

So they deliver this working system to Krembil but in parts ("Here's a brand new Rolls. All you need to do is assemble it").

They should have delivered a turn-key system. IBM have the knowledge and resources to install this non-vanilla system and make it work anywhere in the world. Toronto is not third world. Maybe Krembil is a bit out of their league but it is easier to learn to drive with a working car than having to assemble it first.

And I'm an IBM fan since the 1401.

Krembil should have taken this as is and not have wasted effort with a brand new front end on top of assembling a complex machine. You don't make a bunch of changes at once then try to figure out which is a problem when something goes wrong.

[Aug 7, 2023 3:30:32 PM]

as1981
Cruncher
Joined: Dec 3, 2006
Post Count: 49
Status: Offline
Project Badges:

14 day badge for OpenPandemics - COVID-19


Re: 2023-07-31 Update (MCM1 issue resolved)

My take on all these "outages", and problems:

...but at least they are giving it a try. The alternative was probably to just shut down the project.

(Firstly to avoid any confusion - I have only quoted part of a post.)

I agree with this. My thoughts are as follows:

System stability - Yes, it's not ideal, but this is not the only project that doesn't have tasks available 24/7. I have several projects configured in BOINC and none of them have any tasks at the moment. It doesn't mean they aren't viable projects. When ownership changes, things can change, just like in many other types of organisation. It doesn't mean it's the new owner's fault.

Communication - Firstly I think in some instances this has improved. I haven't done a proper analysis but I think we are starting to see more explanation of why issues are happening and what's been done.

One suggestion I would make here is that, if possible, it might be useful to provide further information on when the next update is likely to be.

To give an example. We were told that data centre support is weekdays only. That's useful information because now we know if there is a data centre issue late on a Friday then there won't be any updates over the weekend.

I think I'm correct in saying that we don't usually receive updates on a Monday. If that's going to be consistent and something that can be shared then perhaps it would be useful to do that. The reasons why it's not possible to update on a Monday don't necessarily need to be shared, just the fact that it won't happen.

I know it's not always possible to be precise about when things will happen and when things will get fixed (I have some experience from my own job) but I think a bit more information on when updates are likely to be available would be useful if it's possible to do that.

----------------------------------------
[Edit 3 times, last edit by as1981 at Aug 7, 2023 5:17:50 PM]

[Aug 7, 2023 5:06:38 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12436
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

45 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Discovering Dengue Drugs - Together - Phase 2

90 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

45 day badge for Computing for Sustainable Water

20 year badge for Mapping Cancer Markers

5 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

10 year badge for Africa Rainfall Project


Re: 2023-07-31 Update (MCM1 issue resolved)

Logistically, if there is a problem, the fix is not likely to be the same day. So, if a problem occurs from Friday to Monday, the earliest we can expect a solution is Tuesday.

Mike

[Aug 7, 2023 5:38:51 PM]

as1981
Cruncher
Joined: Dec 3, 2006
Post Count: 49
Status: Offline
Project Badges:


Re: 2023-07-31 Update (MCM1 issue resolved)

That is true.

I wasn't particularly meaning posts that advise an issue has been fixed. I was thinking about posts that advise that they know about a problem and are investigating. I don't remember seeing any of those posts on a Monday when we have had weekend issues but I could be wrong.

----------------------------------------
[Edit 1 times, last edit by as1981 at Aug 7, 2023 5:46:05 PM]

[Aug 7, 2023 5:45:08 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1957
Status: Offline
Project Badges:


Re: 2023-07-31 Update (MCM1 issue resolved)

After 15 months now of the same trot, I don't think there's going to be a change in WCG's gait.

As Paul Newman used to say "What we've got here is a failure to communicate"... sad

Ralf

----------------------------------------

[Aug 7, 2023 6:58:50 PM]

Robokapp
Senior Cruncher
Joined: Feb 6, 2012
Post Count: 249
Status: Offline
Project Badges:

2 year badge for Help Fight Childhood Cancer

180 day badge for Help Cure Muscular Dystrophy - Phase 2

180 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries


Re: 2023-07-31 Update (MCM1 issue resolved)

We knew about the change in Sept 2021. It's August 2023. Blaming IBM is academic - at this point it should be running fine with its new owner. How long does it take to 'learn the ropes'?

[Aug 8, 2023 4:28:59 AM]

thunder7
Senior Cruncher
Netherlands
Joined: Mar 6, 2013
Post Count: 232
Status: Offline
Project Badges:

180 day badge for The Clean Energy Project - Phase 2

180 day badge for Drug Search for Leishmaniasis

90 day badge for GO Fight Against Malaria

1 year badge for Uncovering Genome Mysteries

50 year badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

50 year badge for OpenPandemics - COVID-19


Re: 2023-07-31 Update (MCM1 issue resolved)

So does anybody know where exactly that irksome 1000 job limit is in the source? My larger machine just can't download enough to stay active over the (all too frequent) bumps in the road here. The last 55 jobs being crunched by 88 CPUs all show a report deadline of August the 11th, 23:45, so it should be possible to download more and still return them on time. I feel I'm at a disadvantage with one big machine compared to many smaller ones, each downloading 1000 jobs.

There is MAX_WU_RESULTS (which is at 100?), SELECT_LIMIT, QUERY_LIMIT, MAX_JOBS, WF_MAX_RUNNABLE_JOBS, to name a few.
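
To put rough numbers on why a fixed per-host cap hurts a big machine more, here is a minimal Python sketch of the arithmetic. The 1000-job cap and the 88 CPUs come from the post above; the 1.5 hours per job and the 8-thread comparison machine are purely hypothetical values picked for illustration, and the function name is made up, not anything from the BOINC source.

```python
def buffer_hours(job_cap: int, threads: int, hours_per_job: float) -> float:
    """Hours a full queue of `job_cap` jobs keeps `threads` busy,
    assuming one job per thread and equal runtimes."""
    return job_cap * hours_per_job / threads


# Hypothetical average runtime; the post does not state one.
HOURS_PER_JOB = 1.5

big = buffer_hours(1000, 88, HOURS_PER_JOB)   # the 88-CPU machine from the post
small = buffer_hours(1000, 8, HOURS_PER_JOB)  # hypothetical 8-thread machine

print(f"88 threads: {big:.1f} h of buffered work")
print(f" 8 threads: {small:.1f} h of buffered work")
```

Under those assumed numbers the big machine runs dry in well under a day, while the small one has more than a week of work queued, so the same cap gives very different cover during an outage.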

[Aug 8, 2023 4:50:55 AM]
