World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello everyone,
OK, we have freedom of speech here and everyone can discuss as much as they like how the "phase-out" of a project should be done, just as in this thread, https://secure.worldcommunitygrid.org/forums/...ad_thread,38038_offset,80 especially towards its end. In my view, the techs do it well and they should be given the time they need. I would not make a great fuss about it. But, in my view, it is really necessary to get good information on what I call "the hidden stepstones of workunits". Such a stepstone is the point in the workunit you have to reach if you do not want to lose all the work done in that workunit when switching your computer off. I searched the forum for the word "stepstone": not a single result. Clean Energy Phase 2 seems to have such stepstones; all the other projects seem not to have them. It is not good to crunch for 7 hours and then lose all the work done and start the same workunit all over again. At the very least, I would like to be informed about those "hidden stepstones" in the "System Requirements" chapter. What do my dear fellow crunchers, the community advisors, the techs, everybody think about those stepstones and the in-depth information on them, which probably does not exist yet? The discussion is open. Thanks for any answers. All the best and keep crunching. Martin Schnellinger |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
They're called 'Checkpoints' in the BOINC world! When a task writes a progress step, it's called 'Checkpointing'.
Re-search and thou shalleth find. :D |
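For readers new to the term, here is a minimal sketch of what checkpointing means in general. It is not BOINC's or any WCG application's actual code; the function name `run_with_checkpoints`, the JSON state file, and the `stop_after` parameter (which simulates switching the machine off) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, state_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after simulates the machine being switched off after that many
    steps in the current session (hypothetical parameter for this sketch).
    Returns the final sum, or None if the session was interrupted.
    """
    # Restore the last saved state if there is one, else start from scratch.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    done_this_session = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_session >= stop_after:
            return None  # interrupted; the last checkpoint is still on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_session += 1
        # Checkpoint: write to a temporary file, then rename it into place,
        # so a crash in the middle of the write cannot corrupt the saved state.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    print(run_with_checkpoints(10, path, stop_after=4))  # interrupted session
    print(run_with_checkpoints(10, path))                # resumes at step 4
```

The point of the sketch is only that work done since the last saved state is what gets lost on shutdown; how often a real application can save depends on the science code.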
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello,
Thanks for the answer. Let me explain why I think the question is important: I have a machine which is on only 5 minutes a day and I will not change this. Among other things, the checkpoints seem to be the reason why this machine always reports units with "No Reply". Let me take the log of a Clean Energy result:
Result Log
Result Name: E231406_ 775_ S.300.C30H14N6O2S3.NUGDYUUXXHOABT-UHFFFAOYSA-N.7_ s1_ 14_ 1--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[04:51:57] Number of jobs = 8
[04:51:57] Starting job 0,CPU time has been restored to 0.000000.
[04:51:57] Starting new Job
[04:51:57] Qink name = fldman
[04:51:59] Qink name = gesman
[04:52:01] Qink name = scfman
[05:32:31] Qink name = anlman
[05:32:31] Qink name = drvman
[05:35:17] Qink name = optman
[05:35:17] Qink name = fldman
[05:35:17] Qink name = gesman
[05:35:19] Qink name = scfman
[05:51:47] Qink name = anlman
[05:51:47] Qink name = drvman
[05:54:34] Qink name = optman
[05:54:35] Qink name = fldman
[05:54:35] Qink name = gesman
[05:54:37] Qink name = scfman
[06:09:57] Qink name = anlman
[06:09:57] Qink name = drvman
[06:12:40] Qink name = optman
[06:12:41] Qink name = fldman
[06:12:41] Qink name = gesman
[06:12:43] Qink name = scfman
[06:27:55] Qink name = anlman
[06:27:55] Qink name = drvman
[06:30:41] Qink name = optman
[06:30:41] Qink name = fldman
[06:30:41] Qink name = gesman
[06:30:43] Qink name = scfman
[06:45:53] Qink name = anlman
[06:45:53] Qink name = drvman
[06:48:36] Qink name = optman
[06:48:37] Qink name = fldman
[06:48:37] Qink name = gesman
[06:48:39] Qink name = scfman
[07:03:41] Qink name = anlman
[07:03:41] Qink name = drvman
[07:06:26] Qink name = optman
[07:06:26] Qink name = fldman
[07:06:26] Qink name = gesman
[07:06:28] Qink name = scfman
[07:21:18] Qink name = anlman
[07:21:18] Qink name = drvman
[07:24:01] Qink name = optman
[07:24:01] Qink name = fldman
[07:24:01] Qink name = gesman
[07:24:03] Qink name = scfman
[07:38:54] Qink name = anlman
[07:38:54] Qink name = drvman
[07:41:36] Qink name = optman
[07:41:36] Qink name = fldman
[07:41:36] Qink name = gesman
[07:41:38] Qink name = scfman
[07:56:45] Qink name = anlman
[07:56:45] Qink name = drvman
[07:59:26] Qink name = optman
[07:59:27] Qink name = fldman
[07:59:27] Qink name = gesman
[07:59:29] Qink name = scfman
[08:12:02] Qink name = anlman
[08:12:02] Qink name = drvman
[08:14:43] Qink name = optman
[08:14:43] Qink name = fldman
[08:14:43] Qink name = gesman
[08:14:45] Qink name = scfman
[08:26:02] Qink name = anlman
[08:26:02] Qink name = drvman
[08:28:44] Qink name = optman
[08:28:44] Qink name = fldman
[08:28:44] Qink name = gesman
[08:28:46] Qink name = scfman
[08:40:10] Qink name = anlman
[08:40:10] Qink name = drvman
[08:42:51] Qink name = optman
[08:42:51] Qink name = fldman
[08:42:51] Qink name = gesman
[08:42:53] Qink name = scfman
[08:54:57] Qink name = anlman
[08:54:57] Qink name = drvman
[08:57:40] Qink name = optman
[08:57:40] Qink name = anlman
[08:59:47] End of Job
[08:59:48] Finished Job #0
[08:59:48] Starting job 1,CPU time has been restored to 13476.197905.
[08:59:49] Starting new Job
[08:59:49] Qink name = fldman
[08:59:50] Qink name = gesman
[08:59:51] Qink name = scfman
[09:15:10] Qink name = anlman
[09:17:15] End of Job
[09:17:16] Finished Job #1
[09:17:16] Starting job 2,CPU time has been restored to 14479.463347.
[09:17:16] Starting new Job
[09:17:16] Qink name = fldman
[09:17:17] Qink name = gesman
[09:17:18] Qink name = scfman
[09:31:48] Qink name = anlman
[09:33:53] End of Job
[09:33:54] Finished Job #2
[09:33:54] Starting job 3,CPU time has been restored to 15401.799753.
[09:33:55] Starting new Job
[09:33:55] Qink name = fldman
[09:33:56] Qink name = gesman
[09:33:57] Qink name = scfman
[09:57:37] Qink name = anlman
[09:59:48] End of Job
[09:59:49] Finished Job #3
[09:59:49] Starting job 4,CPU time has been restored to 16891.221763.
[09:59:49] Starting new Job
[09:59:49] Qink name = fldman
[09:59:51] Qink name = gesman
[09:59:51] Qink name = scfman
[10:15:05] Qink name = anlman
[10:17:10] End of Job
[10:17:11] Finished Job #4
[10:17:11] Starting job 5,CPU time has been restored to 17884.202977.
[10:17:12] Starting new Job
[10:17:12] Qink name = fldman
[10:17:13] Qink name = gesman
[10:17:14] Qink name = scfman
[10:25:20] Qink name = anlman
[10:29:24] End of Job
[10:29:25] Finished Job #5
[10:29:25] Starting job 6,CPU time has been restored to 18589.232631.
[10:29:26] Starting new Job
[10:29:26] Qink name = fldman
[10:29:35] Qink name = gesman
[10:29:38] Qink name = scfman
Application exited with RC = 0x100
[14:22:12] Finished Job #6
[14:22:12] Starting job 7,CPU time has been restored to 31905.805185.
[14:22:12] Skipping Job #7
14:22:17 (3851): called boinc_finish
Where are/were the checkpoints? Are the checkpoints identical with the finish of a single job? Does WCG have any data on the percentage of jobs sent out that end with the final status "No Reply" or "Too Late"? Thanks. I have a feeling that we could find a lot of lost computing power among users running their machines only 5 minutes a day. Greetings Martin Schnellinger |
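One way to read a log like this is to look at the "CPU time has been restored to" values. Assuming they are cumulative CPU seconds at the start of each job (an interpretation of the log text, not something the application documents here), the difference between two consecutive values approximates the CPU time spent in one job. The sketch below uses a short excerpt of the log above; the names `RESTORED` and `cpu_seconds_per_job` are made up for illustration.

```python
import re

# Excerpt copied from the stderr log quoted in the post above.
SAMPLE = """\
[04:51:57] Starting job 0,CPU time has been restored to 0.000000.
[08:59:48] Finished Job #0
[08:59:48] Starting job 1,CPU time has been restored to 13476.197905.
[09:17:16] Finished Job #1
[09:17:16] Starting job 2,CPU time has been restored to 14479.463347.
"""

# Captures the job number and the cumulative CPU seconds reported at its start.
RESTORED = re.compile(r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)")


def cpu_seconds_per_job(log_text):
    """Return {job_number: CPU seconds} for every job whose successor is logged."""
    restored = {int(n): float(t) for n, t in RESTORED.findall(log_text)}
    return {
        n: restored[n + 1] - restored[n]
        for n in sorted(restored)
        if n + 1 in restored
    }


if __name__ == "__main__":
    for job, seconds in cpu_seconds_per_job(SAMPLE).items():
        print(f"job {job}: {seconds / 60:.1f} CPU minutes")
```

On this excerpt, job 0 alone accounts for roughly 13,476 CPU seconds (about 3.7 hours), so if progress is only saved at job boundaries, a 5-minute session would never get that far.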
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Nothing lost, nothing gained either. Five-minute devices are just not going to be able to contribute at WCG, even with 1-minute checkpoints, simply because in 99% of cases the total runtime far exceeds the deadline at your computing rate. Say a task is hardly ever shorter than 1 hour [visit http://bit.ly/WCGART1 for current averages per project] and the deadline is never longer than 10 days: how are you going to fit 12 sessions of 5 minutes into at most 10 days? Never!
Once upon a time there was this scaling dream: make tasks according to computing capability and regular on-time. If the dream still exists, the person with whom some shared it left the building in 2014. Now that BOINC development has gone under community governance, I doubt much will happen, but I could always be wrong. |
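The arithmetic behind that "Never!" can be written out as a tiny check. This is only a sketch of the numbers quoted in the post (1-hour task, 5 minutes a day, 10-day deadline); the function name `fits_deadline` is hypothetical, and it generously assumes perfect checkpointing with no start-up overhead.

```python
def fits_deadline(task_minutes, minutes_per_day, deadline_days):
    """Return (fits, available_minutes) for a machine with a fixed daily on-time.

    Assumes every minute of on-time goes into the task and nothing is lost
    between sessions, which is the best case for the volunteer.
    """
    available = minutes_per_day * deadline_days
    return available >= task_minutes, available


if __name__ == "__main__":
    fits, available = fits_deadline(task_minutes=60, minutes_per_day=5, deadline_days=10)
    print(fits, available)  # False 50
```

Even in this best case the machine accumulates 50 minutes before the deadline, 10 minutes short of a 1-hour task.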
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
BTW, different programmers, different names for checkpoints. For CEP2 they are called jobs:
[10:29:25] Starting job 6,CPU time has been restored to 18589.232631.
Other projects call them tasks or dockings or attempts, for example, but if you switch on the <checkpoint_debug> log flag, the client event log will print them uniformly as 'Checkpoint' with a timestamp. For most sciences you can influence the checkpoint frequency, but never to less than the shortest interval at which the application generates one. In the case of CEP2, the science does not even allow this. It only checkpoints when a job step is complete, fully ignoring any 'write to disk at most' setting in the client, for good reason. [Edit 1 times, last edit by Former Member at Jul 4, 2015 1:52:46 PM] |
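As a sketch of how that log flag is usually switched on: the BOINC client reads log flags from a `cc_config.xml` file in its data directory, inside a `<log_flags>` section. The snippet below only builds the XML text; the helper name `build_cc_config` is made up, and where the file lives and how the client is told to re-read it depend on your installation, so treat those steps as assumptions to check against the BOINC documentation.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build cc_config.xml content with the given log flags (name -> bool)."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name, enabled in flags.items():
        # BOINC log flags are written as 1 (on) or 0 (off).
        ET.SubElement(log_flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Prints the XML on one line; paste it into cc_config.xml by hand.
    print(build_cc_config({"checkpoint_debug": True}))
```

If you already have a `cc_config.xml`, merge the `<checkpoint_debug>` line into its existing `<log_flags>` section rather than overwriting the file.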
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello Rob,
Thank you very much for your effort. The fact that CEP2 does not want to change its checkpoints can be accepted, but then a different way has to be found for the users who have their machines on only five minutes a day. Do you know who sets the time limit for the workunits? Would it be possible to give one series of workunits longer limits than the others and have this series (call it a batch or whatever) sent out to this group of "five-minute users"? Do you know what percentage of workunits is returned "too late"? Thank you again. Martin Schnellinger |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I already said that 'cut to device capability' does not exist, let alone pre-determining the duration of batches and then reserving them for a subset of devices. I do not know the rates of 'too late' and other failures, but considering that a very old statistic said 96% came back in under 96 hours, you can derive from there how many of the remaining 4% ever reach a too-late state [we call them 'No Reply']. I'd be surprised if that were even 1%.
We just had a project run out that had a 10-day deadline [FA@H], which ran at about 700,000 results daily in its last weeks. The number of results completed after the 10-day deadline was 15,000 [~2%], and most of those were probably repairs rather than jobs resulting from No Reply. For volunteer computing, an insignificant number IMO. This is back-of-envelope; the techs might be able to give more exact statistics. |
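The back-of-envelope figures can be reproduced in a few lines. This sketch takes the two numbers from the post at face value and compares the 15,000 late results with one day's volume of 700,000, which appears to be how the ~2% was obtained; whether the 15,000 is a daily or a total count is not stated, so that reading is an assumption. The function name `late_percentage` is hypothetical.

```python
def late_percentage(late_results, reference_results):
    """Late results as a percentage of a reference result count."""
    return 100.0 * late_results / reference_results


if __name__ == "__main__":
    # Figures quoted for FA@H in the post above.
    print(round(late_percentage(15_000, 700_000), 1))  # 2.1

    # Upper bound from the older statistic: if 96% return within 96 hours,
    # at most the remaining 4% can end up as 'No Reply'.
    print(100 - 96)  # 4
```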
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello Rob,
Well, the number of "Too Late" and "No Reply" results is not very high. But, in my opinion, you have to take into account that the machines running BOINC are, in the end, controlled by humans, and those humans check the efficiency of their undertakings. What I want to say is that many "five minutes a day" users very likely give up running BOINC completely when they find out that their machine always returns results too late. I am quite sure that if those "5 minutes a day" users simply went on running BOINC, not caring that all their results are too late, the overall percentage of workunits returned to WCG "Too Late" would be much higher. It is a pity that users leaving BOINC do not leave a message in this forum: "Hey, I stopped running BOINC because I use my machine only five minutes a day and I was frustrated because all results were 'too late'." How many members registered at WCG have been inactive for more than half a year? This number is of significance too, isn't it? Well, for my part, I will go on crunching; I am just thinking hard about the "5 minutes a day" computer users. Thanks again for your attention. Greetings Martin Schnellinger |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello everyone,
I think I found a way to solve the problem of computer users who have their machine running only 5 minutes a day and/or have old and weak machines, without much programming work.
1. Take one batch of workunits from existing projects.
2. Give them a very long time limit (one year).
3. Use this batch as a new, separate project (named, for example, "Mapping Cancer Markers for users with old computers").
4. Simply put this project in the place of the expiring Fight AIDS at Home project.
I am not naive enough to think that this is feasible without any work, but logically it should not require too much programming. The website is already prepared for this project. Just replace the name Fight AIDS at Home with: Mapping Cancer Markers for users with weak computers. It should not be too difficult, should it? Just take the occasion of the finish of Fight AIDS at Home! I guarantee you that you will have many additional members of World Community Grid. That is all, folks. I will continue crunching as long as I can! World Community Grid is more important than Twitter and Facebook together!! I wish it were better known. I spread the word, I assure you. But it is hard. People are afraid of viruses, unfortunately. Hackers and virus programmers out there: you do cause harm, harm to such an ingenious and beneficial invention as World Community Grid. Stop it! Offer your programming skills to World Community Grid instead!!! Sorry, but this had to be said. Bye, keep on crunching forever. MS |
||
|
KLiK
Master Cruncher Croatia Joined: Nov 13, 2006 Post Count: 3108 Status: Offline |
1. BOINC has a pause time on boot (or start_delay), which is defined in some BOINCs...so until an OS completely boots, BOINC is not asking for more memory load & pagefile writing...so "5min machines" r not an option for WCG or BOINC!
2. even if u have an i7-4790K, which has 5,5 GFLOPs per core (data from here: https://setiathome.berkeley.edu/cpu_list.php )...& u load BOINC in under 30s (which is not likely unless u have an SSD in those computers)...that is about 7,5 Cobblestones or BOINC credits (a little more on WCG)...no guarantee is made that it will reach a checkpoint! CEP2 has several checkpoints, at several %...the reason why it has them is 'cause the tasks DO TAKE SO LONG to calc... other projects have tried to implement similar things...recently we've done some BETAs with it... & in some research it can't be done in the middle of a "binding"...so only after a binding is done 4 a whole molecule (that doesn't mean it's 100%, it might be 4 example 15,71% or 60,43% of the WU), a checkpoint is made! though we welcome ur thoughts about donating CPU power..."5min devices" r not what BOINC is made 4...it was made 4 running computer power @ home (1-3h), office (~8h), school/uni (~12h) or 24/7... btw, why do u have 5min per day run time? [Edit 1 times, last edit by KLiK at Jul 6, 2015 11:13:09 AM] |
||
|