Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Error handling
Most errors that a WU may encounter are likely to be transient (e.g., running out of resources or a spurious hardware error). Even if the problem won't go away by itself, the user can usually fix it easily (e.g., by adding an extra swap file or removing unneeded files from the hard disk). However, BOINC aborts the WU if any of those things happen and then starts another WU instead. This is about the worst, most inefficient action possible. Is there any way to set it up so that it simply goes back to the previous checkpoint instead?
Very few of my WUs have ever aborted, but every one of them, apart from the HCC crashes, would have completed successfully if it had just been run again. Being set up for retries would definitely be much more efficient.
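To illustrate the behaviour I am asking for, here is a rough Python sketch of "roll back to the last checkpoint and retry" logic. This is not BOINC code and does not use any BOINC API. The names `TransientError`, `run_with_retries`, `checkpoint_every` and `max_retries` are all made up for the example, and a real client would also need to decide which errors count as transient and how long to wait before retrying.

```python
class TransientError(Exception):
    """Hypothetical error type: a failure that may clear up on a retry."""


def run_with_retries(step, total_steps, checkpoint_every=10, max_retries=3):
    """Run step(i) for i in 0..total_steps-1.

    On a TransientError, go back to the last checkpoint instead of aborting.
    Returns the number of rollbacks that were needed.
    """
    checkpoint = 0   # index of the first step not yet covered by a checkpoint
    retries = 0      # failed attempts since the last successful checkpoint
    rollbacks = 0    # total rollbacks over the whole run
    i = 0
    while i < total_steps:
        try:
            step(i)
        except TransientError:
            retries += 1
            if retries > max_retries:
                # Only now give up, rather than on the very first error.
                raise
            rollbacks += 1
            i = checkpoint  # redo the work done since the last checkpoint
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = i  # progress saved; later failures resume from here
            retries = 0
    return rollbacks
```

With something like this, a single spurious failure only costs the work done since the last checkpoint, and the WU is abandoned only after several failures in a row.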