World Community Grid - View Thread - Std::bad_alloc Error [Resolved: Not enough physical RAM]

Category: Active Research

Forum: Africa Rainfall Project

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 806

Std::bad_alloc Error [Resolved: Not enough physical RAM]

All but one of my ARP1 work units have been successful. Only one error so far, on a Debian 10 device. What does this error mean?

Result Name: ARP1_0002106_000_0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63) </message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[03:03:43] INFO: Checkpoint taken at 2018-07-01_06:00:00
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
Stack trace (30 frames):
[0x2d13b72]
[0x2da0400]
[0x2da02cb]
[0x2ebbc88]
[0x2d82385]
[0x2d35a96]
[0x2d35ac3]
[0x2d35153]
[0x2d34c3d]
[0x2d7c869]
[0x2d7ca26]
[0x2d7cfee]
[0x2d312fd]
[0x2d31a98]
[0x2d31cf6]
[0x2d30444]
[0x2d2d57e]
[0x2d2c4a9]
[0x43b9c3]
[0x13066b5]
[0x12ff156]
[0x584823]
[0x584ece]
[0x584ece]
[0x448f61]
[0x4475c9]
[0x440967]
[0x2eb2344]
[0x2eb25c1]
[0x405466]

Exiting...

</stderr_txt>
]]>

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 4 times, last edit by hchc at Feb 6, 2020 7:36:38 PM]

[Oct 30, 2019 11:31:41 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7687

Re: Std::bad_alloc Error

From the BOINC Wiki:
Process exited with code 193 (0xc1, -63)

Process exited with code 193 is a segmentation violation error.

You either have problems with your memory or swap file, or the application attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).

Report this problem on the project forums of the application you have the problem with, as it may well be an error in the application's code. When you have multiple of these in a row, completely exiting BOINC and restarting it will most times fix this problem.

Original writer   Original FAQ   Date
Jorden            238            26-08-2007
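
For reference, the three numbers in "process exited with code 193 (0xc1, -63)" are the same value written three ways: decimal, hexadecimal, and the low byte read as a signed 8-bit integer. A minimal Python sketch of that arithmetic follows; the helper name decode_exit_code is made up for illustration and is not part of BOINC.

```python
def decode_exit_code(code: int) -> tuple:
    """Return the hex form and the signed 8-bit reading of an exit code.

    Hypothetical helper for illustration only; not a BOINC function.
    """
    low_byte = code & 0xFF                      # keep only the lowest 8 bits
    # Values 128..255 wrap around to -128..-1 when read as a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return hex(low_byte), signed


print(decode_exit_code(193))  # ('0xc1', -63)
```

This only shows how the numbers relate to each other. It says nothing about the cause, which has to be read from the stderr output of the task.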

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Oct 31, 2019 12:53:39 AM]

[Oct 31, 2019 12:52:27 AM]

hchc

Re: Std::bad_alloc Error

This machine is a dual core Pentium E5800 running off a 16 GB USB flash drive. I think the computer only has USB 2.0, while the drive is USB 3.0.

Both of the work units were at the 99% mark, doing their final compression and checkpointing, and both errored out at the last second again. They were both at the ~36 hour mark and pretty much complete!

Result Name: ARP1_0006901_001_0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[13:50:49] INFO: Checkpoint taken at 2018-07-03_06:00:00
[18:58:13] INFO: Checkpoint taken at 2018-07-03_12:00:00
[00:21:32] INFO: Checkpoint taken at 2018-07-03_18:00:00
[04:51:55] INFO: Checkpoint taken at 2018-07-04_00:00:00
[09:07:18] INFO: Checkpoint taken at 2018-07-04_06:00:00
[14:28:07] INFO: Checkpoint taken at 2018-07-04_12:00:00
[20:08:58] INFO: Checkpoint taken at 2018-07-04_18:00:00
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
Stack trace (25 frames):
[0x2d13b72]
[0x2da0400]
[0x2da02cb]
[0x2ebbc88]
[0x2d82385]
[0x2d35a96]
[0x2d35ac3]
[0x2d35153]
[0x2d34c3d]
[0x2d7c869]
[0x2d31281]
[0x2d31a98]
[0x2d31cf6]
[0x2d30444]
[0x2d2d57e]
[0x2d2c4a9]
[0x43b9c3]
[0x1304efe]
[0x5856be]
[0x448f61]
[0x4475c9]
[0x440967]
[0x2eb2344]
[0x2eb25c1]
[0x405466]

Exiting...

</stderr_txt>
]]>

Looking at the Event Log:

12/19/2019 5:47:47 AM | World Community Grid | Computation for task ARP1_0006901_001_0 finished
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_0 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_1 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_2 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_3 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_4 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:47 AM | World Community Grid | Output file ARP1_0006901_001_0_r1905503062_5 for task ARP1_0006901_001_0 absent
12/19/2019 5:47:49 AM | World Community Grid | Computation for task ARP1_0008519_001_1 finished
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_0 for task ARP1_0008519_001_1 absent
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_1 for task ARP1_0008519_001_1 absent
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_2 for task ARP1_0008519_001_1 absent
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_3 for task ARP1_0008519_001_1 absent
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_4 for task ARP1_0008519_001_1 absent
12/19/2019 5:47:49 AM | World Community Grid | Output file ARP1_0008519_001_1_r330219911_5 for task ARP1_0008519_001_1 absent

I think that with both ARP1 work units doing their final compression of output files at the same time, there is a LOT of disk activity, and since this USB 3.0 flash drive runs at USB 2.0 speed, it's even slower. My guess is that BOINC loses patience with the disk I/O and decides the files no longer exist, when the disk is simply busy.

Sucks that I lost two complete work units at 36 hours each. I could potentially put an SSD or even an older spinning HDD into this computer, but it's not worth either the cost or effort. I dunno, as most ARP1 tasks complete without error.

----------------------------------------
[Edit 2 times, last edit by hchc at Dec 20, 2019 3:50:18 AM]

[Dec 19, 2019 8:42:26 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12426

Re: Std::bad_alloc Error

Just a thought, but these 2 were finishing at almost the same time. Did your previous successful units finish at the same time as others, or individually?

ARP1 has large files, so it could be that doing 2 at the same time gave your machine indigestion, if you get my drift.

Mike

[Dec 20, 2019 1:50:38 AM]

Sgt.Joe

Re: Std::bad_alloc Error

Just a thought here too. It may just be a coincidence that the units finished in close proximity to each other, but I don't really know. What I do know is that I have two 24-thread systems and an 8-thread system all running off 16 GB flash drives. I limit each of the systems to 4 ARP units, but I have never seen more than 3 on any one system at a time. They have all functioned so far without any errors on the ARP units. They are all USB 2 units. They may eventually wear out, but they have been going a couple of years already. I suspect it is not the hardware, though it certainly could be if the USB drives are failing.
Cheers

[Dec 20, 2019 2:27:01 AM]

hchc

Re: Std::bad_alloc Error

Mike.Gibson said:

They normally are pretty staggered. I've never really had both be at 99% at the same time and most of the time they checkpoint at different times too.

Sgt.Joe said:

How often do you reboot? I'm thinking that might be worth doing according to your first reply in this thread, but I don't know if that really will fix the issue. Interesting that your same-size flash drives running over USB 2 don't have this problem. I wonder if there's a way to test the health of a flash drive, kind of like the SMART test that hard drives have?

Edited to Add: I watched one of the two work units checkpointing at the 62.5% point, and the LED on the flash drive was flashing for a good 3 minutes or so. I think (as Mike Gibson says) the computer gets "indigestion" and then BOINC gets impatient when the disk doesn't respond immediately and thinks files have gone missing.

----------------------------------------
[Edit 2 times, last edit by hchc at Dec 20, 2019 4:49:54 AM]

[Dec 20, 2019 4:08:45 AM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2170

Re: Std::bad_alloc Error

This 'segmentation violation' reminds me of the "Segmentation violation" thread. When the error occurs, the ARP1 program exits immediately, so the compression step stops at that point as well. The partial compression results are unusable and considered 'lost', so the final output files can no longer be written and will be empty.

Here is what you will normally see in the program's slot directory in the file 'stderr.txt':

Starting WRFMain
[01:41:46] INFO: Checkpoint taken at 2018-07-03_06:00:00
[04:08:20] INFO: Checkpoint taken at 2018-07-03_12:00:00
[06:35:09] INFO: Checkpoint taken at 2018-07-03_18:00:00
[08:20:03] INFO: Checkpoint taken at 2018-07-04_00:00:00
[10:22:31] INFO: Checkpoint taken at 2018-07-04_06:00:00
[12:56:38] INFO: Checkpoint taken at 2018-07-04_12:00:00
[15:14:59] INFO: Checkpoint taken at 2018-07-04_18:00:00
[16:57:53] INFO: Checkpoint taken at 2018-07-05_00:00:00
INFO: Simulation complete compressing output.
16:59:49 (24943): called boinc_finish(0)

I think this is what happens normally, in this order:
1. ARP1 algorithm runs
2. ARP1 algorithm finishes (after 8 checkpoints: 'Simulation complete')
3. ARP1's compression algorithm starts
4. ARP1's compression algorithm finishes
5. ARP1 output files are written
6. ARP1 result files are uploaded

If anything goes wrong during one of these steps (the first three are especially crucial), there is nothing to upload except empty output files.
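
The difference between a failed log and a normal one can be checked mechanically. The sketch below is a rough illustration, not a WCG or BOINC tool: it assumes stderr.txt uses the line formats quoted in this thread, and the function name summarize_stderr is hypothetical.

```python
import re

# Matches lines such as "[16:57:53] INFO: Checkpoint taken at 2018-07-05_00:00:00"
CHECKPOINT_RE = re.compile(r"INFO: Checkpoint taken at (\S+)")


def summarize_stderr(text: str) -> dict:
    """Summarize an ARP1 stderr.txt (hypothetical helper, for illustration)."""
    checkpoints = CHECKPOINT_RE.findall(text)
    return {
        "checkpoints": len(checkpoints),
        "last_checkpoint": checkpoints[-1] if checkpoints else None,
        # The crash reported in this thread prints this exception name.
        "out_of_memory": "std::bad_alloc" in text,
        # A normal run ends with this line, as in the log quoted above.
        "finished_cleanly": "called boinc_finish(0)" in text,
    }


# Shortened excerpt of the failed task's log from earlier in the thread.
failed_log = """Starting WRFMain
[20:08:58] INFO: Checkpoint taken at 2018-07-04_18:00:00
terminate called after throwing an instance of 'std::bad_alloc'
"""
print(summarize_stderr(failed_log))
```

A task that shows fewer than 8 checkpoints together with out_of_memory set never reached the 'Simulation complete' step, which fits the empty output files in the Event Log.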

ARP1 uses a lot of memory. After the error occurred twice on my machine, I decided to install more memory (RAM), effectively doubling the amount. Last night I had 8 ARP1s running at the same time on my machine without a hiccup. (Received 5 ARP1s yesterday at the same time (06:58:29), another one at 13:19:58 and two more at 21:24:48. While the first 5 of them have already finished, the last three are still running now.)

Result Name          Status                 Sent Time     Due / Return Time  CPUh/Spent Claimed/Granted
ARP1_0016068_001_0-- Pending Validation 12/19/19 05:58:29 12/20/19 05:49:55 23.16/23.76   971.2/0.0
ARP1_0010317_001_1-- Pending Validation 12/19/19 05:58:29 12/20/19 05:47:05 23.18/23.78   972.1/0.0
ARP1_0015077_001_1-- Pending Validation 12/19/19 05:58:29 12/20/19 04:53:18 22.30/22.85   934.2/0.0
ARP1_0015073_001_0-- Pending Validation 12/19/19 05:58:29 12/20/19 04:39:43 21.96/22.50   991.8/0.0
ARP1_0008207_001_0-- Valid              12/19/19 05:58:29 12/20/19 04:37:40 21.98/22.52 1,008.6/950.0

[Generated by wcgformat]

[Dec 20, 2019 10:15:52 AM]

Mike.Gibson

Re: Std::bad_alloc Error

The main problem with ARP1 is the file size. Capacity problems may occur when more than one task tries to checkpoint or report at the same time. Checkpointing should be OK, as our machines usually just take longer, but it could well be that WCG gets too impatient.

A little insight from Keith might help here!

However, I would suggest suspending a unit for a few minutes if it is getting too close to the one in front. To use another analogy, don't tailgate - put your foot on the brake of the one behind!

Mike

[Dec 20, 2019 11:37:29 AM]

Sgt.Joe

Re: Std::bad_alloc Error

hchc said:

How often do you reboot?

Almost never. These machines are running Linux. The only time they get rebooted is when I have a power outage, which is not often.
Cheers

[Dec 20, 2019 8:53:26 PM]

CurtisNewton
Cruncher
Joined: Feb 24, 2008
Post Count: 25

Re: Std::bad_alloc Error

std::bad_alloc is C++'s way of saying "out of memory" and is typically thrown when a dynamic memory allocation on the heap fails. It can mean either that the process has run out of virtual address space (e.g. the 2/4 GB limit on 32-bit systems, or a fragmented memory layout) or that the physical memory limit (RAM / swap) has been reached.
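
On Linux, one rough way to see how close a machine is to the physical limit (RAM plus swap) is to look at /proc/meminfo. The following Python sketch is only an illustration under stated assumptions: the helper names are hypothetical, the sample numbers are invented, and the per-task memory figure is a placeholder that has to be measured on the machine, since this thread does not state how much RAM one ARP1 task needs.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def tasks_that_fit(meminfo: dict, per_task_kb: int) -> int:
    """Estimate how many tasks of the assumed size fit in available RAM plus free swap."""
    available_kb = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return available_kb // per_task_kb


# Invented sample; on a real machine, read the text from /proc/meminfo instead.
sample = """MemTotal:        2048000 kB
MemAvailable:     512000 kB
SwapFree:         256000 kB
"""
PER_TASK_KB = 300_000  # placeholder value, NOT a figure from this thread
print(tasks_that_fit(parse_meminfo(sample), PER_TASK_KB))  # 2
```

Counting free swap as usable memory is optimistic, because a task that lives mostly in swap will run very slowly, so treat the result as an upper bound rather than a recommendation.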

----------------------------------------
[Edit 1 times, last edit by princah5 at Dec 22, 2019 7:40:20 PM]

[Dec 22, 2019 7:39:00 PM]
