My firefox process is crashing regularly and leaving core dumps. I think it has something to do with an interaction between a few of my plugins and extensions, but that's only a theory so far.
I didn't know where the corefiles went and I noticed core_pattern was a pipe!
% cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e
So systemd is taking all cores instead of just the daemons/procs it runs. This is ... definitely a bit controversial for me, but I'll worry about that later. I've found the systemd-coredumpctl util, and I can find the firefox core files inside of it:
% systemd-coredumpctl
TIME PID UID GID SIG EXE
[... snipped some lines ...]
Sun 2014-04-20 23:15:33 PDT 21858 1000 100 11 /usr/lib/firefox/firefox
Thu 2014-04-24 21:55:17 PDT 10059 1000 100 11 /usr/lib/firefox/firefox
Mon 2014-04-28 16:17:37 PDT 25162 1000 100 11 /usr/lib/firefox/firefox
Tue 2014-04-29 18:14:13 PDT 5607 1000 100 11 /usr/lib/firefox/firefox
Wed 2014-04-30 13:22:20 PDT 30645 1000 100 11 /usr/lib/firefox/firefox
Ok sweet, so it's there, let's try and use it..
% systemd-coredumpctl gdb
TIME PID UID GID SIG EXE
Wed 2014-04-30 13:22:20 PDT 30645 1000 100 11 /usr/lib/firefox/firefox
Failed to retrieve COREDUMP field: No such file or directory
This *works* for corefiles other than firefox. Just call systemd-coredumpctl gdb blah and it brings up the proper gdb session. Not so for any firefox core. My next thought was to extract the core file, in case the firefox binary was actually a shell script wrapper or something and gdb was being pointed at the wrong object.
% systemd-coredumpctl dump
TIME PID UID GID SIG EXE
Wed 2014-04-30 13:22:20 PDT 30645 1000 100 11 /usr/lib/firefox/firefox
Refusing to dump core to tty
% systemd-coredumpctl dump > ~/firefox.core
TIME PID UID GID SIG EXE
Wed 2014-04-30 13:22:20 PDT 30645 1000 100 11 /usr/lib/firefox/firefox
Failed to retrieve COREDUMP field: No such file or directory
Ok, so now I'm getting upset - let's look at the systemd-coredumpctl source code.
From: https://github.com/systemd/systemd/blob … ctl.c#L402
r = sd_journal_get_data(j, "COREDUMP", (const void**) &data, &len);
if (r < 0) {
        log_error("Failed to retrieve COREDUMP field: %s", strerror(-r));
        return r;
}
Ok, so getting the data out of the journal is failing with ENOENT, it seems. Let's look at sd_journal_get_data:
https://github.com/systemd/systemd/blob … al.c#L1956
_public_ int sd_journal_get_data(sd_journal *j, const char *field, const void **data, size_t *size) {
        JournalFile *f;
        uint64_t i, n;
        size_t field_length;
        int r;
        Object *o;

        assert_return(j, -EINVAL);
        assert_return(!journal_pid_changed(j), -ECHILD);
        assert_return(field, -EINVAL);
        assert_return(data, -EINVAL);
        assert_return(size, -EINVAL);
        assert_return(field_is_valid(field), -EINVAL);

        f = j->current_file;
        if (!f)
                return -EADDRNOTAVAIL;

        if (f->current_offset <= 0)
                return -EADDRNOTAVAIL;

        r = journal_file_move_to_object(f, OBJECT_ENTRY, f->current_offset, &o);
        if (r < 0)
                return r;

        field_length = strlen(field);

        n = journal_file_entry_n_items(o);
        for (i = 0; i < n; i++) {
                uint64_t p, l;
                le64_t le_hash;
                size_t t;

                p = le64toh(o->entry.items[i].object_offset);
                le_hash = o->entry.items[i].hash;
                r = journal_file_move_to_object(f, OBJECT_DATA, p, &o);
                if (r < 0)
                        return r;

                if (le_hash != o->data.hash)
                        return -EBADMSG;

                l = le64toh(o->object.size) - offsetof(Object, data.payload);

                if (o->object.flags & OBJECT_COMPRESSED) {
#ifdef HAVE_XZ
                        if (uncompress_startswith(o->data.payload, l,
                                                  &f->compress_buffer, &f->compress_buffer_size,
                                                  field, field_length, '=')) {
                                uint64_t rsize;

                                if (!uncompress_blob(o->data.payload, l,
                                                     &f->compress_buffer, &f->compress_buffer_size, &rsize,
                                                     j->data_threshold))
                                        return -EBADMSG;

                                *data = f->compress_buffer;
                                *size = (size_t) rsize;

                                return 0;
                        }
#else
                        return -EPROTONOSUPPORT;
#endif
                } else if (l >= field_length+1 &&
                           memcmp(o->data.payload, field, field_length) == 0 &&
                           o->data.payload[field_length] == '=') {

                        t = (size_t) l;

                        if ((uint64_t) t != l)
                                return -E2BIG;

                        *data = o->data.payload;
                        *size = t;

                        return 0;
                }

                r = journal_file_move_to_object(f, OBJECT_ENTRY, f->current_offset, &o);
                if (r < 0)
                        return r;
        }

        return -ENOENT;
}
Ok - now I'm officially lost. I'm way too inexperienced with the systemd source code to make decent progress down this route.
Either journal_file_entry_n_items returned 0 items (which would be confusing, since the entry does show up in the systemd-coredumpctl listing), or one of these calls to journal_file_move_to_object is returning -ENOENT. I can't find anywhere in its call graph where that actually happens, so I'm going to assume journal_file_entry_n_items returned 0.
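One way to check what actually made it into the journal is to bypass coredumpctl and dump the entry's fields directly. COREDUMP_PID is one of the metadata fields systemd-coredump attaches to each crash entry, so matching on it should find the entry (using the firefox PID from the listing above):

```shell
# Print every field of the journal entry for the crash of PID 30645.
# If COREDUMP_PID, COREDUMP_EXE etc. show up but there is no COREDUMP
# field, the core payload itself was never stored in the journal.
journalctl -o verbose COREDUMP_PID=30645
```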
What does this mean? How do I fix this?
As an aside - I'd like to debug my larger firefox issue without changing how corefiles are handled in a default arch install, as that seems a bit much.. but if anyone knows how to disable the systemd corefile handling on any process not launched by systemd, but keep using it for daemons (I can totally see the need for better corefile handling with these auto-started processes) please let me know!
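(The bluntest workaround I know of would be to reset core_pattern globally - a sketch, though it changes the default for daemons too, since core_pattern is a single system-wide setting:)

```shell
# Revert core_pattern to the kernel default: write a plain file named
# "core" in the crashing process's cwd. Needs root; lost on reboot.
# To persist it, put kernel.core_pattern=core in a file under /etc/sysctl.d/.
sysctl -w kernel.core_pattern=core
```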
Last edited by codemac (2014-04-30 23:38:30)
Offline
I want to mention that I *still* haven't fixed this!
I'm at a loss as to what to do, and this bug is still present in systemd-coredumpctl. Would any Arch devs like to clarify why we're using this broken piece of systemd for the time being? Or am I overreacting, and I'm the only one who's ever seen this?
If I fix this I will post back here, but it's looking pretty unlikely.
Offline
Exact same problem here with qemu.
[root@snuggles alexis]# systemd-coredumpctl gdb 2366
TIME PID UID GID SIG EXE
Wed 2014-06-04 21:57:01 MDT 2366 0 0 11 /usr/bin/qemu-system-x86_64
Failed to retrieve COREDUMP field: No such file or directory
Offline
Well, here's another "me too." With chrome this time:
bspar@bspararch:/x/src/ > out/Release/chrome --user-data-dir=/home/bspar/Documents/tmp --no-sandbox | tools/valgrind/asan/asan_symbolize.py| c++filt
...
zsh: abort (core dumped) out/Release/chrome --user-data-dir=/home/bspar/Documents/tmp |
zsh: done                 tools/valgrind/asan/asan_symbolize.py | c++filt
bspar@bspararch:/x/src/ > sudo systemd-coredumpctl gdb 6061
TIME PID UID GID SIG EXE
Fri 2014-06-06 00:15:38 EDT 6061 1000 1000 6 /x/src/out/Release/chrome
Failed to retrieve COREDUMP field: No such file or directory
Offline
I am not an expert, and have only played with core dumps from crashware that I write. It seems the three programs mentioned here probably have really big core files. Perhaps the system does not keep really big cores by default?
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
Yep, looks like that's the problem. journalctl:
Jun 06 01:11:25 bspararch systemd-coredump[10820]: Core too large, core will not be stored.
Jun 06 01:11:25 bspararch systemd-coredump[10820]: Process 10814 (chrome) dumped core.
I don't know how I missed that before... Thanks for the tip
Now, how would I go about fixing that? I've done a little research on my own, but I'll have to look into it more tomorrow - it's getting late over here. I'm not too familiar with ulimit, but I'm still having problems even with this configuration:
bspar@bspararch:/x/BITS/src/ > ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) unlimited
-m: resident set size (kbytes) unlimited
-u: processes 2000
-n: file descriptors 4096
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 94126
-q: bytes in POSIX msg queues 819200
-e: max nice 20
-r: max rt priority 0
-N 15: unlimited
Offline
Looks like systemd is overriding the login ulimit settings - which is completely unexpected. (Then again, core(5) says that when core_pattern is a pipe the kernel doesn't enforce RLIMIT_CORE on the piped dump at all, so the limit must be coming from systemd-coredump itself.)
Looking at more systemd-coredump source..
Offline
In coredump.c:
/* Make sure to not make this larger than the maximum journal entry
* size. See ENTRY_SIZE_MAX in journald-native.c. */
#define COREDUMP_MAX (767*1024*1024u)
assert_cc(COREDUMP_MAX <= ENTRY_SIZE_MAX);
FUCKING FUCK ASS.
Ok. Looks like they don't store anything bigger than 767 MiB. It's a compiled-in value; there's no way to change it. So now I have to make sure no deployed systems are relying on this fucking coredumpctl bullshit.. because, nerd-alert: most Linux systems running today have more than one process at a time with that much mapped memory.
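For scale, the cap works out to just under 768 MiB:

```shell
# COREDUMP_MAX = 767*1024*1024 bytes; systemd-coredump silently refuses
# to store any core bigger than this in the journal (that's the
# "Core too large, core will not be stored." line above).
echo $((767 * 1024 * 1024))    # prints 804257792
```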
Offline
https://bugs.archlinux.org/task/40737
I waited a day and calmed down, and filed this bug.
Offline
codemac wrote:
    https://bugs.archlinux.org/task/40737
    I waited a day and calmed down, and filed this bug.
Thank you. On both counts.
I'm glad I nudged you in the correct direction. Well done on researching the code.
Offline
Thanks codemac and ewaller!
Offline
As a heads up, this is fixed in systemd-214-2.
cat'ing /proc/sys/kernel/core_pattern should just show the word 'core' again.
Offline
Kudos and cookies and thumbs up for everyone involved here!
Makes me a proud Arch'er, with good practice examples like this one
. Main: Intel Core i5 6600k @ 4.4 Ghz, 16 GB DDR4 XMP, Gefore GTX 970 (Gainward Phantom) - Arch Linux 64-Bit
. Server: Intel Core i5 2500k @ 3.9 Ghz, 8 GB DDR2-XMP RAM @ 1600 Mhz, Geforce GTX 570 (Gainward Phantom) - Arch Linux 64-Bit
. Body: Estrogen @ 90%, Testestorone @ 10% (Not scientific just out-of-my-guesstimate-brain)
Offline