
Unable to start new processes



Hello all,

You'll have to bear with me here; these are remote machines I don't
have physical access to, but a co-worker does so I'll be relaying most
requests for info or commands to be run through him when I can't
access them remotely.

I have fully-updated 10.04 installed on 12 different machines, all
Intel D945GCLF2's with 1GB of RAM and Micron eUSB 4GB flash drives
formatted ext3. After installing I would periodically run into
problems where the devices would stop allowing me to SSH in.

From my machine the problem manifests itself as an inability to
request much in the way of data from the remote machine, for instance,
when I SSH in (ssh -v) it opens a connection, attempts to negotiate a
session (I get a response from the remote machine), but then promptly
closes the connection remotely before I get prompted for a password.
Likewise for the running instance of Tomcat, I'll connect to the http
port, it will accept my connection, but before I get anything back it
closes the connection on me. I can ping the remote machine, it shows
ports as open, I just can't seem to get any data.

From what I understand on the remote side of things, you can no longer
get any useful information from the machine. Every command the user
types in returns immediately to a new bash prompt and as a result,
issuing the 'reboot' command does nothing, as if it can no longer
start any new processes. This is making troubleshooting *extremely*
difficult as I can't figure out a way to get anything from the machine
while it's on. From what I can tell though, processes that are already
running remain running, for instance if I get someone to power-cycle
the machine, I can log in and initiate a VPN connection back to my
local machine, this remains active even if I can no longer SSH in to
the remote machine. Another example: SNMP traps continue to be sent
periodically from the remote machine to my desktop (containing the system
uptime and sensor data from a serial-connected device).

The problem manifests itself periodically, I can't seem to establish a
pattern to it, but it happens to all of them eventually. If I don't
get someone to power-cycle the downed units, inside of two or three
days the majority (if not all) are unresponsive. I really have no idea
what the problem is here, nor how I might go about troubleshooting it
so I'm open to suggestions as to how to proceed.

Thanks in advance,
Chris


Chris MacDonald Mon, 23 Aug 2010 20:32:11 -0700

It sounds to me that you may be running out of memory, although I'd
expect to see an error message from the shell saying it could not fork
or exec.

If some process does have a memory leak, you may be able to track it by
regularly getting a full process listing (say twice a day) and watching
the VSS and RSS of your processes. Also run free(1) periodically and see
if available memory is falling, and if the problem starts to occur shortly
after memory runs low.
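
A minimal sketch of the periodic process listing suggested above, assuming
a Linux /proc filesystem. The function names parse_status and snapshot are
made up for this illustration; VmSize and VmRSS are the per-process fields
the kernel reports in /proc/<pid>/status.

```python
import os


def parse_status(text):
    """Extract name, VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def kb(key):
        # Kernel threads have no VmSize/VmRSS lines; treat them as 0.
        return int(fields.get(key, "0 kB").split()[0])

    return fields.get("Name", "?"), kb("VmSize"), kb("VmRSS")


def snapshot(proc_root="/proc"):
    """Return a list of (rss_kb, vsz_kb, pid, name), largest RSS first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, vsz, rss = parse_status(handle.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if rss > 0:
            rows.append((rss, vsz, int(entry), name))
    return sorted(rows, reverse=True)
```

Calling snapshot() from cron a couple of times a day and appending the top
few rows, with a timestamp, to a file on storage that is still writable
should be enough to see whether one process keeps growing.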


Cameron Hutchison Mon, 23 Aug 2010 22:14:43 -0700

I've added a peek at 'ps auxc' for apps with non-zero memory usage and
'free -m' to my watchdog script that loops through each remote host at
an interval of every two minutes or so, we'll see if anything
materializes. Thanks for the suggestion, I'll post back with results
shortly.

Chris


Chris MacDonald Mon, 23 Aug 2010 23:32:26 -0700

It appears you have a problem with ssh. Please give details on
how you have set up ssh. You should have zero problems using Tomcat on
the remote machine.

73 Karl

--

    Karl F. Larsen, AKA K5DI
    Linux User
    #450462    http://counter.li.org.
         Key ID = 3951B48D


Karl F. Larsen Tue, 24 Aug 2010 04:10:10 -0700

Another interesting point you mention is the directed nature of the
connections having problems (remote -> you okay but not the other way
around). Although ICMP echo requests are returning okay I'm wondering
if there is something messing around with TCP activity.

As an example I semi-recently had a box dying on a process with too
many files open. The network weenies had broken the ACLs in such a way
that any TCP packets with the FIN flag set were not coming back
properly and all network processes to the remote machine got stuck in
a FIN_WAIT state and never closed the connection until the denial of
service eventually occurred.
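
One way to spot sockets piling up in a FIN_WAIT state is to count
connections per TCP state. The sketch below assumes the Linux
/proc/net/tcp table format, where the fourth column is a hexadecimal state
code; count_tcp_states is a name invented for this example.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp (hexadecimal).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(table_text):
    """Count sockets per TCP state in text formatted like /proc/net/tcp."""
    counts = Counter()
    for line in table_text.splitlines()[1:]:  # first line is the header
        columns = line.split()
        if len(columns) < 4:
            continue
        # Unknown codes are kept as-is so nothing is silently dropped.
        counts[TCP_STATES.get(columns[3].upper(), columns[3])] += 1
    return counts
```

Feeding it the contents of /proc/net/tcp every few minutes and logging the
result would show whether FIN_WAIT1, FIN_WAIT2 or CLOSE_WAIT counts keep
climbing before a machine becomes unresponsive.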

It would be worth saving an exhaustive tcpdump on both ends of the
connection (just dump to file with no filtering) and examining it in
wireshark to check if the packets are being generated and reaching the
two sides etc.

James


James Hogarth Tue, 24 Aug 2010 04:53:30 -0700

Chris,

First let's take care of this.
Wrong, wrong, so wrong it's stupid.

It looks like you cannot spawn any new processes. This can happen
for a couple of main reasons. The first is the ulimits being
reached. A typical Ubuntu installation does not have any limits on the
amount of memory & processes a user can consume. You can check the
limits by executing "ulimit -a". With the information given this
sounds like a memory leak where the server is starved and any new
processes are being killed. One other possibility is breaching the max
amount of open files. You can use various tools to check these. My
favourite is nmon, you can also use sar for checking cpu usage stats.
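
For hosts where only a scripted check is practical, the limits most
relevant to failed forks can also be read with Python's standard resource
module. This is only a sketch: the selection of limits and the labels
(borrowed from the "ulimit -a" output) are choices made for this example.

```python
import resource

# Limits that commonly explain "cannot start new processes".
LIMITS = {
    "max user processes": resource.RLIMIT_NPROC,
    "open files": resource.RLIMIT_NOFILE,
    "virtual memory": resource.RLIMIT_AS,
}


def describe(value):
    """Render a limit value the way ulimit does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the calling process."""
    report = {}
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        report[label] = (describe(soft), describe(hard))
    return report
```

Limits are per process and inherited from the parent, so the numbers only
describe the account and session the script is run from. Run it as the
user that owns the affected services.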

The best action is figuring out what's running on your server and how
it behaves over time. Nmon's capacity planning will give you
the necessary overview, although you might like to collect more data.

One other thing to check is if your applications are consuming too
many ports! You might like to have a look at the
net.ipv4.ip_local_port_range configuration you have. This is usually
quite a wide range, so if you are running out, you have another
problem, such as your processes not closing their ports after use.

Reducing the amount of memory allocated to Tomcat might be a starting
point since that's the process most likely ballooning and leaking.
Also look for OOM killer in the message files.
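
A small sketch of the log check mentioned above. It assumes the kernel log
is available as text (on Ubuntu typically from /var/log/kern.log or
/var/log/syslog, which is an assumption about this setup). The exact
wording of OOM killer messages varies between kernel versions, so the
pattern is deliberately loose; find_oom_lines is an illustrative name.

```python
import re

# Loose match: kernels log "<comm> invoked oom-killer" and "Out of memory: ...".
OOM_PATTERN = re.compile(r"invoked oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the log lines that look like OOM killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]
```

An empty result after a failure would argue against memory exhaustion as
the cause, provided the log was still being written at the time.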
--
Hakan (m1fcj) -  http://www.hititgunesi.org

--
ubuntu-users mailing list
ubuntu-users*******
Modify settings or unsubscribe at:  https://lists.ubuntu.com/mailman/listinfo/ubuntu-users


Hakan Koseoglu Tue, 24 Aug 2010 05:02:44 -0700

Ok, sorry for the absolutely horrendous delay, but I've finally got a
few test machines set up to hopefully solve this.

See my original post for a description of the problem, it still
applies. I'm seeing this problem on multiple instances of the same
hardware. The machines are all D510MOs with 1GB of ram and a 4GB USB
flash drive that is host to Ubuntu, previously 10.04, but now 10.10
and the errors persist. I captured the error (pasted below) over the
serial port, but I'd seen it once before and it occurred at a
different sector. I've restarted the machine and I'm sure it will
crash again within a day or two. I'm also setting up another machine,
exact same hardware, I'll see if that fails too.

I'm starting to think this is a systemic hardware fault somewhere, but
if anyone knows their kernel debug-fu I'd be happy to give something a
try at my end to hopefully narrow the focus a bit.

[266929.048995] end_request: I/O error, dev sda, sector 776208
[266929.065740] Buffer I/O error on device sda1, logical block 96770
[266929.084033] Buffer I/O error on device sda1, logical block 96771
[266929.102321] Buffer I/O error on device sda1, logical block 96772
[266929.120692] end_request: I/O error, dev sda, sector 3490352
[266929.137686] Aborting journal on device sda1-8.
[266929.137744] EXT4-fs (sda1): ext4_da_writepages: jbd2_start: 8189
pages, ino 27933; err -30
[266929.137760] EXT4-fs (sda1): ext4_da_writepages: jbd2_start: 8168
pages, ino 27845; err -30
[266929.137770] EXT4-fs (sda1): ext4_da_writepages: jbd2_start: 8168
pages, ino 27951; err -30
[266929.137779] EXT4-fs (sda1): ext4_da_writepages: jbd2_start: 8168
pages, ino 27953; err -30
[266929.137787] EXT4-fs (sda1): ext4_da_writepages: jbd2_start: 8168
pages, ino 27956; err -30
[266929.276439] JBD2: I/O error detected when updating journal
superblock for sda1-8.
[266929.276542] EXT4-fs error (device sda1): ext4_journal_start_sb:
Detected aborted journal
[266929.276555] EXT4-fs (sda1): Remounting filesystem read-only
[266929.340709] journal commit I/O error
[266930.839068] EXT4-fs error (device sda1): ext4_find_entry: inode
#32023: (comm java) reading directory lblock 0
[266933.157820] sd 3:0:0:0: [sdb] Assuming drive cache: write through
[266933.176765] EXT4-fs error (device sda1): ext4_find_entry: inode
#312: (comm rsyslogd)
[266933.178520] sd 3:0:0:0: [sdb] Assuming drive cache: write through
[266933.222163] sd 3:0:0:0: [sdb] Assuming drive cache: write through
[266933.223117] EXT4-fs error (device sda1): ext4_find_entry: inode
#338: (comm udevd) reading directory lblock 0
[266971.032873] EXT4-fs error (device sda1): ext4_find_entry: inode
#32259: (comm postmaster) reading directory lblock 0
[266971.067016] EXT4-fs error (device sda1): ext4_find_entry: inode
#32259: (comm postmaster)
[266971.068947] EXT4-fs error (device sda1): ext4_find_entry: inode
#312: (comm rsyslogd) reading directory lblock 0
[266971.069077] EXT4-fs error (device sda1): ext4_find_entry: inode
#312: (comm rsyslogd) reading directory lblock 0
[266971.156148] EXT4-fs error (device sda1): ext4_find_entry: inode
#32018: (comm postmaster)
[266971.157334] EXT4-fs error (device sda1): ext4_find_entry: inode
#312: (comm rsyslogd) reading directory lblock 0
[266971.212704] EXT4-fs error (device sda1): ext4_find_entry: inode
#32259: (comm postmaster) reading directory lblock 0
[266973.229713] EXT4-fs error (device sda1): ext4_find_entry: inode
#2: (comm cron) reading directory lblock 0
[266973.259137] EXT4-fs error (device sda1): ext4_find_entry: inode
#6090: (comm cron) reading directory lblock 0
[267024.440346] EXT4-fs error (device sda1): ext4_find_entry: inode
#1611: (comm java) reading directory lblock 0
[267024.470643] EXT4-fs error (device sda1): ext4_find_entry: inode
#31906: (comm java) reading directory lblock 0
[267084.518307] EXT4-fs error (device sda1): ext4_find_entry: inode
#144807: (comm java) reading directory lblock 0
[267084.548999] EXT4-fs error (device sda1): ext4_find_entry: inode
#144802: (comm java) reading directory lblock 0
[267084.579665] EXT4-fs error (device sda1): ext4_find_entry: inode
#144801: (comm java) reading directory lblock 0
[267525.944731] EXT4-fs error (device sda1): ext4_find_entry: inode
#12: (comm ntpd) reading directory lblock 0
[270010.787208] EXT4-fs error (device sda1): ext4_find_entry: inode
#12: (comm ntpd) reading directory lblock 0
[270811.644198] EXT4-fs error (device sda1): ext4_find_entry: inode
#32245: (comm java) reading directory lblock 0
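
Since the capture is long, a short script can condense it. The sketch
below only assumes the message formats visible in the log above
("end_request: I/O error, dev ..., sector ..." and "(comm ...)"); the
function name summarize is made up. It counts failing sectors, which helps
when comparing captures from different crashes, and counts which processes
hit the read-only filesystem.

```python
import re
from collections import Counter

SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
COMM_RE = re.compile(r"\(comm ([^)]+)\)")


def summarize(log_text):
    """Return (failing sectors per device, error count per process name)."""
    sectors = Counter(
        (dev, int(sector)) for dev, sector in SECTOR_RE.findall(log_text)
    )
    # Works on mail-wrapped lines too, because "(comm ...)" stays on one line.
    commands = Counter(COMM_RE.findall(log_text))
    return sectors, commands
```

If the same (device, sector) pairs show up across several crashes, that
points at specific bad blocks; sectors that differ every time fit better
with a transport or driver problem.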



Chris MacDonald Thu, 04 Nov 2010 17:21:31 -0700

Just to clarify, in my original post I made reference to devices in
the field... I've downgraded those D945GCLF2 boards to 9.10 and
they're fine. I'm experiencing the same symptoms here at my desk with
a slightly different board (the D510MO), same flash drive, same RAM.



Chris MacDonald Thu, 04 Nov 2010 17:26:18 -0700

Not to be too pedantic, but if you bottom post more consistently it
will be a lot easier to follow what you're saying.

Mark


Mark Hull-Richter Thu, 04 Nov 2010 18:42:52 -0700

Well, that there is your error right there.  Theoretically speaking, I
suppose it's possible that there is some kind of quirk in those
motherboards that, combined with Linux USB drivers, gets data errors
when reading/writing to the Memory sticks.

However, given the reproducible nature of your troubles, I think I can
take the error message at face value: when faced with the demands of
hosting a constant-use filesystem, sectors of your flash RAM are simply
getting worn out.  As a rule of thumb, USB flash drives are cheap, simple
devices; unlike hard drives, or even SSDs, they have neither wear-levelling
algorithms nor error correction.  It's not at all surprising if
repeated writes to the same sectors eventually start generating errors.

Keep in mind that, if you do not specify noatime in the filesystem mount
options, files which are accessed constantly are also getting their
'access time' constantly updated.  And on a journaled filesystem, those
metadata updates generate even more I/O for the journaling.  This is not
the kind of use flash drives are meant to be put to.
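
To confirm which filesystems are actually mounted without noatime, the
mount table can be checked directly. A minimal sketch, assuming the
whitespace-separated format of /proc/mounts (device, mount point, type,
options); mounts_without_noatime is a hypothetical helper name.

```python
def mounts_without_noatime(mounts_text, fstypes=("ext3", "ext4")):
    """Return mount points of the given filesystem types lacking noatime."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing
```

Checking the live mount table rather than /etc/fstab shows the options
that are in effect, which matters if the fstab edit was made without a
remount or reboot.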


rashkae Thu, 04 Nov 2010 18:45:36 -0700

Apologies for top-posting. I keep forgetting about that here... it's
one of the few places I've seen it 'enforced'.

I'll set up a machine running with noatime and see how it compares to
the others. If it's indeed a problem with bad sectors then I would
expect the error to be less frequent, if it should occur at all.

I should have mentioned the flash drives again in my more recent post,
they're these:
http://www.micron.com/products/solid_state_storage/eusb.html

or more specifically
http://www.micron.com/document_download/?documentId=575

I wasn't involved in procurement but anything I read there leads me to
believe they're ok for use as an OS drive and employ some manner of
both error correction and wear levelling. However, a trip to Micron's
website has made me aware of an app they have for pulling bad block
count information off the module; I'll contact them and see where that
takes me.

Thanks for the input all, this has given me a few more things to try.

Chris



Chris MacDonald Thu, 04 Nov 2010 19:18:36 -0700

You're right, those are not at all what I was thinking, and should
certainly be up to the task at hand.  However, the errors in your dmesg
are what they are.  The only thing to do now is narrow down whether the
io errors originate in those devices or the MB.


rashkae Thu, 04 Nov 2010 19:39:18 -0700

In general, my experience has been that errors that are consistent at
the same location tend to be more local (though not always).  IOW,
if it is the same location on the "disk" that gets the errors
consistently, chances are higher that it is the "disk" that's at
fault, and not the m/b.

Motherboard errors usually manifest either in much stranger ways (like
random errors that don't seem to have a pattern other than failure) or
much more consistent and highly predictable ways (like, say, a whole
stripe of disk that is consistently bad unless you move it to another
interface, as if a path through the i/o port was bad for all accesses,
or it won't power up).  There are also cracked chip/solder issues
which are a little harder to pin down but they are in some sense more
predictable because of their patterns.

Start as close as possible to the problem and fan out from there.  In
this case, it looks like it's your flash drive.  One simple
verification of this would be to try it on a different host.

Mark



Mark Hull-Richter Thu, 04 Nov 2010 20:09:37 -0700

When I'd originally posted about the problem, it was occurring on
D945GCLF2s, same RAM, same flash drives, but 10.04. On that exact same
hardware I've installed 9.10 and haven't had a problem, those machines
have been online for over two months straight. Running 10.04, I'd be
lucky if they'd last two days. These were also formatted ext3 (both
9.10 and 10.04). Now I'm testing on D510MOs, same RAM, same flash
drives, 10.10, ext4 and I'm experiencing the same problem.

Part of the reason I started on the Ubuntu users list was because the
error was only appearing with 10.04. Now with 10.10 producing errors
as well, I'm still inclined to believe it was something in the kernel
changing after 9.10. I'm setting up a test that will put a platter HDD
in an enclosure on the USB bus to see if perhaps it's something in the
kernel's usb implementation that disagrees with the hardware.

In any event, I've got some test cases to work on, I'll set those up
to run tomorrow and I'll post back as soon as something fails. :)

Chris



Chris MacDonald Thu, 04 Nov 2010 21:41:26 -0700

Wasn't dpkg changed in 10.04 to do a sync() after every package
the filesystem.  Enough to trigger errors?  I don't know.

I don't suppose those USB drives support SMART and would tell you the
number of erase cycles they've experienced?  (System -> Administration
-> Disk Tool)

Marius Gedminas
--
"Actually, the Singularity seems rather useful in the entire work avoidance
field. "I _could_ write up that report now but if I put it off, I may well
become a weakly godlike entity, at which point not only will I be able to
type faster but my comments will be more on-target."        - James Nicoll


Marius Gedminas Fri, 05 Nov 2010 05:45:57 -0700

Ok, first round of testing done. I had three setups, each running
10.10 on an Intel D510MO with 1GB of RAM. The first ('A') was my
control, it had the 4GB Micron eUSB module with default filesystem
options, it crashed in roughly 40 hours. 'B' was still a 4GB eUSB but
I set noatime in the fstab to test out Rashkae's idea, this failed in
24 hours. 'C', and the most interesting, is a SATA hard drive I
plugged in to a SATA-to-USB caddy (05e3:0718), then in to the D510MO,
this failed in 62 hours.

I'm not really concerned about the times for the moment. The
installation on all of them was the same, but there is too much going
on to put much value into the times. The fact that C failed I think
points me in the direction I need to go; there's something off about
the USB implementation on this motherboard, or a change that was made
to the kernel between 9.10 and 10.10. Does this seem reasonable?

Chris



Chris MacDonald Mon, 08 Nov 2010 10:03:37 -0800

It would not have been my first guess, but this certainly seems to be
where things are leading.  The USB bus would not have been on my first
list of suspects simply because it's such a ubiquitous thing.
There are only a small handful of chipsets that are used on mobos to
offer USB, and as far as I know, all are well supported and tested.

I guess my next step would be to compile a fresh kernel from kernel.org
and see if the failure persists, and if so, ask around on the kernel
mailing list for suggestions.

Though, just to be on the safe side, I would also prepare one of your
test systems with the SATA drive connected to a SATA port, to verify that
that system remains alive. (i.e., make sure to narrow down the fault to USB)
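
When setting up that control, it helps to verify how each disk is really
attached. The sketch below relies on an assumption about the Linux sysfs
layout: the resolved path of /sys/block/<disk> contains a component such
as "usb1" when the disk hangs off a USB host controller. Both function
names are invented for this example.

```python
import os


def is_usb_path(sysfs_path):
    """True if a resolved sysfs device path passes through a USB bus."""
    return any(
        part.startswith("usb") and part[3:].isdigit()
        for part in sysfs_path.split("/")
    )


def disk_is_on_usb(disk="sda", sys_block="/sys/block"):
    """Resolve the sysfs symlink for a disk and test it for a USB bus."""
    return is_usb_path(os.path.realpath(os.path.join(sys_block, disk)))
```

Recording disk_is_on_usb("sda") alongside each test result keeps the
USB and SATA runs from being mixed up when the same drive is moved
between ports.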


rashkae Mon, 08 Nov 2010 15:42:01 -0800


