#1 2012-11-10 19:47:48

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

[Solved] Intermittent Boot Failure

I power down my PC and take a day off every week. Last week, when I rebooted, I got an "unable to find root device" error. I followed some forum posts and also found this FAQ entry: After updating my system, I get a "unable to find root device" error after rebooting and my system will no longer boot. I did the "chroot and re-generate initramfs image" procedure. I think it still failed to boot once, and then it just booted and worked.

Anyhow, this week when I rebooted I got the same "unable to find root device" problem and, even though I didn't think it was necessary, I did that procedure again. Then I just rebooted into my main system four times in a row until it worked. The first three times I got the same error.

This is not a software problem. I mean, I'm no genius, but I think my PC has a problem "finding" the hard drive.

I don't think my fstab is necessarily relevant, but here it is:

# 
# /etc/fstab: static file system information
#
# <file system>	<dir>	<type>	<options>	<dump>	<pass>
tmpfs		/tmp	tmpfs	nodev,nosuid	0	0
UUID=29c67a48-7636-4599-affc-e826225b49c2 /home ext4 defaults 0 1
UUID=3f211853-11a0-4ac9-9332-cad016b3b521 /boot ext4 defaults 0 1
UUID=4ca1a48e-fbd3-4756-8340-d533f6c968b8 / ext4 defaults 0 1
UUID=849fbafc-f8a9-4750-8efe-ceb434f735e0 /srv ext4 defaults 0 1


/dev/sda2 /mnt/sda2 ext3 defaults 0 1
/dev/sda3 /mnt/sda3 ext3 defaults 0 1
/dev/sda4 swap swap defaults 0 0

I have an old system on /dev/sda, and my main system is on /dev/sdb and referred to by the UUIDs. What happens is that it starts to boot and finds /dev/sda fine, AFAICT, and then it fails when it can't find /dev/sdb.

Short-term solution is to never reboot again!!! :)

Next idea is to make a fresh install on /dev/sda and make that my main one, and then who cares about /dev/sdb :)

But the real question is, what can I do about this? Since it's intermittent, it seems hard to debug. Furthermore, my hardware dealer (whom I like and trust) is good at what he does, I think, but he only knows Windows, so I'm not sure how much he could help.

I don't know how to try to determine if it's the drive or the mobo or the cable or what.

Any ideas are appreciated.

Last edited by CaptainKirk (2012-11-24 19:22:42)


#2 2012-11-10 20:44:08

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 12,419

Re: [Solved] Intermittent Boot Failure

Are you using Grub2 or Syslinux?


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Like you, I have no idea what you are doing, but I am pretty sure it is wrong...Jasonwryan
----
How to Ask Questions the Smart Way


#3 2012-11-10 21:57:18

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

I guess grub2 as I have a grub directory in /boot

I ran S.M.A.R.T. tests now but they're good:

$ sudo smartctl -l selftest /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.6.6-1-ARCH] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5262         -
# 2  Short offline       Completed without error       00%      5261         -

Full results also look good I think:

$ sudo smartctl -a /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.6.6-1-ARCH] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD502HJ
Serial Number:    S20BJ90B983233
LU WWN Device Id: 5 0024e9 206255b87
Firmware Version: 1AJ10001
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Nov 10 23:53:51 2012 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 4800) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  80) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   056   056   000    Old_age   Always       -       4378
  3 Spin_Up_Time            0x0023   083   082   025    Pre-fail  Always       -       5358
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5263
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   062   000    Old_age   Always       -       25 (Min/Max 12/38)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       69

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5262         -
# 2  Short offline       Completed without error       00%      5261         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
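
For a quick look at the attributes that usually matter most when deciding between a failing disk and a bad cable or port, the attribute table above can be filtered down. This is only a sketch: smart_raw_values is a hypothetical helper name, and it assumes the ten-column attribute layout shown in the output above, with the raw value in column 10.

```shell
#!/bin/sh
# Sketch: filter `smartctl -A` style output (read from stdin) down to a few
# attribute IDs and print "ID NAME RAW_VALUE" for each one found.
# smart_raw_values is a hypothetical helper name, not part of smartmontools.
# Default IDs: 5 (reallocated sectors), 197 (pending sectors),
# 198 (offline uncorrectable), 199 (UDMA CRC errors).
smart_raw_values() {
    awk -v ids="${1:-5 197 198 199}" '
        BEGIN {
            # Build a lookup table of the attribute IDs we want to keep.
            n = split(ids, want, " ")
            for (i = 1; i <= n; i++) keep[want[i]] = 1
        }
        # Attribute rows start with a numeric ID and have at least 10 columns.
        $1 ~ /^[0-9]+$/ && ($1 in keep) && NF >= 10 { print $1, $2, $10 }
    '
}
```

Typical use would be "sudo smartctl -A /dev/sdb | smart_raw_values". In the report above all four raw values are 0. A non-zero UDMA_CRC_Error_Count is commonly associated with cabling or interface trouble rather than the platters, but a clean table does not rule either out.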


#4 2012-11-10 23:35:15

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 12,419

Re: [Solved] Intermittent Boot Failure

Okay, I am thinking that BIOS / Grub2 are not always mapping the drives the same way from boot to boot.

Could you post your  /boot/grub/grub.cfg file?  Either that, or pastebin it and provide a link.



#5 2012-11-11 15:36:26

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

ewaller wrote:

Could you post your  /boot/grub/grub.cfg file?

I'm afraid not:

$ sudo ls /boot/grub -al
total 416
drwxr-xr-x 2 root root   1024 Feb 24  2012 .
drwxr-xr-x 4 root root   1024 Nov 10 21:02 ..
-rw-r--r-- 1 root root  13728 Aug  8  2011 e2fs_stage1_5
-rw-r--r-- 1 root root  11824 Aug  8  2011 fat_stage1_5
-rw-r--r-- 1 root root  10592 Aug  8  2011 ffs_stage1_5
-rw-r--r-- 1 root root  10592 Aug  8  2011 iso9660_stage1_5
-rw-r--r-- 1 root root  12800 Aug  8  2011 jfs_stage1_5
-rw-r--r-- 1 root root   1583 Feb 24  2012 menu.lst
-rw-r--r-- 1 root root  10592 Aug  8  2011 minix_stage1_5
-rw-r--r-- 1 root root  14624 Aug  8  2011 reiserfs_stage1_5
-rw-r--r-- 1 root root    512 Aug  8  2011 stage1
-rw-r--r-- 1 root root 147440 Aug  8  2011 stage2
-rw-r--r-- 1 root root 147440 Aug  8  2011 stage2_eltorito
-rw-r--r-- 1 root root  10996 Aug  8  2011 ufs2_stage1_5
-rw-r--r-- 1 root root  10080 Aug  8  2011 vstafs_stage1_5
-rw-r--r-- 1 root root  14856 Aug  8  2011 xfs_stage1_5
$ sudo find / -name grub.cfg
/var/abs/extra/grub2/grub.cfg

However today xscreensaver wouldn't let me back in--said my password failed. That never happened before. So I made a new session and then later it worked fine.

Seems clear this disk is having problems.


#6 2012-11-11 16:25:29

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 12,419

Re: [Solved] Intermittent Boot Failure

Ah Ha! (facepalm) You are using legacy Grub.

Please post your /boot/grub/menu.lst :)



#7 2012-11-11 16:27:59

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

Yes, a trip down memory lane, eh? :)

Here you are sir:

$  more /boot/grub/menu.lst
# Config file for GRUB - The GNU GRand Unified Bootloader
# /boot/grub/menu.lst

# DEVICE NAME CONVERSIONS 
#
#  Linux           Grub
# -------------------------
#  /dev/fd0        (fd0)
#  /dev/sda        (hd0)
#  /dev/sdb2       (hd1,1)
#  /dev/sda3       (hd0,2)
#

#  FRAMEBUFFER RESOLUTION SETTINGS
#     +-------------------------------------------------+
#          | 640x480    800x600    1024x768   1280x1024
#      ----+--------------------------------------------
#      256 | 0x301=769  0x303=771  0x305=773   0x307=775
#      32K | 0x310=784  0x313=787  0x316=790   0x319=793
#      64K | 0x311=785  0x314=788  0x317=791   0x31A=794
#      16M | 0x312=786  0x315=789  0x318=792   0x31B=795
#     +-------------------------------------------------+
#  for more details and different resolutions see
#  https://wiki.archlinux.org/index.php/GRUB#Framebuffer_resolution

# general configuration:
timeout   5
default   0
color light-blue/black light-cyan/blue

# boot sections follow
# each is implicitly numbered from 0 in the order of appearance below
#
# TIP: If you want a 1024x768 framebuffer, add "vga=773" to your kernel line.
#
#-*


# (0) Arch Linux
title  Arch Linux
root   (hd1,0)
kernel /vmlinuz-linux root=/dev/sdb2 ro
initrd /initramfs-linux.img

# (1) Arch Linux
title  Arch Linux Fallback
root   (hd1,0)
kernel /vmlinuz-linux root=/dev/sdb2 ro
initrd /initramfs-linux-fallback.img

# (2) Windows
#title Windows
#rootnoverify (hd0,0)
#makeactive
#chainloader +1
#
#

title old 32 bit Arch
root (hd0,0)
kernel /vmlinuz26 root=/dev/sda2 ro
initrd /kernel26.img


#8 2012-11-11 17:42:14

Stebalien
Member
Registered: 2010-04-27
Posts: 1,218

Re: [Solved] Intermittent Boot Failure

Specify your root partition by UUID:

root=UUID=4ca1a48e-fbd3-4756-8340-d533f6c968b8

https://wiki.archlinux.org/index.php/UUID#Boot_managers


Steven [ web : git ]
GPG:  327B 20CE 21EA 68CF A7748675 7C92 3221 5899 410C
Do not email: honeypot@stebalien.com


#9 2012-11-11 17:56:42

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 12,419

Re: [Solved] Intermittent Boot Failure

That ^^^

In other words, edit your /boot/grub/menu.lst and change the lines:

kernel /vmlinuz-linux root=/dev/sdb2 ro

to

kernel /vmlinuz-linux root=UUID=4ca1a48e-fbd3-4756-8340-d533f6c968b8 ro

(I think that format is correct, YMMV)
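
For anyone who would rather script that edit than make it by hand, here is a minimal sketch. use_uuid_root is a hypothetical helper name; it assumes GRUB legacy "kernel" lines of the form shown in the menu.lst above, and it only prints the rewritten file to stdout so the result can be reviewed before anything under /boot is touched.

```shell
#!/bin/sh
# Sketch: print a copy of a GRUB legacy menu.lst in which root=<device> on
# "kernel" lines is replaced by root=UUID=<uuid>. The input file is not modified.
# use_uuid_root is a hypothetical helper name, not part of GRUB.
use_uuid_root() {
    menu_file=$1   # e.g. /boot/grub/menu.lst
    device=$2      # e.g. /dev/sdb2
    uuid=$3        # e.g. the UUID listed for / in /etc/fstab
    # '#' is the sed delimiter because the device path contains '/'.
    # First expression: device followed by whitespace; second: device at end of line.
    # Matching the following character keeps /dev/sdb2 from matching /dev/sdb20.
    sed -e "/^[[:space:]]*kernel[[:space:]]/ s#root=${device}\([[:space:]]\)#root=UUID=${uuid}\1#" \
        -e "/^[[:space:]]*kernel[[:space:]]/ s#root=${device}\$#root=UUID=${uuid}#" \
        "$menu_file"
}
```

One way to use it is to redirect the output to a scratch file, for example "use_uuid_root /boot/grub/menu.lst /dev/sdb2 4ca1a48e-fbd3-4756-8340-d533f6c968b8 > /tmp/menu.lst.new", compare it with the original using diff, and copy it into place as root only once the diff shows nothing but the expected kernel lines changing.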



#10 2012-11-11 18:00:47

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

OK, I will try that and see what happens. I won't be rebooting until I get everything backed up anyhow. :)

ewaller wrote:

Okay, I am thinking that BIOS / Grub2 are not always mapping the drives the same way from boot to boot.

Is there any reason to suggest that recent changes in Arch could cause this? I installed this disk and this system in January. This issue never happened until last week. I reboot at least once a week....


#11 2012-11-11 18:08:12

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 12,419

Re: [Solved] Intermittent Boot Failure

I would not think so.  The behavior at a cold boot might be different from that at a warm boot.  It likely has to do with whichever drive goes ready first.  It could be a sign that one of your drives is starting to age and is taking longer than it used to in order to go ready.

Edit:  Regardless, that is why one uses UUID.  It reduces the ambiguity factor to practically zero.
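
To see which device node a given UUID resolves to on the current boot, the symlinks that udev maintains under /dev/disk/by-uuid can be followed. A minimal sketch: uuid_to_device is a hypothetical helper name, and the optional second argument exists only so the lookup directory can be overridden.

```shell
#!/bin/sh
# Sketch: resolve a filesystem UUID to the device node it points at right now.
# uuid_to_device is a hypothetical helper name; the directory argument
# defaults to the udev-managed /dev/disk/by-uuid.
uuid_to_device() {
    uuid=$1
    by_uuid_dir=${2:-/dev/disk/by-uuid}
    link="$by_uuid_dir/$uuid"
    # If the kernel never detected the disk, udev creates no symlink for it.
    if [ ! -e "$link" ]; then
        echo "no block device with UUID $uuid is currently visible" >&2
        return 1
    fi
    # Follow the symlink to the real device node, e.g. /dev/sdXN.
    readlink -f "$link"
}
```

Running "uuid_to_device 4ca1a48e-fbd3-4756-8340-d533f6c968b8" on a few consecutive boots shows whether the root filesystem keeps the same /dev/sdXN name. If it fails outright, the disk was not detected on that boot at all, and no naming scheme can help.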

Last edited by ewaller (2012-11-11 18:09:13)



#12 2012-11-15 18:58:56

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

A power outage caused a hard reboot and now the drive doesn't appear at all. I installed a fresh Arch on /dev/sda and it's fine. I had to boot from a CD and a USB stick several times before I got the install to work, and once or twice I did see /dev/sdb in lsblk, but mostly not.

It does appear in the PC's BIOS display when I use that to select the boot device.

I have some files there I would like to get if possible.

Any ideas how I can convince Arch that that disk exists? :)


#13 2012-11-15 21:13:08

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

What I meant is that I used the UUID, but it still says it can't find that device (and lists it by its UUID). I figured out a workaround: I just rebooted from the USB stick repeatedly (because that's the fastest way to get a shell) until lsblk showed the second disk. Now I have copied over the files I wanted. :)

I suppose I can't actually be certain if this intermittent error is on the disk or the mobo or the cable without more experimenting....
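
One way to narrow this down is to compare the kernel's libata messages from a boot where the disk shows up with those from a boot where it does not. A minimal sketch, assuming the usual "ataN:" or "ataN.MM:" prefix on those messages; ata_link_events is a hypothetical helper name, and the exact message wording varies by kernel version.

```shell
#!/bin/sh
# Sketch: keep only kernel log lines that carry a libata port prefix such as
# "ata2: " or "ata1.00: ". Reads log text on stdin.
# ata_link_events is a hypothetical helper name, not an existing tool.
ata_link_events() {
    # grep exits with 1 when nothing matches; treat that as success so an
    # empty result is not mistaken for a failure. Real errors still propagate.
    grep -E 'ata[0-9]+(\.[0-9]+)?: ' || [ $? -eq 1 ]
}
```

Typical use would be "dmesg | ata_link_events" (reading the kernel log may require root on some systems). If the same ataN port keeps reporting link trouble no matter which disk or cable is attached, the port is the more likely suspect; if the trouble follows the disk to another port, the disk is.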


#14 2012-11-24 19:22:20

CaptainKirk
Member
Registered: 2009-06-07
Posts: 335

Re: [Solved] Intermittent Boot Failure

After further testing, it appears clear that it's the SATA port: I switched the drive to a different port and now it seems OK. Thanks for the help.
