EXT4-fs error loading journal - what's the best way to recover

wolfdogg · 2012-02-23 18:32:20

uh oh, it appears i have a hard drive crash. i was doing some web dev and i don't want to lose this work. yesterday i noticed the hard drive light was on almost solid, if not solid. i couldn't log in remotely over ssh: it did answer and let me at least get to the username, but the password prompt wouldn't come up. the apache server was still running and serving web files, but WebDAV wouldn't allow file connections, and hitting the space bar and moving the mouse wouldn't wake the monitor at all. this went on for several minutes.

so i had to reset (restart). looking back, i think this is the 2nd time i messed things up by resetting the machine while the file system was doing god knows what.

now when it boots i'm getting sector and mount errors.

Failed command: Read DMA
Status DRDY ERR
ata1:00 error: { UNC }
end_request: I/O error, dev sda, sector 38762983
Failed to read block at offset 1628
EXT4-fs (sda3): error loading journal
mount: wrong fs type, bad option, bad superblock on /dev/sda3,
missing code page, or helper program, or other error.
in some cases useful info is found in syslog - try dmesg | tail or so

now i'm dropped to an emergency shell

# dmesg | tail

outputs almost the same error

Descriptor sense data with sense descriptors (in hex)
    {followed by hex output}
...
...
...
JBD2: Failed to read block at offset 1628
JBD2: recovery failed
EXT4-fs (sda3): error loading journal

it appears i jacked it up when i reset it during file operations.

where to go from here? i was going to load a partition program just to see if it can read the drive at all, in an attempt to move the drive to another pc and back up my home directory.

what kind of repair tools can i run from the emergency shell to repair it?

this link might help somebody help me figure out what's best here: http://www.linuxinsight.com/ext4-root-p … rrors.html how do i make sure the partition isn't mounted? assuming it isn't is not good enough for me, since this is business data.
i'm thinking of running fsck
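
One way to answer the "is it really unmounted?" question is to ask the kernel's own mount table instead of assuming. The sketch below is a minimal example, not a vetted tool: the function names is_mounted and check_before_fsck are made up for illustration, and it assumes the partition is listed in /proc/mounts under the same /dev/sdXN name that is passed in (a root filesystem can show up under another name, and swap is listed in /proc/swaps instead).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if DEVICE appears as a mount source in a
# mounts table. The table defaults to /proc/mounts, the kernel's own view.
is_mounted() {
    device=$1
    table=${2:-/proc/mounts}
    # Field 1 of each line is the mounted device; compare it exactly.
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$table"
}

# Hypothetical guard to run before a repair: report and return 1 if the
# partition is mounted, otherwise report that it is safe to continue.
check_before_fsck() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "$1 is mounted; unmount it before running fsck"
        return 1
    fi
    echo "$1 is not mounted"
}
```

From a live cd this would be called as check_before_fsck /dev/sda3, and a read-only pass with fsck -n could follow before any repair that writes to the disk.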

Last edited by wolfdogg (2012-02-24 17:52:34)

wolfdogg · 2012-02-23 20:03:49

after booting into the live cd,
and checking mounts with # mount
i verified that sda3 wasn't mounted
i tried to mount it to /media/newhome
mount -t ext4 /dev/sda3 /media/newhome
got the same hard drive errors that i see on boot,
ran fsck -n
no luck,

then;

i ran fsck /dev/sda3
answered yes to ignore errors
no to the first repair prompt; can't remember what it was but it didn't look safe. i answered yes to the rest of the repairs, which were mainly dtime, block count, and bitmap differences. it modified the file system. after reboot i saw a message that it was repairing the journal, then it booted. YES!!!!!!

most of the errors were in the phpmyadmin folder, which is the app i was loading across the LAN when i noticed the problems. i wonder if it was recovering from errors when i reset the PC. man, i need to silicone that button inoperable, but then i would just hit the switch on the back of the PS.

should i boot back into the live cd and run fsck once more to check if it's clean now? especially since i skipped one repair? or should i leave well enough alone?

Edit: i ran it, wasn't going to do any repairs yet, but it comes back clean
When i logged into KDE (this box normally runs as an apache server with no console user) just to browse the files, all looked good, but then the hard drive light came on again and i heard it plowing through files. i logged out to see if it would stop; there was a big delay when i clicked the kicker menu, and after i logged out the hd light mostly went out. after rebooting into the live cd and running fsck /dev/sda3, which finished and reported clean, now i'm noticing the HARD DRIVE LIGHT IS ON AGAIN. what's going on? Is my hard drive failing? it's a WD3200KS Sata 250GB.

Last edited by wolfdogg (2012-02-24 17:53:03)

thisoldman · 2012-02-23 21:40:08

You might try 'iotop' to see exactly which processes are accessing the hard disk. It's like 'top' for IO.

wolfdogg · 2012-02-23 22:53:53

funny, trying to install it i was getting bad lags, waiting for the command line to respond. the install said i needed newer gcc-libs, then the gcc-libs update said i needed to update pacman, so i updated pacman, then tried to update gcc-libs again and it didn't let me. so then i tried installing iotop again; by now there was NO more lag on the command line and the hard drive light had settled down. i installed iotop successfully this time, and after running it, it is showing 0.00 bps on both read and write, but the hard drive light is back on in full.

so far what i did manage to get was

(all the IO items only pop up intermittently for a second, with more than 20-30 seconds between each time they show up, so it's a bit monotonous to wait for each item to appear)

i found this link after doing a search for "arch jbd2" https://bbs.archlinux.org/viewtopic.php … 17#p801017 funny, it didn't return an informative article on jbd2, it returned a thread where someone's hard drive light won't shut off; that sounds familiar. i'm looking into whether HAL is running or not but i have to run, be back in an hour. any suggestions in the meantime? note HAL is not in my daemons list, so how can i see if HAL is running? and how do i disable it if it's not in my daemons list in rc.conf?

EDIT:
ok, using the -a switch (iotop -a)
then pressing o when running
under io, i have

jbd2/sda3-8 (my home)
crond -S -l info
smbd -D
smbd -D
syslog-ng
nmbd -D
9 copies of httpd -k start (is this causing problems? is it trying to start 9 times in the course of 2 hours?)
nepomukservic-pomukfilewatch (wish i could kill nepomuk for good)
flush-8:0
kdeinit4: nep~rver [kdeinit]

Last edited by wolfdogg (2012-02-24 01:38:05)

wolfdogg · 2012-02-24 01:35:54

anybody know what i should be doing to solve this?

thisoldman · 2012-02-24 10:46:04

Make sure anything important is backed up. The bottom line is that hard drives are cheap when compared to the cost of data loss. I take it that you have only one hard drive installed?

Have you checked your log files for any important messages?

'fsck' checks a filesystem on a disk, it doesn't perform a thorough hardware check. Hard drive manufacturers will often have drive checking utilities available on their website. If you don't know the manufacturer, 'hdparm -I /dev/sdX' (that's a capital 'i' as the option) may give the manufacturer and a whole lot more information to you. If it doesn't give you the manufacturer name, it should at least give you a model number and you can Google that.

Most drives are SMART-capable. SMART's reliability for prediction may not be good, but it has been reliable for me when used to confirm serious problems. The command line package is 'smartmontools' and a GUI for the tools is 'gsmartcontrol'.

wolfdogg · 2012-02-24 17:58:21

the drive is a WD3200KS

The problems have migrated over to /dev/sda4 now, so it looks like i might have some hard drive failure coming on.

NO it's not my only drive, it's just that all my others are full or in use. I have about 3 hard drives in each computer at least, and another 4 or 5 externals in various stages with various file systems, plus a laptop that i'm currently backing up some last minute files to. either they are in repair, have lost their integrity, or are in full use. i feel like i'm stuffing a backpack with last minute items, since i'm backing up all the files on the /home partition (196 GB). i decided to do that after the home partition (sda4) got a few corrupted files too. i'm having to copy all the files from another machine, since the system in question is lagging too badly to use now. In KDE i can't get the kicker menu to pull up, nor dolphin. I don't really feel like logging into gnome, no telling what will happen.

what log files should i check?

smartctl -l error /dev/sda3 reports UNC errors at or around 20523 hrs (855 days) of operation.

I will run some tests on it to see if i should be rebuilding it or shotgunning it, depending on whether i can get a program to flag the drive bad. are there any other programs that are likely to report the drive bad if the SMART tools don't? Is it possible for UNC errors to come from sources other than failing mechanics in the HD, or do they always indicate a failing hard drive?
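
On the log question, the kernel log is the first place the failing reads show up, in the same "I/O error, dev sda, sector N" form as the boot messages earlier in the thread. A rough sketch for summarising them follows; the function name list_bad_sectors is invented for this example, and it assumes the log lines keep that wording.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each distinct
# "device sector" pair that hit an I/O error, followed by how often it occurred.
list_bad_sectors() {
    # Keep only lines with an I/O error and reduce them to "device sector".
    sed -n 's/.*I\/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

It would be fed with something like dmesg | list_bad_sectors. The same few sectors repeating points at a small damaged area, while a list that keeps growing between runs suggests the drive is getting worse.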

thisoldman · 2012-02-24 18:21:27

Western Digital Caviar Blue -- diagnostic programs for DOS or Windows: http://support.wdc.com/product/download … S&x=14&y=7

wolfdogg · 2012-02-25 07:25:34

thanks for that link. it seemed like it was spiraling downward fast, but there has been a glimmer of hope. i ran full offline tests on both partitions that weren't boot or swap (home and root) and they showed some errors. then i used the link you included to run the WD diagnostics program: it did a short test and said errors were found, then i ran a long test, and it said it corrected the errors. i'm guessing it just flagged some bad sectors; i don't know what else it might have done.
backing up the files took the longest.

then i booted into the livecd again and ran fsck; it repaired the journal instantly on sda3 and sda4, and they all came back clean. So now i have to keep an eye out for any errors, and if things didn't get fixed i should know in a hurry, because it didn't take long for them to surface.

now after reboot, smartctl -H /dev/sda3 and smartctl -H /dev/sda4 both pass as they did before. i will have to look further into the smart errors though to see what lingers.

right now im looking at iotop, and i see the latest file access is from;
virtuoso (being the most),
down through;
flush-8:0 (what is this?)
jbd2/sda4-8,
nepomukservicestub nepomukfilewatch,
kwin - session,
kdeinit4:
plasma desktop and konsole,
nmbd -D

and the hard drive light is not solid either, it's just flashing away as the lame nepomuk indexer does its thing. I looked into disabling this in the past; it was a long and weary road since it seems to be deeply embedded in KDE (or is there a known easy way?)

So thanks for the tips. i have decisions to make: either continue on this drive, or downgrade it to video capture or something. the problem is, i don't have a clear replacement, and i need several replacements already, but finances won't permit.

Last edited by wolfdogg (2012-02-25 07:27:13)

wolfdogg · 2012-02-25 07:43:43

Hard drive has quieted down, that's a change.

After pulling up the attributes for the drive again, it is saying the same thing it said after both the short and long tests, before i ran the WD Diagnostics and repair on it. That's expected i'm guessing, since SMART is just a log, so i would expect to still see the info that has already been logged.
i'll include a few of the attributes that were reported when i ran those tests, in case anybody can help figure out their real meaning. What i want to point out is that when the TYPE column comes up as "Pre-fail" or "Old_age", that only names the kind of attribute; it does NOT mean the drive is failing. Although this is counterintuitive, the more important column is "WHEN_FAILED". see the quote from the man page below for more info on this...

Some of the attributes displayed for

smartctl --all /dev/sda3

Raw_Read_Error_Rate | Pre-fail
Spin_Up_Time | Pre-fail
Start_Stop_Count | Old_age
this goes on and on for 15 attributes with Pre-fail and Old_age in all of them, with Old_age being in most.

note, the WHEN_FAILED column is filled only with hyphens - therefore, there has been no failure.

so for future reference, i want to quote a portion of the smartmontools man page which relates to hard drive fail attributes http://smartmontools.sourceforge.net/ma … ctl.8.html

Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.
Each Attribute also has a "Worst" value shown under the heading "WORST". This is the smallest (closest to failure) value that the disk has recorded at any time during its lifetime when SMART was enabled. [Note however that some vendors firmware may actually increase the "Worst" value for some "rate-type" Attributes.]
The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
If the Attribute's current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has also never failed in the past.
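
The rule in that quote can be written down as a small filter. This is only a sketch: the function name classify_smart_attributes is invented here, and it assumes the usual ten-column attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) arriving on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on stdin and print
# "NAME TYPE VERDICT" per attribute, applying the man page rule:
#   VALUE <= THRESH  -> FAILING_NOW
#   WORST <= THRESH  -> In_the_past
#   otherwise        -> ok
classify_smart_attributes() {
    awk '
        # Attribute rows start with a numeric ID; headers and blanks are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (value <= thresh)      verdict = "FAILING_NOW"
            else if (worst <= thresh) verdict = "In_the_past"
            else                      verdict = "ok"
            print $2, $7, verdict
        }
    '
}
```

It could be run as smartctl -A /dev/sda | classify_smart_attributes. A Pre-fail attribute with the verdict ok is the normal case, which matches the point that the TYPE column alone says nothing about failure.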

I have a couple of questions
Do i have this figured out right?
Is this particular hard drive ok for stable use until i can afford to replace it? I plan to keep an eye on the SMART monitor to look for fails as well.
How would i run a proper file check, as opposed to a file system check? localhost/phpmyadmin isn't working, and that's where it all started. i'm sure the files got chewed up there, and elsewhere.

Last edited by wolfdogg (2012-02-25 07:51:22)

thisoldman · 2012-02-25 10:06:51

I'm glad things have quieted down. I'm glad my help was useful.

I was pressed for time yesterday, or I would have mentioned that I have had a drive run reliably for months with some smartctl warnings. I believe the UNC error could occur from a power failure during a write operation.

The 'flush' that iotop shows occurs with ext4 and, I think, some other filesystems. The filesystem holds the data in a buffer or cache for a short period, until either a threshold amount of data is reached or a certain amount of time has passed. The data is then flushed from memory for the actual physical write. 'jbd2' appears in iotop from the journalling system working.

As far as I know, you're doing all you can reasonably do to ensure data integrity with a finite amount of hard drive capacity. If we could afford it and could spare the performance hit, we'd all be mirroring partitions onto other drives in realtime but this just isn't practical.

For archive integrity, on backups that will be kept unchanged for a long period, I do keep a record of md5sums. Just to see if there's been any bit rot over time.
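
A minimal sketch of such a checksum record, assuming GNU md5sum and find are available. The function names make_manifest and verify_manifest are invented for this example, and the manifest file is assumed to live outside the directory being recorded.

```shell
#!/bin/sh
# Hypothetical helper: write an md5sum line for every regular file under DIR
# into MANIFEST. Paths are stored relative to DIR so the record still works
# after the archive is moved or restored elsewhere.
make_manifest() {
    dir=$1
    manifest=$2
    ( cd "$dir" && find . -type f -exec md5sum {} + ) > "$manifest"
}

# Hypothetical helper: re-read every file listed in MANIFEST and compare.
# Prints one OK/FAILED line per file and exits non-zero on any mismatch.
verify_manifest() {
    dir=$1
    manifest=$2
    ( cd "$dir" && md5sum -c - ) < "$manifest"
}
```

Something like make_manifest /mnt/backup/home ~/home.md5 would be run once after the backup, and verify_manifest with the same arguments months later. A mismatch only says a file changed; it cannot repair it, so a second copy is still needed.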

wolfdogg · 2012-02-25 11:27:45

thanks for the info. that's a good idea to keep md5sums, that's a foolproof way to check data integrity i believe.

i reinstalled phpmyadmin, it's back to normal, all databases still intact, no edits needed. and now i'm in that place of being unsure how long until the bomb goes off. :-)

I'm going to look up some ideas tomorrow, but i'm thinking fsck doesn't catch all the corrupted files, so i'm guessing it just concentrates on the file system structure. Is there another file checker that might flag the files that are still corrupted by this mishap? I know a couple of video files are still corrupted, and who knows what else; for instance the phpmyadmin installation was hosed until i reinstalled it, so fsck didn't catch that stuff. is there a well known program that might?

also, how about a defrag program? i backed up and moved out about 200GB of data, so i would like to defrag what's on there before the gaps get filled in sporadically. I figure it's a good chance to reorder the files and get them contiguous. Especially since now would be a good time to kill the hard drive, if that was going to happen, with all my data backed up; and once the files are defragged it should be easier on the drive in the future.

Last edited by wolfdogg (2012-02-25 11:30:46)

thisoldman · 2012-02-25 12:00:14

'photorec' in the 'testdisk' package will recover deleted files and doesn't depend on the file system. It just looks at the data on the unmounted disk. It won't restore a corrupted file, but can sometimes give a partial recovery. It works on more than just image files. Photorec recovered files are given numbers as names, it's up to the user to figure out what's what. 'testdisk' is useful for recovering partition tables.

Defragging shouldn't be much of a problem with ext4 until a disk becomes very full. Most people tar the disk contents and then untar it back to perform a disk defragmentation. Another option would be to use 'dar' or 'fsarchiver'.

I've been impressed with fsarchiver and use it when I wish to rearrange partitions. It will copy a partition's filesystem into a compressed archive that can then be used to restore the filesystem to any partition that has enough space to hold the files. All of these backup programs work to defrag because they write back to disk one file at a time.
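
The tar round trip can be sketched as a single pipe when a second filesystem is available to receive the copy. This is an illustration under that assumption and a variation on the archive-then-restore approach, not the exact procedure described above; the function name copy_tree is made up.

```shell
#!/bin/sh
# Hypothetical helper: copy the tree under SRC into DEST through a tar pipe,
# so every file is written out again, one at a time, on the destination.
copy_tree() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -p on extraction keeps permissions; run as root to keep ownership too.
    tar -C "$src" -cf - . | tar -C "$dest" -xpf -
}
```

For example, copy_tree /home /mnt/spare/home, followed by a comparison with diff -r before the original is wiped and the data copied back.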

You should check SystemRescueCD: http://www.sysresccd.org/SystemRescueCd_Homepage. It includes a lot of tools, including those I mentioned.

wolfdogg · 2012-02-29 06:11:35

well, i ran grc, it took 8 hrs to check sda3, then a whopping 72 hrs to check sda4, after that, i ran fsck again, it didn't last long. unc errors, fsck again, more unc errors,
so i reinstalled arch linux, that didn't complete, so i ran gparted to do a complete format in case it was just using sectors that i had hoped grc flagged dirty. after the reformat, i hoped that if things didn't error out, the reformat had finally marked the bad sectors unusable, but just after rebooting into the arch setup cd after the gparted reformat, i get UNC errors again.

So it looks like i went full circle. hard drive becomes a frisbee.

spinrite showed about 15-20 unusable sectors, and wasn't able to recover the data, no big deal, i backed up everything before i ran it. but shouldn't it have flagged the sectors dirty so that after a format they would be totally ignored? the reason i want to know this for sure is so i can determine whether i'm witnessing it degrading, or if i just need to take one more step to make sure those sectors aren't accessed. it's a matter of time and money, so i don't want to assume anything. would anybody happen to know?

thisoldman · 2012-02-29 10:31:02

Is your shift key broken or is it your little finger that's broken? Readability matters.

Last edited by thisoldman (2012-02-29 10:44:40)

wolfdogg · 2012-03-01 02:04:37

Sorry, it was a different type of paragraph, a run-on sentence to make a long story short. Didn't realize i wasn't punctuating properly.....

I scrambled all over the office since yesterday, managed to scrub up a sata 150GB, and have a 2TB on order to solve a lot of my problems.

The next statement should say it all about where i'm at in the whole process now:
"Not enough random bytes available, Give the OS a chance to collect some more entropy!" LOL. (pacman-key --init) it's been like that for over 10 mins, and i have a quad core 3.2GHz, with 8GB of ram.

So i just reinstalled. Now i have to reinstall onto the other arch box i pulled the hard drive from.

I assume all the problems were due to a steadily failing hard drive, as opposed to just a bad file system. I tried every partition method in the book and GRC SpinRite to get the hard drive to at least work for a while longer. It's got a big black permie marker X on it now, sitting on the desk.

I just had to correct capitals again, it must be a habit.... I think it's my brain that's broken.

Last edited by wolfdogg (2012-03-01 02:05:51)

Arch Linux
