Help with a script to process a few TB of images

srulop · 2012-02-12 00:54:30

Hello,

I have almost 400,000 images from a scientific experiment. All are in a binary format specific to the facility where the experiment took place, but they can be opened with Octave or ImageJ (both verified), and maybe with ImageMagick as well (I didn't succeed).
Each has a header that starts with "{" and ends with "}".
An example:

{
HeaderID       = EH:000001:000000:000000 ;
Image          = 1 ;
ByteOrder      = LowByteFirst ;
DataType       = UnsignedShort ;
Dim_1          = 2048;
Dim_2          = 2048;
Size           = 8388608;
count_time     = Na ;
point_no       = 0 ;
preset         = Na ;
col_end        = 2047;
col_beg        = 0;
row_end        = 2047;
row_beg        = 0;
col_bin        = 1;
row_bin        = 1;
time           = Wed Nov 23 21:41:05 2011;
time_of_day    = 1322080865.991661;
dir            = /data/visitor/HM1_diffTomo_zone2_LR2um_15x12;
suffix         = .edf;
prefix         = HM1_diffTomo_zone2_LR;
run            = 1;
title          = ESPIA FRELON Image 0001 [# 0];
time_of_frame  = 0.271140;
                                                                                                                                                                                                                                                                                                                             }
^daedee_c^bbbidg`hfigg`cabddeede_ggida...
MORE DATA MORE DATA.........
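
For reference, a header like the one above can be turned into a dictionary with a few lines of Python. This is only a sketch: parse_header is a hypothetical helper name, and it assumes every field has the form "key = value ;" on its own line, as in the example. The last lines check that Dim_1 x Dim_2 x 2 bytes (UnsignedShort) equals the Size field.

```python
def parse_header(text):
    """Parse 'key = value ;' lines from an EDF-style header into a dict.

    Assumption: one field per line, as in the example header above.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip().strip("{}").strip()
        if "=" not in line:
            continue  # skip the brace lines and padding
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().rstrip(";").strip()
    return fields


# Shortened copy of the example header, including the padded closing brace.
example = """{
HeaderID       = EH:000001:000000:000000 ;
ByteOrder      = LowByteFirst ;
DataType       = UnsignedShort ;
Dim_1          = 2048;
Dim_2          = 2048;
Size           = 8388608;
title          = ESPIA FRELON Image 0001 [# 0];
                    }
"""

header = parse_header(example)
width, height = int(header["Dim_1"]), int(header["Dim_2"])
# UnsignedShort is 2 bytes per pixel: 2048 * 2048 * 2 = 8388608
assert width * height * 2 == int(header["Size"])
```

All values stay strings until converted explicitly, since fields such as count_time hold "Na" rather than a number.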

The problem is that these files are very large and the format is not recognized by many "regular" image processing programs.

My goal is to make two files from each original: a text file containing only the header, and a lossless .tif image (converted from the binary data, minus the header).
I'm looking for a script that could do this for many images. Any ideas would be very helpful.

Thanks,
L.

Awebb · 2012-02-12 02:39:06

It is hard to tell you anything without knowing more about the files. Could you upload two examples (probably stripped of all relevant data or created as dummies)?

karol · 2012-02-12 03:10:14

Would

sed -n 2,25p $filename > $filename.header

work? It simply prints the lines 2 through 25 to a file.

sed 1,26d $filename > $filename.tif

prints all but the first 26 lines to a file.

srulop · 2012-02-12 11:19:23

Thanks for the replies!

The problem with:

sed -n 2,25p $filename > $filename.header

is that the header is not always 24 lines long. In some files it is longer, and in some it is shorter. The only reliable way to separate the header from the rest of the file is that it begins with "{" and ends with "}".

As for:

sed 1,26d $filename > $filename.tif

Even ignoring what was mentioned above, it can't be done like that, because the file is in raw binary format, while a .tif image is compressed. I was thinking of calling Octave or ImageMagick to somehow transform the file, stripped of its header, into a .tif, and then deleting the stripped file while leaving the original. I can't figure out how to do this though; I'm very new to scripting.

I uploaded one file for example:
http://dl.dropbox.com/u/14434681/HM1_di … LR0001.edf

Awebb, I could upload two or more, but they all are basically the same, except the length of the header and the data itself, obviously.

Thank you again for your help!

Roken · 2012-02-12 11:53:49

You say that ImageJ will view the files. I don't use ImageJ, but if the one I've found is the same one, you could write a batch file that uses ImageJ to convert each file.

Here: http://rsbweb.nih.gov/ij/docs/guide/use … Section-16 and here: http://rsbweb.nih.gov/ij/docs/guide/use … tion-23.10 are particularly relevant.

awk should get the header information for you:

awk '!p;/^\}/{p=1}'

Last edited by Roken (2012-02-12 12:11:08)

srulop · 2012-02-12 15:00:19

The problem is not only getting the info from the header, but also writing a new file that is like the original but without the header (because only then can I open it in ImageJ).

I found two useful sed commands that can do this.
The first prints everything between FOO and BAR:

sed -n '/FOO/,/BAR/p' test.txt

And the second - writes a file with everything except what's between FOO and BAR:

sed '/FOO/,/BAR/d' input.txt > output.txt

The only problem that remains is that when I replace FOO with { and BAR with }, it finds many instances of that combination in the file.
How can I do it only for the first instance of FOO and the first instance of BAR?
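
One way to stop at the first closing brace, sketched in Python rather than sed: read the file in binary chunks and stop at the first "}" followed by a newline. find_header_end is a hypothetical helper, and it assumes the header itself contains no "}" followed by a newline before the closing one.

```python
import io


def find_header_end(stream, chunk_size=512):
    """Return the byte offset just past the first b'}\\n' in a binary stream.

    Only the start of the file is read, so a multi-megabyte image body is
    never loaded. Raises ValueError if no header terminator is found.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("no closing '}' found")
        buffer += chunk
        end = buffer.find(b"}\n")
        if end != -1:
            return end + 2  # include the brace and the newline


# Tiny stand-in for an image file: header, then "pixel data" that also
# contains braces, which must not be mistaken for the header end.
fake = io.BytesIO(b"{\nDim_1 = 2 ;\n}\n\x01\x00{\x02}\n\x03\x00")
offset = find_header_end(fake)
fake.seek(0)
header_bytes = fake.read(offset)
body_bytes = fake.read()
```

With a real file, open(path, 'rb') would take the place of the BytesIO object; everything after the returned offset is image data.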

zorro · 2012-02-12 21:45:03

I have written a sed script that deletes the contents of the first opening and closing {}'s.

Here is a test file:
{
1
22
333
4444
}
55555
{
666666
}
7777777
88888888

sed -n ':start; s/{.*}//;t blank; N; s/\n//;t start; :blank;N;s/\n//; :rest;p;N;s/.*\n//;b rest' < test_file

Generates:
55555
{
666666
}
7777777
88888888

There must be a simpler way to achieve this...

Last edited by zorro (2012-02-12 21:46:47)

/dev/zero · 2012-02-12 22:35:39

srulop wrote:

How can I do it only for the first instance of FOO and the first instance of BAR?

What about this (for, say, test_file):

tail --lines=+$(($(grep -nm1 '}' test_file | awk -F':' '{print $1}')+1)) test_file

The grep piped through awk returns the line number of the first "BAR".

Roken · 2012-02-13 00:05:06

/dev/zero wrote:

What about this (for, say, test_file):
tail --lines=+$(($(grep -nm1 '}' test_file | awk -F':' '{print $1}')+1)) test_file
The grep piped through awk returns the line number of the first "BAR".

Y'know, I saw this as something of a challenge, and I spent ages trying to work that out. I finally got a rather inelegant solution using wc and bc, but I was sure it should be doable with grep and awk. TY

/dev/zero · 2012-02-13 00:18:41

Happy to help

igndenok · 2012-02-13 00:39:17

srulop wrote:

The only problem that remains is: when I switch FOO with { and BAR with }, it finds me many instances of such combination in the file.
How can I do it only for the first instance of FOO and the first instance of BAR?

Maybe you can try this?

sed -n '/^{/,/}/p' input

or this

sed '/^{/,/}/d' input > output

whitie · 2012-02-13 12:55:53

Hi,
just do it with Python:

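#!/usr/bin/env python

import sys

def main(filename):
    with open(filename, 'rb') as fp:
        # Split only at the first "}\n"; the pixel data may contain it too.
        header, imgdata = fp.read().split(b'}\n', 1)
    # The header is plain text, so decode it before writing it out.
    with open('{0}.txt'.format(filename), 'w') as fp:
        fp.write(header.decode('latin-1').strip('{\n '))
    # Note: this is still the raw pixel data, not a real TIFF file.
    with open('{0}.tif'.format(filename), 'wb') as fp:
        fp.write(imgdata)

if __name__ == '__main__':
    try:
        main(sys.argv[1])
    except IndexError:
        print('Usage: python {0} FILENAME'.format(sys.argv[0]))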

Whitie

srulop · 2012-02-13 13:38:45

igndenok, what you suggested won't work, because there may be a line that starts with "{" somewhere else in the file, not just the first line of the header.
I found a way to extract only the header using sed:

sed -n -e '/{/,/}/p' -e '/}/q' HM1_diffTomo_zone2_LR0001.edf > hout

And found a way to convert the image to tif, even if it does have a header:

convert -endian LSB -depth 16 -size 2048x2048+1024 gray:HM1_diffTomo_zone2_LR0001.edf -auto-level -compress zip image.tif

Here, 2048 is the x and y dimension, and 1024 is the length of the header in bytes.
So now there is no need to extract the header from the original, just to know its length. But another problem arose: not all the pictures are 2048 pixels in x and y (although all are square).

So now my goal is:
1) To copy the header to a new file. (now solved)
2) Define a variable holding its length in bytes. (du -b? But du gives me the length + name, and I just want the length...)
3) Define a variable holding the dimensions from the header, meaning extract the number from the line "Dim_1 = 2048; ". (how?)
4) Convert to tif using these two variables. (now solved)
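
Once the width, height and header length are known, the convert call can be generated per file. The sketch below only builds the argument list (it mirrors the command above, flag for flag) and leaves running it to subprocess; build_convert_args is a hypothetical name, not part of ImageMagick.

```python
def build_convert_args(src, dst, width, height, header_len):
    """Build the ImageMagick command from the post above for one image.

    The '+offset' suffix of -size tells convert how many header bytes
    to skip before the raw 16-bit little-endian pixel data starts.
    """
    return [
        "convert",
        "-endian", "LSB",
        "-depth", "16",
        "-size", "{0}x{1}+{2}".format(width, height, header_len),
        "gray:{0}".format(src),
        "-auto-level",
        "-compress", "zip",
        dst,
    ]


args = build_convert_args(
    "HM1_diffTomo_zone2_LR0001.edf", "image.tif", 2048, 2048, 1024
)
print(" ".join(args))
# To actually run it (requires ImageMagick to be installed):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. As far as I know, -auto-level rescales the pixel values, so it may need to be dropped if the .tif has to hold exactly the original numbers.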

Thanks for any ideas!

Last edited by srulop (2012-02-13 13:40:46)

Roken · 2012-02-13 14:33:57

You can get the dimensions with:

XDIM=$(grep -a -m1 'Dim_1' "$FILE" | sed 's/.*= \([0-9]*\).*/\1/')
YDIM=$(grep -a -m1 'Dim_2' "$FILE" | sed 's/.*= \([0-9]*\).*/\1/')

Where $FILE is either the name of the original file or the extracted header saved to a file. The -a makes grep treat a binary file as text, and -m1 stops at the first match.
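
The same extraction sketched in Python, working on raw bytes so it does not matter whether the input is the original file or the extracted header. get_dim is a hypothetical helper; it assumes the field looks like "Dim_1 = 2048;" as in the example header.

```python
import re


def get_dim(data, name):
    """Return the integer value of a 'name = <digits>' header field.

    `data` is bytes (e.g. the first kilobyte of the file); only the
    first match is used, so stray matches in pixel data are ignored.
    """
    pattern = re.escape(name.encode("ascii")) + rb"\s*=\s*(\d+)"
    match = re.search(pattern, data)
    if match is None:
        raise KeyError(name)
    return int(match.group(1))


# Made-up sample: a two-field header followed by two bytes of "pixel data".
sample = b"{\nDim_1          = 2048;\nDim_2          = 1024;\n}\n\x00\x01"
xdim = get_dim(sample, "Dim_1")
ydim = get_dim(sample, "Dim_2")
```

Reading only the first few kilobytes of each file before calling get_dim keeps this cheap even across hundreds of thousands of images.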

Last edited by Roken (2012-02-13 14:36:16)

srulop · 2012-02-13 15:40:31

Of course! Thanks Roken!

And for the length in bytes:

headLength=$(wc -c < "$header")
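
In Python the size of a file comes from os.path.getsize, which returns the byte count directly with no file name to strip off. A small sketch using a temporary stand-in for the extracted header file:

```python
import os
import tempfile

# Write a stand-in header file; a real one would come from the sed step.
with tempfile.NamedTemporaryFile(delete=False, suffix=".header") as tmp:
    tmp.write(b"{\nDim_1 = 2048;\n}\n")
    header_path = tmp.name

head_length = os.path.getsize(header_path)  # size in bytes, as an int
os.remove(header_path)
```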

Thank you very much guys, helped me a lot!

zorro · 2012-02-13 20:30:44

Using the test file from my previous post.

Create the header:

sed -n '/{/,/}/p; /}/q' < test_file > header

Extract the body:

comm -13 --nocheck-order header test_file > body

Arch Linux
