
#1 2021-02-20 19:28:09

ErdosOrGauss
Member
Registered: 2017-06-21
Posts: 40

[SOLVED] End-to-End File Analysis -- Get the Binary and Convert

Hi all,

I'm not sure if this is the right place to ask this, please move if necessary.

The overall problem I'm having is that I don't know where to look or what terms to google. I'm sure resources exist on this topic, but finding people who know about them is difficult. If you have a suggestion, please post it. If I'm using improper terms to refer to ideas or processes, please correct me. I think my question falls in the realm of file encoding/transfer, but I'm not entirely sure. I have done some googling and ended up looking at file encoding material, but I haven't really made any progress.

Basically, I want to be able to take a file's binary representation, do some operations, and then convert it back into a file. As a simple test case, I would like to figure out how to take a file, get its binary representation, and then convert it right back into the same file in memory.

Furthermore, if I wanted to take, say, a .txt file and get the text's binary representation, would I simply process the file's content and "write" out the text's binary representation, process it with some operations, and then convert that binary string back into text and save it to a file? So, moving on to something a little more complex, if I wanted to take, say, a gif or mp3 file and do something that is isomorphic, how would I "process the file's content?" Specifically, let's say I take an mp3 file of "Happy Birthday," and I want to process the recording of it. How do I get the binary representation of "Happy Birthday"?

I know I can take a hex dump and get the binary string. However, this gives me a .txt file of a binary representation. I can then load that into a matrix, but then I only know how to convert it back into a .txt file, not, say, an mp3 file that would play whatever distortion I've applied -- or even just the original song.

I may be over-thinking this, and if I am, please tell me. It would save me a lot of trouble, lol.

Last edited by ErdosOrGauss (2021-02-21 01:29:57)

Offline

#2 2021-02-20 20:09:55

chaseleif
Member
From: Texas
Registered: 2020-08-01
Posts: 18

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

Certain file types carry extra information that describes how the file is encoded. If you open an input file in binary mode and copy all of its bytes to an output file, the output will be identical to the input.
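
As a minimal sketch of that byte-for-byte copy, assuming Python 3: the helper name copy_bytes and the file names original.bin and copy.bin are made up for illustration, and the demo writes its own throwaway input so it does not depend on any real file.

```python
import os
import tempfile


def copy_bytes(src_path, dst_path):
    """Read every byte of src_path and write it unchanged to dst_path."""
    with open(src_path, 'rb') as infile:
        data = infile.read()
    with open(dst_path, 'wb') as outfile:
        outfile.write(data)
    return data


# Build a throwaway input containing every possible byte value once.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'original.bin')
dst = os.path.join(workdir, 'copy.bin')
sample = bytes(range(256))
with open(src, 'wb') as f:
    f.write(sample)

copied = copy_bytes(src, dst)

# Read the copy back and compare it with what was written.
with open(dst, 'rb') as f:
    roundtrip = f.read()
print(roundtrip == sample)
```

Any processing step would go between the read and the write, operating on data before it is written back out.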

A text file can be opened and read in binary. If it is ASCII, every 8 bits is one character; Unicode encodings such as UTF-8 can use more than one byte per character: https://www.tutorialspoint.com/how-many … rs-in-java
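
To make the ASCII versus Unicode point concrete, here is a small sketch, assuming Python 3. The sample strings are arbitrary choices for illustration; the non-ASCII characters are written as escapes so the byte counts do not depend on how this page is encoded.

```python
ascii_text = 'Happy Birthday'
# n, a, i-with-diaeresis, v, e, space, c, a, f, e-acute, space, hot-beverage sign
unicode_text = 'na\u00efve caf\u00e9 \u2615'

ascii_bytes = ascii_text.encode('ascii')          # one byte per character
utf8_bytes = unicode_text.encode('utf-8')         # one to four bytes per character
utf16_bytes = unicode_text.encode('utf-16-le')    # two bytes per character here

print(len(ascii_text), len(ascii_bytes))
print(len(unicode_text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the same encoding restores the original text.
print(utf8_bytes.decode('utf-8') == unicode_text)
```

The character count and the byte count only agree for the ASCII string, which is why a byte-level tool has to know the encoding before it can talk about "characters".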

It really depends on what you are trying to do. Here's a short Python script that reads an endian-converter shell script (a text file) as binary. Each byte is then an integer; lowercase ASCII letters are converted to uppercase, and the result is written out once as binary and once as text:

$ cat swapendian.sh 
#! /bin/bash

if [ -r "$1" ]
then
	xxd -e -g4 ${1} | xxd -r > endianswapped_${1}
	echo "Created endianswapped_${1}"
else
	echo "Cannot read file \"${1}\""
fi
$ cat bin.py
#! /usr/bin/env python

with open('swapendian.sh','rb') as infile:
    data = infile.read()
    print(f'bytes(file) = \n{data}')
    strval=''
    with open('upperbincopy','wb') as outfile:
        for byte in data:
            if byte<=ord('z') and byte>=ord('a'):
                byte-=0x20
            strval+=chr(byte)
            outfile.write(byte.to_bytes(length=1,byteorder='little',signed=False))
    print(f'toupper(file) = \n{strval}')
    with open('uppertextcopy','w') as outfile:
        outfile.write(strval)
$ python bin.py 
bytes(file) = 
b'#! /bin/bash\n\nif [ -r "$1" ]\nthen\n\txxd -e -g4 ${1} | xxd -r > endianswapped_${1}\n\techo "Created endianswapped_${1}"\nelse\n\techo "Cannot read file \\"${1}\\""\nfi\n'
toupper(file) = 
#! /BIN/BASH

IF [ -R "$1" ]
THEN
	XXD -E -G4 ${1} | XXD -R > ENDIANSWAPPED_${1}
	ECHO "CREATED ENDIANSWAPPED_${1}"
ELSE
	ECHO "CANNOT READ FILE \"${1}\""
FI

$ diff upperbincopy uppertextcopy 
$ 

By the way, Python's int.to_bytes() takes the byte order to use. Since this is done one byte at a time, it doesn't matter what is put there. My computer is little endian, so I put that.
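
A quick sketch of why the byte order is irrelevant for a single byte but not for wider values, assuming Python 3. The values 0x41 and 0x1234 are arbitrary examples.

```python
value = 0x41  # the ASCII code for 'A'

# With length 1 there is only one byte, so both orders give the same result.
one_little = value.to_bytes(1, byteorder='little')
one_big = value.to_bytes(1, byteorder='big')
print(one_little == one_big)

# With length 2 the two orders place the bytes in opposite positions.
wide = 0x1234
two_little = wide.to_bytes(2, byteorder='little')
two_big = wide.to_bytes(2, byteorder='big')
print(two_little == two_big)
print(two_little == two_big[::-1])

# bytes([n]) is another way to turn one integer in 0..255 into a single byte.
single = bytes([value])
print(single == one_little)
```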

Last edited by chaseleif (2021-02-20 20:20:33)

Offline

#3 2021-02-20 20:34:18

chaseleif
Member
From: Texas
Registered: 2020-08-01
Posts: 18

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

Specific file types, like MP3, follow a defined format:
https://en.wikipedia.org/wiki/MP3#File_structure
http://www.mp3-converter.com/mp3codec/mp3_anatomy.htm

You know how many bits should be in the header, what each bit means, etc. You just factor all of that in when you read the file.
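
As a rough sketch of "factoring that in", the following Python 3 function splits a 4-byte MPEG audio frame header into its bit fields, using the field widths described on the linked pages (11 sync bits, then version, layer, and so on). The function name parse_mp3_frame_header and the sample bytes FF FB 90 64 are illustrative assumptions, not something taken from a real file in this thread.

```python
def parse_mp3_frame_header(raw):
    """Split a 4-byte MPEG audio frame header into raw field values."""
    if len(raw) != 4:
        raise ValueError('an MPEG audio frame header is 4 bytes')
    # The header is laid out most significant bit first, so read it big-endian.
    header = int.from_bytes(raw, byteorder='big')
    return {
        'sync': (header >> 21) & 0x7FF,            # 11 bits, all ones
        'version': (header >> 19) & 0x3,           # 2 bits
        'layer': (header >> 17) & 0x3,             # 2 bits
        'protection': (header >> 16) & 0x1,        # 1 bit
        'bitrate_index': (header >> 12) & 0xF,     # 4 bits
        'sample_rate_index': (header >> 10) & 0x3, # 2 bits
        'padding': (header >> 9) & 0x1,            # 1 bit
        'private': (header >> 8) & 0x1,            # 1 bit
        'channel_mode': (header >> 6) & 0x3,       # 2 bits
        'mode_extension': (header >> 4) & 0x3,     # 2 bits
        'copyright': (header >> 3) & 0x1,          # 1 bit
        'original': (header >> 2) & 0x1,           # 1 bit
        'emphasis': header & 0x3,                  # 2 bits
    }


# Hypothetical header bytes, chosen only to exercise the function.
fields = parse_mp3_frame_header(bytes([0xFF, 0xFB, 0x90, 0x64]))
for name, value in fields.items():
    print(name, value)
```

The values are raw indices; turning them into a bitrate or sample rate still needs the lookup tables from the format description. Also note that many MP3 files begin with an ID3v2 tag, so the first four bytes of a file are not necessarily a frame header.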

Offline

#4 2021-02-21 01:28:41

ErdosOrGauss
Member
Registered: 2017-06-21
Posts: 40

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

Thank you so much for your answers. These were really helpful. I'm glad to know I was on the right track. This gave me a few things to google.

I ended up hopping into C and using fopen/fread to read the file's bytes and then printing each one in binary. I've pasted the code below. It doesn't account for encoding or anything, but the MP3 link that you posted is just what I needed. Thank you so much. My earlier attempts at googling ended up putting me with xxd, fold, sed, uuencoding, etc., which I guess is adjacent to what I wanted?

#include <stdio.h>
#include <stdlib.h>

int main(void){
    FILE *pFile;
    unsigned char *buffer;
    long size;

    pFile = fopen("test.txt","r");
    if (pFile == NULL){
        perror("Error opening file");
        return 1;
    }
    fseek(pFile, 0L, SEEK_END); //go to end of file
    size = ftell(pFile); //save size of file
    fseek(pFile, 0L, SEEK_SET); //reset to the beginning of file
    if (size < 0){
        perror("Error getting file size");
        fclose(pFile);
        return 1;
    }

    buffer = malloc(size > 0 ? (size_t)size : 1); //make array for file size
    if (buffer == NULL){
        perror("Error allocating buffer");
        fclose(pFile);
        return 1;
    }
    //read entire file and keep the number of bytes actually read
    size = (long)fread(buffer, 1, (size_t)size, pFile);
    fclose(pFile); //close file

    //print the contents of buffer formatted as binary
    for(long i = 0; i < size; i++){ //indexes the array
        for(int j = 7; j >= 0; j--){ //walks from the most to the least significant bit
            putchar(buffer[i] & (1 << j) ? '1' : '0');
        }
        putchar(' ');
    }
    putchar('\n');
    free(buffer);
    return 0;
}
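
The C program above covers one direction, bytes to a printed string of bits. For the way back, which is what the original question was after, here is a minimal Python 3 sketch. The helper names bytes_to_bits and bits_to_bytes and the sample input are made up for illustration, and the bit string uses the same layout as the C output (8 bits per byte, separated by spaces).

```python
def bytes_to_bits(data):
    """Format each byte as 8 binary digits, separated by spaces."""
    return ' '.join(format(byte, '08b') for byte in data)


def bits_to_bytes(bit_string):
    """Turn space-separated groups of binary digits back into bytes."""
    return bytes(int(group, 2) for group in bit_string.split())


original = b'Hi!\n'
bits = bytes_to_bits(original)
restored = bits_to_bytes(bits)

print(bits)
print(restored == original)
```

Writing restored to a file opened with 'wb' gives back a file of the original type, whether that was text, a GIF, or an MP3; the file type lives in the bytes themselves, not in how they were displayed along the way.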

Last edited by ErdosOrGauss (2021-02-21 01:43:36)

Offline

#5 2021-02-21 16:16:21

chaseleif
Member
From: Texas
Registered: 2020-08-01
Posts: 18

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

C is definitely better; Python was just easy for a short example.

Open the file with mode "rb" for binary; otherwise you are opening it in text mode.

The unsigned char type is a byte, so:

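#include <stdint.h>
#include <stdio.h>

//read_example is a hypothetical wrapper; infile must already be open in "rb" mode
void read_example(FILE *infile){
    //read a byte
    unsigned char byte;
    if (fread(&byte,sizeof(unsigned char),1,infile) != 1) return;
    //read a 32 bit header
    uint32_t mp3header;
    if (fread(&mp3header,sizeof(uint32_t),1,infile) != 1) return;
    //toy example
    //first bit is set
    if (mp3header>>31) {
        //following 8 bits
        int next8bits=(mp3header>>23)&0xFF;
        printf("%d\n",next8bits);
    }
}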

Last edited by chaseleif (2021-02-21 16:20:42)

Offline

#6 2021-02-21 17:08:18

progandy
Member
Registered: 2012-05-17
Posts: 4,510

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

Be careful and think about the byte order if you read binary data into unsigned integer variables. The file format might define one order while the processor uses another, so you might have to convert.
https://en.wikipedia.org/wiki/Endianness
You might even want to declare a custom structure with bitfields: https://www.tutorialspoint.com/cprogram … fields.htm
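
A small sketch of the byte-order issue, assuming Python 3 and its standard struct module. The four bytes are a hypothetical example; the point is only that the same bytes produce different integers depending on the order used to read them.

```python
import struct

# Four hypothetical bytes as they might appear in a file, in file order.
raw = bytes([0xFF, 0xFB, 0x90, 0x64])

# '>I' reads an unsigned 32-bit integer as big-endian, '<I' as little-endian.
as_big = struct.unpack('>I', raw)[0]
as_little = struct.unpack('<I', raw)[0]

print(hex(as_big))
print(hex(as_little))

# Extracting the top 11 bits only gives the expected all-ones pattern
# when the bytes were read in the order the format defines.
print(hex(as_big >> 21), hex(as_little >> 21))
```

Stating the byte order explicitly in the format string, rather than relying on the machine's native order, keeps the result the same on every platform.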

Edit: If you do not like C, you can think about C++, Rust, Go, ...

Last edited by progandy (2021-02-21 17:19:56)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Online

#7 2021-02-21 17:28:39

rowdog
Member
From: East Texas
Registered: 2009-08-19
Posts: 110

Re: [SOLVED] End-to-End File Analysis -- Get the Binary and Convert

chaseleif wrote:

Open the file "rb" for binary, otherwise you are opening it in text mode.

Linux makes no distinction between text and binary modes. In fact, POSIX systems ignore the "b" flag. See fopen(3).
That said, including the "b" is generally a good practice if your software might wind up running on Windows.

Offline
