So, at work we have this software that reads data from a machine and stores it in a proprietary binary format. There is a separate program that translates that binary into a CSV. My goal is to write my own program that can read the binary and make a CSV out of it, but I just don't know what to do with this thing. I did a text dump of the binary in hex, and I can recognize the company's name and then the column titles from the CSV, but after that it's a mess. There is very little pattern, and the hex->ASCII is just random symbols with no meaning I can find.
Has anyone done something like this before? Any tips? I've googled, but without much luck.
Offline
Can you just bypass the proprietary format? I mean, just take the data from the machine and do what you want with it, without converting into and out of the binary format. Reverse engineering a binary format is a long, tedious process, but there's info on it if you google for "reverse engineer binary file format".
Last edited by brianhanna (2010-03-04 18:32:39)
Offline
Do you know C programming? Assembler?
If not, give up now.
Linux user since redhat 6.1. former gentooer, former slacker. Now archer.
Offline
Don't give up. Just try to make sure it will be worth the time you'll need to spend learning.
Offline
Unfortunately no, we can't just bypass the format. It reads the raw data in this format, then we have to tell it to convert to CSV. I have been looking at it in a hex editor and not getting too far. I actually do see some repetition, which is good, because some columns in the corresponding CSV are all the same number.
The worst part is that it is probably only collecting some kind of voltage or current readout and then converting those into values for the CSV, in which case I won't have much of a clue about what is what other than location in the file. THEN I'll have to convert from whatever trickery the software measures to the values I want.
I do know my way around C, but it's been a while since I've done something so low level. Assembly, no.
Oh, believe me, I know this will be tedious. It already is...
EDIT: This is intended to be an ongoing project. There is only one computer that has the software on it, and since it's also running/monitoring the equipment, getting it to open and convert the files is like pulling teeth. If we could just scp the raw files off of it and convert them somewhere else, it would save us a lot of pain.
Last edited by pogeymanz (2010-03-04 18:58:10)
Offline
you can try to find columns by changing one byte at a time and checking when the next column in the csv output is affected (a rough sketch is below).
once you've got your columns, create sample files with binary patterns (0*, f*, ...) and see if you can figure out what happens in the black box.
if you can (easily) automate the process of conversion, that would obviously be helpful.
other than that, you can always fire up a debugger, though that will require some understanding of low-level stuff
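something like this, maybe (untested; file names and the bit-flip choice are just placeholders):

#include <stdio.h>
#include <stdlib.h>

/* copy input to output, inverting the single byte at the given offset,
 * so the converter's CSV output can be diffed against the original */
int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s infile outfile offset\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    if (!in || !out)
        return 1;
    long target = atol(argv[3]);
    long pos = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (pos == target)
            c ^= 0xFF;   /* flip every bit of the target byte */
        fputc(c, out);
        pos++;
    }
    fclose(in);
    fclose(out);
    return 0;
}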
Last edited by raf_kig (2010-03-04 19:08:15)
Offline
you can try to find columns by changing one byte at a time and checking when the next column in the csv output is affected.
once you've got your columns, create sample files with binary patterns (0*, f*, ...) and see if you can figure out what happens in the black box.
if you can (easily) automate the process of conversion, that would obviously be helpful.
other than that, you can always fire up a debugger, though that will require some understanding of low-level stuff
That's a good idea. I'm going to go ahead and try something like that to find my columns.
Offline
Maybe you could try to first create a file that contains just one column and one row? Then modify that cell's value and check what changes in the hex editor. Then gradually move up to more rows & columns.
Note that I have never done anything like this before, but it might be a reasonable way to tackle the problem. Also: keep us updated on your progress! I'm definitely interested!
Offline
Do you know the bit length of any of the data? Is it integer or floating point? Are there any strings?
If there are no strings, you should be able to figure out the length of each variable (they are probably all the same length if none of the data is strings).
The file is probably fairly straightforward, something like this:
[Header]
[Record1]
[Record2]
[etc...]
so you can figure out an approximate record length by converting to CSV and counting the rows:
record size ≈ binary file size / record count
This assumes you have a lot of rows and the header is relatively small; a quick sketch of the estimate is below.
Now if all your values are integers or floats, you can do record size / column count to get an approximate value length.
The value length will probably be a whole number: 2, 4, or 8 bytes (16, 32, or 64 bit).
That is assuming the simplest case, mind you, but a variation should help anyway.
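For instance (untested; the file name and row count are placeholders, the row count coming from the CSV):

#include <stdio.h>

/* estimate bytes per record: binary file size / CSV row count,
 * assuming many rows and a relatively small header */
int main(void)
{
    const long row_count = 1000;        /* rows counted in the CSV */
    FILE *f = fopen("data.bin", "rb");
    if (!f)
        return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);
    printf("~%ld bytes per record\n", size / row_count);
    return 0;
}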
Cheers,
Jon Estey
Offline
Though I don't want to discourage you, I suggest you think this through again...
Guessing binary data is... well, just guessing. You never know if you really
grasped the format or if you've just dealt with certain files until now and the
next one will introduce some format specialty you haven't encountered yet. Not
to mention what happens if the developers decide to bring out a new version of
the format (yes, this happens!).
If you don't have access to the format documentation from the company that
defines it, I would really advise you to give up on this, unless you see it as a
hobby and have the nerves and time to change your parsing code over and over
again, because that's what you'll have to do.
Offline
To add another idea: depending on how you connect to this machine, build a sniffer* that checks what data is transmitted, and ideally also which way it is transmitted.
If you get enough data you might be able to bypass the software entirely. Note that in this case the project time span has probably gone above one year; if it's a lot of data, it may be several years.
*If it is RS-232 or parallel this should be easy if you know electronics; if it's USB/Ethernet it is difficult. If you don't know electronics, I guess there is stuff you can buy, but building one is not an option.
Last edited by tlvb (2010-03-05 15:48:00)
I need a sorted list of all random numbers, so that I can retrieve a suitable one later with a binary search instead of having to iterate through the generation process every time.
Offline
Update:
Well, it turns out that this isn't going to be so hard. The proprietary binary stores the data in little-endian floating-point numbers (well, the hex representation of that... I don't know the lingo, sorry). Basically, I just have to convert from Hex to little-endian floats for the actual numbers. Also, every line is 472 bytes long, so I can just add a newline after every 472 bytes worth of data.
Now to come up with a simple way of doing that and I'll be pretty much done...
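Something like this minimal sketch, maybe (untested; it assumes the whole file is 472-byte rows of 4-byte floats, a little-endian machine, and no header to skip — the file name is a placeholder):

#include <stdio.h>

#define ROW_BYTES 472
#define FLOATS_PER_ROW (ROW_BYTES / sizeof(float))   /* 118 with 4-byte floats */

int main(void)
{
    FILE *in = fopen("data.bin", "rb");
    if (!in)
        return 1;
    float row[FLOATS_PER_ROW];
    /* read one 472-byte record at a time and print it as a CSV line */
    while (fread(row, 1, ROW_BYTES, in) == ROW_BYTES) {
        for (size_t i = 0; i < FLOATS_PER_ROW; i++)
            printf(i ? ",%g" : "%g", row[i]);
        putchar('\n');
    }
    fclose(in);
    return 0;
}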
We started out on a hunch that this company wasn't going to go out of their way to make their binaries uncrackable. Why would they want to? It's software that goes with a multi-thousand-dollar machine; it isn't like they even sell the software separately...
Offline
The proprietary binary stores the data in little-endian floating-point numbers (well, the hex representation of that... I don't know the lingo, sorry).
I would guess the binary file is storing the data in....well....binary, not hex. It is a binary file after all. You just happen to be viewing the data in a hex viewer.
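A tiny example of the distinction (the bytes here are contrived, but they show the idea): the four raw bytes that a hex viewer displays as 00 00 80 3F are the little-endian float 1.0.

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[4] = { 0x00, 0x00, 0x80, 0x3F };  /* little-endian 1.0f */
    float f;
    memcpy(&f, bytes, sizeof f);   /* reinterpret the raw bytes as a float */
    printf("%g\n", f);             /* prints 1 on a little-endian machine */
    return 0;
}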
Offline
As an aside, have you asked the company for the file format? It's unlikely, but they may give you the spec, or details on an easier method of accessing it.
Processing binary files in C/C++ is relatively easy, easier than text files actually.
Something along these lines...
typedef struct
{
    float value1;
    float value2;
    float value3;
    char pad[472];
} DATA_ROW;

FILE *fid = fopen("inputfile","r");
DATA_ROW buffer;
while (fread(&buffer, 1, 472, fid) = 472)
{
    // Process data line
}
fclose(fid);
The proper way of doing this would be to have all the appropriate values in the structure and not have the pad variable there at all; barring that, having the pad variable allows you to read a full row into the structure without worrying about overflowing.
Cheers,
Jon
Offline
Don't forget to use "b" in the specifier to fopen. Otherwise, depending on your platform, you may get e.g. newline mutations that will totally wreck your binary data.
FILE *fid = fopen("inputfile", "rb");
Offline
Don't forget to use "b" in the specifier to fopen. Otherwise, depending on your platform, you may get e.g. newline mutations that will totally wreck your binary data.
FILE *fid = fopen("inputfile", "rb");
Only on non-POSIX-conforming systems. I suppose if OP is doing this on Windows (eww) then that would be a necessity, but he likely isn't.
** typo: 'then' not 'than'
Last edited by Peasantoid (2010-03-10 22:20:38)
Offline
Maybe, but it's only one more character to the code, and it's quite possible -- even very likely -- that the future will see the OP needing to run this same code on another platform. Not necessarily Windows -- what about IBM's mainframes or any other EBCDIC based system? Given that this is supposed to replace an older proprietary program, it's possible that it needs to run on something even more obscure.
In other words: It may not matter on UNIX systems, but you should do it anyway.
Offline
Maybe, but it's only one more character to the code, and it's quite possible -- even very likely -- that the future will see the OP needing to run this same code on another platform. Not necessarily Windows -- what about IBM's mainframes or any other EBCDIC based system? Given that this is supposed to replace an older proprietary program, it's possible that it needs to run on something even more obscure.
In other words: It may not matter on UNIX systems, but you should do it anyway.
Actually, you're correct, except we aren't working on anything that fancy. Just a Windows machine and a bunch of Macs.
I'll give more of an update later, but I'm only in front of the computer for a minute right now.
Offline
Processing binary files in C/C++ is relatively easy, easier than text files actually.
Something along these lines...
The proper way of doing this would be to have all the appropriate values in the structure and not have the pad variable there at all; barring that, having the pad variable allows you to read a full row into the structure without worrying about overflowing.
Cheers,
Jon
There are more issues with that sample:
You place three floats in the struct (12 bytes) plus 472 bytes of padding, but then you read 472-byte blocks.
You want to read the file in blocks of the same size as DATA_ROW, so I would directly use sizeof(DATA_ROW).
If you know the block size but not all the fields (e.g. you ltraced the program), you can add padding chars to DATA_ROW so that it sums to the total size (if they are 472-byte blocks, you would pad with 460 bytes), or make it a union of the known fields with a 472-byte array.
You might also have padding issues from the compiler; consider using __attribute__((packed)).
You perform the comparison with a single equals sign (although in this case it's not too dangerous, since it will refuse to compile).
Plus, fread might return a different count without it being an error.
Reading binary files is not hard; figuring out what fields are there is. Also, take endianness into account when reading yet-unknown bytes. A corrected sketch is below.
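Putting those fixes together, the sketch might become something like this (field names are still guesses; __attribute__((packed)) is gcc-specific):

#include <stdio.h>

typedef struct
{
    float value1;
    float value2;
    float value3;
    char pad[472 - 3 * sizeof(float)];   /* 460 bytes with 4-byte floats */
} __attribute__((packed)) DATA_ROW;      /* gcc-specific: no compiler padding */

int main(void)
{
    FILE *fid = fopen("inputfile", "rb");  /* "b" for non-POSIX platforms */
    if (!fid)
        return 1;
    DATA_ROW buffer;
    /* read blocks of exactly sizeof(DATA_ROW), compare with == */
    while (fread(&buffer, 1, sizeof(DATA_ROW), fid) == sizeof(DATA_ROW)) {
        /* process one 472-byte record here */
    }
    fclose(fid);
    return 0;
}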
Offline
There are more issues with that sample:
You place three floats in the struct (12 bytes) plus 472 bytes of padding, but then you read 472-byte blocks.
You want to read the file in blocks of the same size as DATA_ROW, so I would directly use sizeof(DATA_ROW).
If you know the block size but not all the fields (e.g. you ltraced the program), you can add padding chars to DATA_ROW so that it sums to the total size (if they are 472-byte blocks, you would pad with 460 bytes), or make it a union of the known fields with a 472-byte array.
You might also have padding issues from the compiler; consider using __attribute__((packed)).
You perform the comparison with a single equals sign (although in this case it's not too dangerous, since it will refuse to compile).
Plus, fread might return a different count without it being an error.
Reading binary files is not hard; figuring out what fields are there is. Also, take endianness into account when reading yet-unknown bytes.
Since we're being picky, you should probably avoid trying to process data with a comment as well.
The padding was done for simplicity: the whole record is 472 bytes, so if you're switching things around a lot it's easier to make your buffer too big and not use it all than to recalculate it every time you change something.
Endianness is a good point. I think it would be safe to assume (at least to start with) that it's not an issue, as the software is written for Windows and probably adheres to those standards.
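If it ever does become an issue (say, the converter ends up running on a big-endian box), assembling the floats byte by byte keeps it portable; a rough sketch:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* assemble a little-endian 32-bit float regardless of host byte order */
static float read_le_float(const unsigned char b[4])
{
    uint32_t bits = (uint32_t)b[0]
                  | (uint32_t)b[1] << 8
                  | (uint32_t)b[2] << 16
                  | (uint32_t)b[3] << 24;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    unsigned char one[4] = { 0x00, 0x00, 0x80, 0x3F };  /* 1.0f, little-endian */
    printf("%g\n", read_le_float(one));
    return 0;
}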
Cheers
Offline