The file "foo" has been split into
foo.1
foo.2
foo.3
foo.4
To recreate "foo" on the command line and get rid of the parts, you could do this:
cat foo.* > foo && rm foo.*
How would you do this efficiently in C or Haskell?
Approaches
create foo
write foo.1 to foo
remove foo.1
write foo.2 to foo
remove foo.2
...
The data of each part is duplicated once.
open foo.1 for appending
write foo.2 to foo.1
remove foo.2
...
rename foo.1 foo
The data of each part except foo.1 is duplicated.
Both approaches require enough free disk space to accommodate the largest part.
The best method that I can think of would be to instruct the file system to simply consider the parts as a single fragmented file in situ. This would avoid almost all disk IO and would remove the need for extra space. Are there system calls that can do that? If there are, please provide any relevant links that you have (documentation, tutorials, libraries). Searches for such calls have a low signal-to-noise ratio.
If not, could the second approach be improved?
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
There have been a couple of posts concerning this recently on the btrfs mailing list.
Discussion: http://forum.nginx.org/read.php?30,90555,90581
btrfs info: http://permalink.gmane.org/gmane.comp.f … btrfs/5833
Btrfs already has a CLONE_RANGE ioctl that will clone a range of
(block-aligned) bytes from file A to any offset in file B. The fs just
fixes up the file metadata to reference the same bytes on disk without
reading or writing any actual file data.
Last edited by mikesd (2010-06-21 09:54:08)
Method 1 and Method 2 would use almost exactly the same amount of "work", except that with Method 1 you are creating a whole new foo file, while with Method 2 you are taking an existing file and renaming it. Creating new files is usually not the most efficient way to go (then again, I could be wrong).
EDIT: As you said, with Method 2 there is less data manipulation, so it would in theory be faster, though the difference would not even be noticeable unless used at an extremely large scale.
Last edited by cesura (2010-06-21 10:27:29)
@mikesd
That is exactly what I want, but even if btrfs were ready now I couldn't assume its presence. Ideally I would want something universal but as that's probably impossible, I would need at least something that's part of the POSIX standard. I realize that the diversity of filesystems probably precludes this. I was hoping that there would be some common system call because most (all?) filesystems experience fragmentation, which must mean that they are able to specify non-contiguous regions on the disk as belonging to the same file.
@itsbrad212
I only listed the first method (creating an entirely new file) to provide a comparison to the second (which I will be using in the absence of something better). I just wanted to show what I meant by "efficiency" as the second method avoids moving around the bytes in foo.1.
I agree that the difference is most likely negligible but I want something that can scale reasonably well (either to large files, or to multiple parallel operations). Besides, I find it conceptually attractive in general to minimize overhead. Even if it has no real practical effect, it's fun from a design perspective. I see programming as puzzle-solving most of the time.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
@mikesd
That is exactly what I want, but even if btrfs were ready now I couldn't assume its presence. Ideally I would want something universal but as that's probably impossible, I would need at least something that's part of the POSIX standard. I realize that the diversity of filesystems probably precludes this. I was hoping that there would be some common system call because most (all?) filesystems experience fragmentation, which must mean that they are able to specify non-contiguous regions on the disk as belonging to the same file.
Yeah, it sounds like this is still a ways away. Someone else mentioned that this behavior would also be doable with a filesystem with built-in de-duplication. I think those filesystems are still in the testing phase as well, not sure.