
#1 2017-08-01 10:17:19

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Looking for a file copy daemon

I've been wanting this for so long it must exist.

I'm looking for a commandline tool to start multiple file copies one after the other without blocking user interaction; that is, the copies are queued and processed in the background by some daemon.

As a bonus, the copy daemon could parallelize copies in a smart way: copies that don't involve the same devices can usually be done in parallel at max speed.
The daemon could also throttle the IO: set a max transfer speed, stop parallelizing after the IO write speed has reached some threshold, etc.

A typical workflow could be as follows:

$ cpd -start
copy daemon started.

$ cpc /media/1a/foo /media/1b/foo
Copy '/media/1a/foo'->'/media/1b/foo'
$ cpc /media/2a/bar /media/2b/bar
Copy '/media/2a/bar'->'/media/2b/bar'
$ cpc /media/2a/qux /media/2b/qux  # Since this copy uses the same devices as the previous one, it is not run in parallel.
Queue '/media/2a/qux'->'/media/2b/qux'

$ cpd -status
1: [copying] '/media/1a/foo'->'/media/1b/foo'
2: [copying] '/media/2a/bar'->'/media/2b/bar'
3: [queue] '/media/2a/qux'->'/media/2b/qux'

$ cpd -pause 2  # /media/2a and /media/2b are not used anymore so copy '3' is free to go.
Pause '/media/2a/bar'->'/media/2b/bar'
Copy '/media/2a/qux'->'/media/2b/qux'
$ cpd -resume 2
Queue '/media/2a/bar'->'/media/2b/bar'

$ sleep 42
$ cpd -ls
Queue is empty.

By itself such a tool would be awesome, but most importantly, since it is a commandline tool, it could replace the very limited copy functionality of Emacs (Dired, Helm), ranger, <insert-your-favourite-file-manager>.

Last edited by Ambrevar (2017-08-01 10:24:37)


#2 2017-08-01 10:37:06

x33a
Forum Fellow
Registered: 2009-08-15
Posts: 4,587

Re: Looking for a file copy daemon

So basically you want files to be copied in the background? With parallel transfers for different devices and serial for the same device? BTW, IO can be tweaked with ionice.


#3 2017-08-01 12:24:53

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

The parallel/serial feature could be configured via commandline flags, and the behaviour could be further overridden at runtime.

I know about ionice, but it's too generic: it's meant for IO scheduling at the kernel level and it works per process. Thus if I ionice the following process

cp /media/1a/foo /media/2a/bar /media/3/

the copy of foo and bar will be performed with the same niceness.  It lacks fine-grained, device-dependent tuning.

Last edited by Ambrevar (2017-08-01 14:35:41)


#4 2017-08-01 18:08:11

dmerej
Member
From: Paris
Registered: 2016-04-09
Posts: 101
Website

Re: Looking for a file copy daemon

I don't know about such a tool, but the functionality you are looking for reminds me of what "advanced" graphical file managers do ...

Ultracopier, for instance.

Maybe there's a library somewhere wanting to be written ...


Responsible Coder, Python Fan, Rust enthusiast


#5 2017-08-01 22:31:28

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Indeed, if it were not for the graphical part... smile

As for a library, I'm not sure as I'd say at first glance that most of the job is writing the daemon itself.


#6 2017-08-02 01:28:36

NoSuck
Member
Registered: 2015-03-04
Posts: 157
Website

Re: Looking for a file copy daemon

man batch

batch executes commands when system load levels permit; in other words, when the load average drops below 0.8, or the value specified in the invocation of atd.

There was also another utility like batch that did not necessarily reference system load levels for job queue execution, allowing users to execute jobs manually.  I forget its name.  It was at one time a semi-standard utility; I seem to remember reading about it in an O'Reilly book.

There are plenty of other ways to solve the posited problem without requiring a wholly separate "copy daemon".  I'm curious as to why you went that avenue.  Try a screen multiplexer.  Try backgrounding with ^z.  Try batch.  Try nqs.

Last edited by NoSuck (2017-08-02 01:29:11)


#7 2017-08-03 16:11:07

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Multiplexers and shell job control (^z + fg/bg) have the design defect that they interface something that does not semantically belong to them: if I close my shell, I lose all the job information.  As for multiplexers (or the dtach tool, for that matter), they make it rather inconvenient to bring job control to Emacs (Dired, Helm), vifm, ranger, <insert-your-fav-file-manager>.

I completely forgot that at/batch existed, thanks for the reminder.  As stand-alone job managers, they are close to what I'm looking for.  But unless I'm mistaken, there is no simple way to:
- see the command of a job.
- pause/resume jobs.
- look at stdout / stderr (other than by e-mail)
- parallelize the queue according to the source/target devices.

I'm not looking for kernel IO scheduling; I'd rather process the copy queue based on the source/target devices: when 2 copy jobs don't share devices, they can usually be run in parallel for maximum efficiency.  I want a tool that can do that automatically, hence the original question about a copy daemon specifically.



It is sensible to say, however, that a "copy daemon" is a shortsighted idea: what if I want the same features as rsync in my daemon?  I can't afford to re-implement all the features of every copy program in the world.

I've had this crazy idea (read: maybe stupid and unfeasible) about a generic job control system which would handle device-based queueing heuristically.  Bear with me...  Our tool would work in the fashion of batch/at with these additional features:

- Pause / resume jobs.
- Set stdin/stdout/stderr when queueing a command.
- Display job details. (command, stdin, stdout, stderr)
- Re-order the queue.
- Optionally queue new jobs with device-based logic.

It is possible to infer the source/target devices of most commands by looking at their parameters: for each argument that contains a path separator (typically "/"), if the assumed path exists, get its underlying device.  That heuristic should work in cases like

$ cp /media/a /media/b
$ rsync /media/a /media/b

It would fail when a subfolder in the argument contains a mount point to another device.  We could also allow specifying the source/target devices on the commandline.
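A minimal sketch of that heuristic in Python (the function names are made up for illustration, this is not an existing tool): every existing path-like argument is mapped to the device its inode lives on via `st_dev`, and two jobs may run in parallel when their device sets are disjoint.

```python
import os

def infer_devices(argv):
    """Heuristically collect the devices touched by a command line.

    Any argument that contains a path separator and actually exists
    is mapped to the device its inode lives on (st_dev); flags and
    non-existent paths are simply skipped.
    """
    devices = set()
    for arg in argv[1:]:  # skip the command name itself
        if os.sep in arg and os.path.exists(arg):
            devices.add(os.stat(arg).st_dev)
    return devices

def can_parallelize(cmd_a, cmd_b):
    """Two jobs may run in parallel when their device sets are disjoint."""
    return infer_devices(cmd_a).isdisjoint(infer_devices(cmd_b))
```

As noted above, this fails on nested mount points and followed symlinks; it only inspects the top-level arguments.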

Then the only thing that'd be missing compared to a dedicated copy daemon is some sort of progress tracking (count of processed files over total count) or ETA.  But at least we could read the stdout, which gives us some information with `rsync -v` or `cp -v`.

Last edited by Ambrevar (2017-10-12 13:48:21)


#8 2017-08-04 11:41:17

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,866

Re: Looking for a file copy daemon

Ambrevar wrote:

It would fail when a subfolder in the argument contains a mount-point to another device.

Parse mount output to build a table holding all devices used (it should be kept up to date).
Write a lookup function that checks which device in the table corresponds to a given path.

Another table would link active jobs and the devices they use.

This sounds like a wrapper / toolkit / library that may have some daemon functionality.
Tables / databases will play a big role in this; I suggest you start with a high-level design.


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)


#9 2017-08-04 16:22:26

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Lone_Wolf wrote:

Parse mount output to build a table holding all devices used (should be kept up to date).

Write a lookup function that checks which device in the table corresponds with a given path.

There might be a misunderstanding regarding the issue.  Let me lay down a concrete example.  Say you have the following hierarchy:

/foo
/bar
/bar/baz
/bar/qux

And the following mount points:

/dev/sda1 /foo
/dev/sdb1 /bar
/dev/sdc1 /bar/qux

Now if I run

jobctl -device-based-scheduling cp -a /bar /foo

jobctl can trivially find that sda1 and sdb1 are involved, but it will fail to see sdc1 without a recursive lookup.

Since a recursive walk can be costly, I could add a commandline option to use it.  Either way the efficiency gain becomes somewhat questionable.

Lone_Wolf wrote:

Tables / databases will play a big role in this; I suggest you start with a high-level design.

Databases?  I am not sure.  The number of devices is usually pretty small, and unless I'd go for the recursive walk, there would not be many folders to track.
Even with the recursive option, every call would need a rescan to track filesystem changes, thus the database wouldn't help much.

In the end all we need to keep in memory is the association of job:list-of-devices.
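That job-to-device-set association is enough for a toy scheduler sketch (hypothetical, not an existing tool): a submitted job starts immediately when its devices clash with no running job, and finishing a job promotes any queued job that no longer clashes.

```python
from collections import deque

class DeviceScheduler:
    """Track job -> device-set; a queued job starts only when its
    device set is disjoint from every running job's."""

    def __init__(self):
        self.running = {}      # job id -> set of devices
        self.queue = deque()   # (job id, set of devices), in submission order
        self._next_id = 1

    def _clashes(self, devices):
        return any(not devices.isdisjoint(d) for d in self.running.values())

    def submit(self, devices):
        """Queue a job; start it at once if no running job shares a device."""
        job, self._next_id = self._next_id, self._next_id + 1
        if self._clashes(devices):
            self.queue.append((job, set(devices)))
        else:
            self.running[job] = set(devices)
        return job

    def finish(self, job):
        """Mark a job done and promote queued jobs that no longer clash."""
        self.running.pop(job)
        still_waiting = deque()
        while self.queue:
            queued, devices = self.queue.popleft()
            if self._clashes(devices):
                still_waiting.append((queued, devices))
            else:
                self.running[queued] = devices
        self.queue = still_waiting
```

This is exactly the job:list-of-devices map described above; pause/resume would just move entries between the two structures.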

Last edited by Ambrevar (2017-08-04 16:23:30)


#10 2017-08-04 16:47:46

drcouzelis
Member
From: Connecticut, USA
Registered: 2009-11-09
Posts: 4,092
Website

Re: Looking for a file copy daemon

Since it sounds like we've already established that such an application doesn't already exist, I have a different question... big_smile

Why? What is your specific use case? I can't think of a situation where I would even have enough time to start another file copy command before the first one finished...


#11 2017-08-04 17:51:57

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

When you have a huge load of data on different drives and happen to back them up on other drives...


#12 2017-08-06 08:43:20

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Or whenever slow drives are involved (USB sticks, old computers), when you want to run a huge compilation at the same time, etc.


#13 2017-08-06 13:23:02

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,866

Re: Looking for a file copy daemon

It all depends on what scope you want the program to be useful in, and how easy it should be to change that scope.

A high-level description of the "copy job controller" (cjc for short) could look like this:
Prioritize multiple copy operations based on which devices are involved so they finish as fast as possible.

This makes it clear that the cjc needs to be able to determine which devices are involved in copy operations.
It also needs a way to keep track of the running jobs.
In order to optimize the jobs, the cjc needs to understand the capabilities of the devices and determine clashes.

How this is achieved depends on the scope(s) you target.

Situation 1: all devices are local, always present on one computer system which has only one user.

Device presence and capabilities are static; the job list is dynamic.
Coding this will be (relatively) simple.

Situation 2: same as 1, but we add removable devices like external drives, CD/DVD players, media players, smartphones.
Now our device list becomes dynamic and bigger.

Situation 3: we add non-local mountpoints like a NAS and cloud storage.

Situation 4: our user gets involved in website maintenance or administrating an IRC server and requires sshfs for those.

Situation 5: the family members of our user notice their systems often stall during copying, while the system of our user keeps running fine.
They want to have that functionality as well.

Situation 6: a friend of the user has a small company with 50 employees, and asks the user if the cjc can be adapted for his firm.


I could continue, but I think that's enough to clarify what I mean.

Adapting a cjc designed specifically for case 1 to function in case 6 or 9 is almost impossible.
A cjc that was written specifically for case 6 could very well be useless for case 1.

On the other hand, a well-designed cjc will be (relatively) easy to adapt to any use case.

----------------------
As for databases: don't let that term scare you.
Any list you make is essentially a database.

Last edited by Lone_Wolf (2017-08-06 13:25:35)


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)


#14 2017-08-07 09:21:00

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Thank you for this insight, it keeps the idea going. You raised some very relevant points.

1-4: I think these are fundamentally the same thing. A dynamic list of devices does not cause any trouble: all we need to know is whether the devices are different or not. It does not matter if they are SSHFS mounts or USB sticks.  We can be completely generic here.

1-4 (again): It might matter, however, in a different respect: transfer speed.  Should it be taken into account for the queue priorities?  That would add a significant level of complexity to the algorithm and maybe to the user interface.  I am not sure how useful that would be.  I think the naive approach already solves the main issue: a first slow transfer does not block subsequent fast transfers on other devices.

5-6: Again, 5 and 6 are the same problem. That's a very good point I had not thought about though.  A few possibilities:
- Ignore the issue and run a per-user daemon.  Simple, and it works in most cases.  Despite ruining cjc's initial purpose on servers, farms and anything that involves multiple users with heavy tasks, it would not be any worse than the currently available tools.
- Have a global daemon, which means we need to manage access rules.  Even if I don't implement this at first, I suppose I should keep it in mind during the design; that would help tremendously for future work.

As for "cjc", I was trying to say in my previous post that it would be more interesting to go completely generic in terms of "job control", so that it could handle copies as well as other kinds of jobs. As such I would rather call it "jc" wink

Last edited by Ambrevar (2017-08-07 10:10:03)


#15 2017-08-07 13:08:06

progandy
Member
Registered: 2012-05-17
Posts: 5,184

Re: Looking for a file copy daemon

Ambrevar wrote:

jobctl can trivially find that sda1 and sdb1 are involved, but it will fail to see sdc1 without a recursive lookup.

Since a recursive walk can be costly, I could add a commandline option to use it.  Either way the efficiency gain becomes somewhat questionable.

Why is it expensive? You need to know that you write to /foo, read from /bar, and do it recursively. The output of mount should contain all mountpoints, so you will be able to see that there is a mount in /bar and in /bar/nested without traversing the filesystem. Problems only arise if you need to follow symlinks to other volumes.
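This observation can be checked cheaply: the mount table alone reveals every mountpoint nested below a source directory, with no filesystem traversal. A small illustrative helper (the name is made up for this sketch):

```python
import os

def nested_mounts(path, mountpoints):
    """Return the mountpoints lying strictly below path.

    Only the mount table is consulted; the filesystem itself is
    never walked (symlinked mounts are therefore not detected).
    """
    path = os.path.abspath(path).rstrip("/") or "/"
    prefix = "/" if path == "/" else path + "/"
    return [m for m in mountpoints if m != path and m.startswith(prefix)]
```

With the hierarchy from the earlier example, asking about /bar would report /bar/qux without ever touching the disk.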


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |


#16 2017-08-07 14:22:47

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,866

Re: Looking for a file copy daemon

Ambrevar wrote:

Even if I don't implement this at first, I suppose I should keep this in mind during the design, that would help tremendously for future work.

Apply that to the whole concept and you get the point I wanted to make.

As for 1 to 4 being similar: that was done on purpose to draw attention to the different capabilities of source/destination, possibly resulting in the need for multiple optimisation algorithms.

Last edited by Lone_Wolf (2017-08-07 14:23:40)


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)


#17 2017-08-07 14:58:54

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

progandy wrote:

Problems only arise if you need to follow symlinks to other volumes.

Yes, that's what I meant, sorry if my example was not so good. Many copy programs can follow symlinks (cp, rsync...), which means that there is no other way but to do a complete tree walk to scan for all devices.  That is costly.


#18 2017-08-07 20:28:13

progandy
Member
Registered: 2012-05-17
Posts: 5,184

Re: Looking for a file copy daemon

Ambrevar wrote:
progandy wrote:

Problems only arise if you need to follow symlinks to other volumes.

Yes, that's what I meant, sorry if my example was not so good. Many copy programs can follow symlinks (cp, rsync...), which means that there is no other way but do a complete tree walk to scan for all devices.  That is costly.

You can either ignore it, live with the delay for the tree walk, or build a specialized copy job scheduler with your own copy routine instead of running an external binary. Then you could examine symlinks during the copy process and pause one of the operations until the device is ready again. That is more work than calling cp, though.

Last edited by progandy (2017-08-07 20:28:48)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |


#19 2017-08-08 08:48:47

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

1. Ignore it: yes, but see the next point.

2. Live with the delay of the tree walk: also yes, so I was thinking of leaving the option to the user as a commandline flag. Now the big question is: which one should be the default behaviour?

3. Specialized copy job scheduler: this was my original post, but the more I think about it, the less it makes sense to me.  It would be a daunting task which would mostly consist of re-inventing the wheel (cp and/or rsync) while it would lose the power of being a generic tool (e.g. running `jc make` is awesome).


#20 2017-10-12 14:23:59

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

I've been thinking about the whole idea a little more and I came up with a few good reasons why a "generic job controller" would never reach a concrete state:

1. The whole optimization issue is essentially I/O scheduling, which is highly dependent on _current_ hardware and interfaces.  Any optimization would be rendered useless by the first change in the peripherals landscape.  Thus any attempt at genericity will be fatally shortsighted, I believe.

2. As for the job control, a lot is to be done, namely managing subprocesses and their respective input/output, which would eventually lead to building yet another shell...  while my initial intent was to make a tool that could work independently of any shell.

A possible ideal would be a workflow along those lines: start all the heavy jobs with

dtach -c /tmp/foo bash -c "ionice ..."

This has the following benefits:

- It works in any environment.
- Jobs continue working when the user logs out.
- Jobs can be paused and resumed thanks to bash's Ctrl+z.
- Every job's actual output can be inspected independently since we are merely running several shells.
- I/O can be somewhat tuned.
- Assume the kernel handles peripheral-based I/O queuing efficiently...


#21 2017-11-17 07:25:38

loh.tar
Member
Registered: 2010-10-17
Posts: 49

Re: Looking for a file copy daemon

Hello Ambrevar,

I found this interesting, so I tried to code it, just for fun. Some feedback would be nice.

Was Santa early this year? ;-)

Lothar


#22 2017-11-20 16:38:38

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: Looking for a file copy daemon

Wow, I did not expect such a committed prototype "just for fun"! Thanks for giving it a shot!
I'll look into it and come back to you then.

In the meantime I've developed my own approach to this with Eshell (the Emacs shell).

https://github.com/Ambrevar/dotfiles/bl … -detach.el

It uses a combination of dtach, bash and tee to manage processes.
By pressing Shift+<enter> instead of <enter> to send a commandline to the underlying shell, it starts a dtach session of bash running the user-specified commandline.
It allows closing Emacs or even the user session, resuming from any shell (not just Eshell), inspecting stdout, stderr, or stdout+stderr independently, and pausing and resuming the process.

While the code is Eshell-specific, it's rather short (the longest part is temporary workarounds that should hopefully not be needed some day) and it can easily be adapted to any shell so that Shift+<return> does just that.

I'll see if I can combine this with your work or do any device-dependent scheduling some way.

Last edited by Ambrevar (2017-11-20 17:02:26)

