Pages: 1
Topic closed
Hi,
I tried to download some files recursively over FTP, but it seems that wget can't find the files under symlinked folders by itself.
At first I used
wget -r -N ftp://ftp.ncbi.nih.gov/snp/database/organism_schema
(All the folders in organism_schema are symlinks, and I want the files underneath them)
but this downloads only the symlinks themselves, not the files underneath them. I tried --retr-symlinks but it gave me an error:
preecha@preecha-laptop:~/dbsnp$ wget -r -N --retr-symlink ftp://ftp.ncbi.nih.gov/snp/database/organism_data | tee error.txt
--2009-01-26 11:07:52-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data
=> `ftp.ncbi.nih.gov/snp/database/.listing'
Resolving ftp.ncbi.nih.gov... 130.14.29.30
Connecting to ftp.ncbi.nih.gov|130.14.29.30|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /snp/database ... done.
==> PASV ... done. ==> LIST ... done.
[ <=> ] 843 5.24K/s in 0.2s
2009-01-26 11:08:02 (5.24 KB/s) - `ftp.ncbi.nih.gov/snp/database/.listing' saved [843]
Removed `ftp.ncbi.nih.gov/snp/database/.listing'.
--2009-01-26 11:08:02-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/organism_data
=> `ftp.ncbi.nih.gov/snp/database/organism_data/.listing'
==> CWD /snp/database/organism_data ... done.
==> PASV ... done. ==> LIST ... done.
[ <=> ] 4,955 2.14K/s in 2.3s
2009-01-26 11:08:06 (2.14 KB/s) - `ftp.ncbi.nih.gov/snp/database/organism_data/.listing' saved [4955]
Removed `ftp.ncbi.nih.gov/snp/database/organism_data/.listing'.
--2009-01-26 11:08:06-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/arabidopsis_3702
=> `ftp.ncbi.nih.gov/snp/database/organism_data/arabidopsis_3702'
==> CWD not required.
==> PASV ... done. ==> RETR arabidopsis_3702 ...
No such file `arabidopsis_3702'.
--2009-01-26 11:08:07-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/bee_7460
=> `ftp.ncbi.nih.gov/snp/database/organism_data/bee_7460'
==> CWD not required.
==> PASV ... done. ==> RETR bee_7460 ...
No such file `bee_7460'.
--2009-01-26 11:08:08-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/bison_9901
=> `ftp.ncbi.nih.gov/snp/database/organism_data/bison_9901'
==> CWD not required.
==> PASV ... done. ==> RETR bison_9901 ...
No such file `bison_9901'.
--2009-01-26 11:08:10-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/blackbird_39638
=> `ftp.ncbi.nih.gov/snp/database/organism_data/blackbird_39638'
==> CWD not required.
==> PASV ... done. ==> RETR blackbird_39638 ...
No such file `blackbird_39638'.
--2009-01-26 11:08:11-- ftp://ftp.ncbi.nih.gov/snp/database/organism_data/bonobo_9597
=> `ftp.ncbi.nih.gov/snp/database/organism_data/bonobo_9597'
==> CWD not required.
==> PASV ... done. ==> RETR bonobo_9597 ...
No such file `bonobo_9597'.
Did I do something wrong here? Any suggestions would help a lot. Thanks!
Last edited by Tg (2009-01-27 04:00:46)
wget man page wrote:
When --retr-symlinks is specified, however, symbolic links are
traversed and the pointed-to files are retrieved. At this time,
this option does not cause Wget to traverse symlinks to directories
and recurse through them, but in the future it should be enhanced
to do this.
I think I've found a good workaround:
#!/bin/bash
URL=$1
# Strip the ftp:// prefix to get the local mirror path that wget creates.
DIRPATH=$(echo "$URL" | sed 's-ftp://--')
# The hostname is the first path component.
BASEURL=$(echo "$DIRPATH" | cut -d '/' -f1)
wget -r -N "$URL"
SYMLINKS=$(find "$DIRPATH" -type l)
for SYMLINK in $SYMLINKS
do
    TARGET=$(readlink "$SYMLINK")
    if [ "${TARGET:0:1}" = "/" ]
    then
        # Absolute target: resolve against the server root.
        URI="ftp://$BASEURL$TARGET"
    else
        # Relative target: resolve against the symlink's own directory.
        URI="ftp://$(dirname "$SYMLINK")/$TARGET"
    fi
    echo "retrieving $URI"
    wget -r -N "$URI"
    # ./wget_symlinks "$URI"
done
You can save that as "wget_symlinks", then
chmod 744 wget_symlinks
./wget_symlinks ftp://ftp.ncbi.nih.gov/snp/database/organism_schema
I didn't download everything so I don't know if there are further symlinks. If there are, comment out the line "wget -r -N $URI" and uncomment "# ./wget_symlinks $URI". It should download everything recursively and then exit.
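For anyone adapting this, the absolute-vs-relative branch can be exercised locally without touching an FTP server. A minimal sketch (the symlink targets below are made up purely for illustration):

```shell
#!/bin/bash
# Demo of the symlink-target check used in the script above, run against
# throwaway local symlinks (hypothetical targets, no network needed).
tmp=$(mktemp -d)
ln -s /shared/data "$tmp/abs_link"   # absolute target
ln -s ../data "$tmp/rel_link"        # relative target
for link in "$tmp"/*_link; do
    target=$(readlink "$link")
    if [ "${target:0:1}" = "/" ]; then
        echo "$(basename "$link") -> absolute: $target"
    else
        echo "$(basename "$link") -> relative: $target"
    fi
done
rm -rf "$tmp"
```

Running it should print one "absolute" line for abs_link and one "relative" line for rel_link, which is exactly the decision the script makes before building each ftp:// URI.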
Last edited by Xyne (2009-01-26 05:36:10)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
I never thought of using a bash script (I've never written more than 5 lines of bash). I guess I'll try that, thanks a lot.
Last edited by Tg (2009-01-26 07:27:56)
Np, I enjoy having little excuses to learn more about bash scripting.
If the script works as expected, please edit your original post and add [SOLVED] to the beginning of subject line.
It seems like this might be addressing a problem along the same lines that I am trying to solve, but I tried running the script and it didn't work for me.
At ftp://ftp.ncbi.nlm.nih.gov/genomes/genb … _versions/ there are several directories, all prefixed with "GCA". The files I need are reached by following those links, but the links are not being recognized as directories, perhaps due to a bug on the server that lists symbolic links to directories incorrectly. If these links were treated as directories, I believe the following command would do what I need, which is to descend into those directories and get all the .fna.gz files.
wget -nd -rl 0 -A '*.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/genomes/genb … ocomialis/
So I am wondering if there is a work around. Is there a way to tell `wget` to download files by a regex that matches complete urls? I tried
wget -nd -rl 0 --accept-regex ".*\/GCA.*\/.*\.fna\.gz$" ftp://ftp.ncbi.nlm.nih.gov/genomes/genb … _versions/
This pattern does indeed match, for example, ftp://ftp.ncbi.nlm.nih.gov/genomes/genb … mic.fna.gz
As per https://regex101.com/r/pY0bI8/1
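For what it's worth, the pattern can also be sanity-checked from the shell with grep before suspecting wget's traversal. The URL below is made up for illustration (the real paths in this thread are truncated), and this only confirms the pattern matches, not that wget will ever visit the URL:

```shell
# Hypothetical URL; checks only whether the --accept-regex pattern matches.
pattern='.*/GCA.*/.*\.fna\.gz$'
url='ftp://example.org/genomes/GCA_000001/x_genomic.fna.gz'
echo "$url" | grep -Eq "$pattern" && echo match
```

If this prints "match" but wget still skips the files, the problem is more likely the symlinked directories never being recursed into, so the candidate URLs are never generated in the first place.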
But I'm still not getting the files I need.
Ultimately, all I want is .fna.gz files located at ftp://ftp.ncbi.nlm.nih.gov/genomes/genb … sions/GCA*
Any suggestions about how I can modify Xyne's script to work for me?
@truthling,
Please open a new thread and link to this one. Also, use code tags for posting your commands.
https://wiki.archlinux.org/index.php/Fo … bumping.22
Closing.