ChangeLog:
v3.2: Fixed a major memory allocation bug, where I forgot to allocate <amount I needed> + 1 to take the required NULL byte into account. Also tidied up the rather deep/complex/comprehensive/you-get-the-idea if/then block that controls the white/blacklist system. Also, in this version, -d X actually does something.
v3.1: A few new untested flags (-o, -a, -s, -e, -i, -j, and improved -d); user-definable whitelisting and blacklisting.
v3.0: Not released. Basically 3.1 but without some cleanup/bugfixes.
v2: New flag (-p); fixes a major bug involving HTTP responses that arrive split across more than one chunk.
v1: Initial release.
Disclaimer: This program does not respect /robots.txt, the convention that lets sites exclude automated search/scraping engines from indexing them. Use that lack of compliance at your own discretion!
This kind of utility almost certainly already exists, but since I want to learn C, I'm writing as much stuff as I can in C, whenever I can. Sure, writing this in C made it take like 3-4 days to finish, one of which had me up until almost 4am, but it was really fun to make, so worth it IMHO.
What wgetls does is take any HTTP URL - such as a directory listing index page - and recursively collect the URLs of all the files under it. URLs found in a page that end with / are assumed to be directories; those that don't end with / are assumed to be files.
In the interest of brevity and cutting mostly useless information, only files are returned; directories show up only when verbosity is enabled.
To use wgetls, you pass a minimum of two parameters: the first is a URL, and the second is a filename, which can be `-' to dump everything to stdout. You can only omit the filename if you pass one of the options that displays the results of scanning the server some other way (-t or -d l).
So, as an example, if you pass http://example.com/ (no, that isn't a real listing page), it might return:
http://example.com/somedir/file1
http://example.com/anotherdir/hello
http://example.com/anotherdir/world
even though the structure of example.com might actually have had 16 other dirs in it, all of them empty.
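(The dir-vs-file test really is just that one trailing character. Here's the rule in miniature - is_directory is only an illustrative name, not a function in wgetls:)

#include <string.h>

/* illustrative sketch: the classification rule wgetls applies to each href */
static int is_directory(const char *href) {
size_t len = strlen(href);
return len > 0 && href[len - 1] == '/';
}

So "somedir/" would be entered and scanned, while "file1" would just be written to the listing.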
Here's the output of wgetls --help:
wgetls is a link retrieval tool for HTML pages.
usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]
[-i str] [-j str] [-d lchX]
you can specify options anywhere (even together, like `-td'), but `URL' has
to come before `listfile', even if option(s) are interspersed between them.
the option parser is very lenient, and will consider stuff like `-dt X' valid.
general options:
URL the URL to look at. cURL supports many types of URLs, but
this program only makes use of normal HTTP URLs.
listfile the file to output the listing to; may be `-' for stdout.
if you omit this option, a listing will not be created, but so
this program does not execute a run in vain, you can't omit this
option unless you specify -t or -d l instead.
-h ...
-t print a tree of the listing to stderr as it's discovered.
the tree code isn't perfect, but it does work.
-p by default, wgetls will prepend the URL you passed to all
the subsequent URLs it finds. this will turn that feature off.
-o overwrite `listfile' if it exists
-a append to `listfile' if it exists
matching options (can be combined):
by default, wgetls will only follow a URL if it ends with `/'. however...
-f ...will cause wgetls to follow every link it finds. this is almost
certainly not what you want. this invalidates the 4 other
options below.
-s str ...will cause the URL to only be followed if it starts with `str'
-e str ...will cause the URL to only be followed if it ends with `str'
-i str ...will cause the URL to not be followed if it starts with `str'.
regardless of what you use here, if a URL ends with "../" it
will be skipped.
-j str ...will cause the URL to not be followed if it ends with `str'
debugging params that work with -d (which can, you guessed it, be combined):
l log everything as it happens - absolutely everything. this
includes files/dirs found, parser info, and so on.
c set cURL's CURLOPT_VERBOSE value on.
h print HTTP responses received.
X print the value of `i' (the main parser index variable) to stdout
constantly. only use this if the program crashes, so you have a
way of knowing where it died, because this can make a terminal
get very messy...
warning: this program is prone to segfaulting - don't worry if it crashes.
just poke about with the various debug options to see what's going
wrong.
Written by David Lindsay <dav7@dav7.net> 16-21st Dec 08.
>> This is public domain software, and you're using version 3.2. <<
Notes
- You will need libcurl ('curl' package) installed for this to work, since wgetls uses libcurl for HTTP handling.
- Both gcc and tcc (a small, 32-bit-only compiler for x86) successfully compile wgetls.
- Under gcc, compilation with -Wall and -pedantic produces no errors. Since wgetls uses the non-ANSI fsync() and snprintf() functions, compiling with -ansi will produce two warnings referencing those functions.
Compile with <gcc or tcc> -o wgetls wgetls.c -lcurl.
The code
#include <curl/curl.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#define VERSION "3.2"
CURL *curl_handle;
int tree = 0;
int debuglog = 0, debughttp = 0, debugindex = 0;
int noprepend = 0;
int alwaysfollow = 0;
int append = 0, overwrite = 0;
char *linkprefix = NULL, *linksuffix = "/";
int linkprefixlen = 0, linksuffixlen = 0;
char *blacklistprefix = NULL, *blacklistsuffix = NULL;
int blacklistprefixlen = 0, blacklistsuffixlen = 0;
char *listfile = NULL;
int file, nestlevel = 0;
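/* one page buffer (and its size) per directory nesting level, grown as the scan recurses */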
char **pagedata;
int *pagesize;
void memerr() {
printf("wgetls: fatal memory error!\nQuitting, your data up to this point was hopefully saved to `%s'\n", listfile);
exit(0);
}
void readpage(char *url) {
int i;
char *pos;
int k;
char tmp;
char *link;
int nonempty = 0;
char **charptr;
int *intptr;
int s;
int urllen = strlen(url);
char quotestr[40];
int follow;
int linklen;
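/* parser state: saw "<a", saw "href", currently inside the quoted URL */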
int foundlink = 0, href = 0, inquote = 0;
if (debuglog) fprintf(stderr, "entering readpage(), nestlevel = %d\n", nestlevel);
curl_easy_setopt(curl_handle, CURLOPT_URL, url);
curl_easy_perform(curl_handle);
if (debughttp) printf("-- start HTTP response --\n%s\n-- end HTTP response --\n", pagedata[nestlevel]);
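/* on Apache-style index pages, jump to the "Parent Directory" link so the page header's markup is skipped */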
pos = strstr(pagedata[nestlevel], "Parent Directory");
i = (pos == NULL ? 0 : pos - pagedata[nestlevel]);
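/* walk the page one byte at a time, looking for <a ... href="..."> link tags */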
while(i < pagesize[nestlevel]) {
if (debugindex) printf("\n\n>> parser index: %d\n\n", i);
if (!strncmp(pagedata[nestlevel] + i, "<a", 2)) {
foundlink = 1;
snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i); /* grab up to 39 chars of context for the log */
if (debuglog) fprintf(stderr, "found '<a' at offset %d: %s...\n", i, quotestr);
}
if (foundlink && !strncmp(pagedata[nestlevel] + i, "href", 4)) {
href = 1;
snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
if (debuglog) fprintf(stderr, "found 'href' attr at offset %d: %s...\n", i, quotestr);
}
if (foundlink && href && pagedata[nestlevel][i] == '"') {
snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
if (debuglog) fprintf(stderr, "found quote (\") at offset %d: %s...\n", i, quotestr);
if (!inquote) {
if (debuglog) fprintf(stderr, "entering quoted string at offset %d\n", i);
inquote = 1;
s = i + 1;
} else if (inquote) {
if (debuglog) fprintf(stderr, "string end found at offset %d\n", i);
follow = 0;
if (linkprefix != NULL)
if (!strncmp(pagedata[nestlevel] + s, linkprefix, linkprefixlen))
follow = 1;
if (linksuffix != NULL)
if (!strncmp(pagedata[nestlevel] + i - linksuffixlen, linksuffix, linksuffixlen)) /* the URL ends at i - 1, so its suffix starts at i - linksuffixlen */
follow = 1;
if (blacklistprefix != NULL)
if (!strncmp(pagedata[nestlevel] + s, blacklistprefix, blacklistprefixlen))
follow = 0;
if (blacklistsuffix != NULL)
if (!strncmp(pagedata[nestlevel] + i - blacklistsuffixlen, blacklistsuffix, blacklistsuffixlen))
follow = 0;
if (alwaysfollow) follow = 1;
if (follow) {
nonempty = 1;
tmp = *(pagedata[nestlevel] + i);
*(pagedata[nestlevel] + i) = '\0';
if (debuglog) fprintf(stderr, "found a valid URL: \"%s\"\n", pagedata[nestlevel] + s);
if (*(pagedata[nestlevel] + i - 1) == '/') {
if (debuglog) fprintf(stderr, "this is a directory.\n");
charptr = realloc(pagedata, (nestlevel + 2) * sizeof(char *));
if (!charptr) memerr();
pagedata = charptr;
intptr = realloc(pagesize, (nestlevel + 2) * sizeof(int));
if (!intptr) memerr();
pagesize = intptr;
pagedata[nestlevel + 1] = calloc(1, 1); /* empty but NUL-terminated, so strstr() is safe even on an empty page */
pagesize[nestlevel + 1] = 0;
link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);
if (!noprepend) memcpy(link, url, urllen);
memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
linklen = strlen(link);
if (nestlevel > 0 && linklen >= 3) {
if (!strncmp(link + (linklen - 3), "../", 3)) {
if (debuglog) fprintf(stderr, "...aaand it's the previous directory. no thanks, backing out.\n");
goto cont; /* skip the recursion below */
}
}
if (debuglog) fprintf(stderr, "entering directory: '%s'\n", link);
if (tree) {
printf(" ");
for (k = 0; k < nestlevel; k++) printf(" ");
printf("+- %s\n", pagedata[nestlevel] + s);
}
nestlevel++;
readpage(link);
nestlevel--;
if (debuglog) fprintf(stderr, "returned to nestlevel %d\n", nestlevel);
cont:
if (debuglog) fprintf(stderr, "leaving directory: '%s'\n", link);
free(link);
} else {
link = calloc((noprepend ? 0 : urllen) + strlen(pagedata[nestlevel] + s) + 100, 1);
if (!noprepend) memcpy(link, url, urllen);
memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
if (debuglog) fprintf(stderr, "discovered file: %s\n", link);
if (tree) {
printf(" ");
for (k = 0; k < nestlevel; k++) printf(" ");
printf("+- %s\n", pagedata[nestlevel] + s);
}
if (debuglog) fprintf(stderr, "commiting URL to file \"%s\": %s\n", listfile, link);
write(file, link, strlen(link));
write(file, "\n", 1);
if (debuglog) fprintf(stderr, "syncing file \"%s\"\n", listfile);
fsync(file);
if (debuglog) fprintf(stderr, "sync OK.\n");
free(link);
}
*(pagedata[nestlevel] + i) = tmp;
}
foundlink = href = inquote = 0;
}
}
i++;
}
if (!nonempty && tree) {
printf(" ");
for (k = 0; k < nestlevel; k++) printf(" ");
printf("+- (%s)\n", !nestlevel ? "no URLs" : "empty");
}
if (debuglog) fprintf(stderr, "leaving readpage(), nestlevel %d\n", nestlevel);
}
void showsyntax() {
puts("usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]");
puts(" [-i str] [-j str] [-d lchX]");
}
void usage() {
puts("wgetls is a link retrieval tool for HTML pages.\n");
showsyntax();
printf(
"\nyou can specify options anywhere (even together, like `-td'), but `URL' has\n"
"to come before `listfile', even if option(s) are interspersed between them.\n"
"\nthe option parser is very lenient, and will consider stuff like `-dt X' valid.\n\n"
);
printf(
"general options:\n"
" URL the URL to look at. cURL supports many types of URLs, but\n"
" this program only makes use of normal HTTP URLs.\n"
" listfile the file to output the listing to; may be `-' for stdout.\n"
" if you omit this option, a listing will not be created, but so\n"
" this program does not execute a run in vain, you can't omit this\n"
" option unless you specify -t or -d l instead.\n"
);
printf(
" -h ...\n"
" -t print a tree of the listing as it's discovered to stderr.\n"
" the tree code isn't perfect, but it does work.\n"
" -p by default, wgetls will prepend the URL you passed to all\n"
" the subsequent URLs it finds. this will turn that feature off.\n"
" -o overwrite `listfile' if it exists\n"
" -a append to `listfile' if it exists\n\n"
);
printf(
"matching options (can be combined):\n"
"by default, wgetls will only follow a URL if it ends with `/'. however...\n"
" -f ...will cause wgetls to follow every link it finds. this is almost\n"
" certainly not what you want. this invalidates the 4 other\n"
" options below.\n"
);
printf(
" -s str ...will cause the URL to only be followed if it starts with `str'\n"
" -e str ...will cause the URL to only be followed if it ends with `str'\n"
" -i str ...will cause the URL to not be followed it starts with `str'.\n"
" regardless of what you use here, if a URL ends with \"../\" it\n"
" will be skipped.\n"
" -j str ...will cause the URL to not be followed if it ends with `str'\n"
"\ndebugging params that work with -d (which can, you guessed it, be combined):\n"
);
printf(
" l log everything as it happens - absolutely everything. this\n"
" includes files/dirs found, parser info, and so on.\n"
" c set cURL's CURLOPT_VERBOSE value on.\n"
" h print HTTP responses recieved.\n"
" X print the value of `i' (the main parser index variable) to stdout\n"
" constantly. only use this if the program crashes, so you have a\n"
" way of knowing where it died, because this can make a terminal\n"
" get very messy..."
);
printf(
"\nwarning: this program is prone to segfaulting - don't worry if it crashes.\n"
" just poke about with the various debug options to see what's going\n"
" wrong.\n"
"\nWritten by David Lindsay <dav7@dav7.net> 16-21st Dec 08.\n\n"
">> This is public domain software, and you're using version %s. <<\n\n", VERSION
);
exit(0);
}
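/* libcurl write callback: append each incoming chunk to the buffer for the current nesting level */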
static size_t storedata(void *data, size_t size, size_t nmemb, void *userp) {
size_t len = size * nmemb;
char *tmp = NULL;
(void)userp; /* unused; libcurl's write callback takes four arguments */
tmp = realloc(pagedata[nestlevel], pagesize[nestlevel] + len + 1);
if (!tmp) memerr();
pagedata[nestlevel] = tmp;
memcpy(pagedata[nestlevel] + pagesize[nestlevel], data, len);
pagesize[nestlevel] += len;
pagedata[nestlevel][pagesize[nestlevel]] = '\0'; /* the realloc reserved the extra byte; keep the buffer NUL-terminated for strstr()/strncmp() */
return len;
}
void argerror() {
showsyntax();
puts("try `wgetls --help' for info on what this program does.");
exit(0);
}
void checkcmd(char *arg) {
if (!strstr(arg, "h")) {
printf("wgetls: unknown option `%s'\n", arg);
argerror();
} else {
usage();
}
}
void neednextarg(int i, int argc, char c) {
if (i == argc - 1) {
printf("wgetls: option expected for '-%c'; see -h for help.\n", c);
exit(0);
}
}
int main(int argc, char *argv[]) {
int i, j, k;
int state = 0;
char *url = NULL;
int len, len2;
struct stat statbuf;
int skip = 0;
int filestat;
curl_global_init(CURL_GLOBAL_ALL);
curl_handle = curl_easy_init();
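/* lenient argument scan: bundled flags like `-td' are unpacked one character at a time */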
for (i = 1; i < argc; i++) {
len = strlen(argv[i]);
if (skip) { skip = 0; continue; }
if (argv[i][0] == '-' && len > 1) {
for (j = 1; j < len; j++) {
switch(argv[i][j]) {
case 't':
tree = 1;
break;
case 'p':
noprepend = 1;
break;
case 'o':
overwrite = 1;
break;
case 'a':
append = 1;
break;
case 'f':
alwaysfollow = 1;
break;
case 's':
neednextarg(i, argc, 's');
linkprefix = argv[i + 1];
linkprefixlen = strlen(linkprefix);
skip = 1;
break;
case 'e':
neednextarg(i, argc, 'e');
linksuffix = argv[i + 1];
linksuffixlen = strlen(linksuffix);
skip = 1;
break;
case 'i':
neednextarg(i, argc, 'i');
blacklistprefix = argv[i + 1];
blacklistprefixlen = strlen(blacklistprefix);
skip = 1;
break;
case 'j':
neednextarg(i, argc, 'j');
blacklistsuffix = argv[i + 1];
blacklistsuffixlen = strlen(blacklistsuffix);
skip = 1;
break;
case 'd':
neednextarg(i, argc, 'd');
len2 = strlen(argv[i + 1]);
for (k = 0; k < len2; k++) {
switch(argv[i + 1][k]) {
case 'l':
debuglog = 1;
break;
case 'c':
curl_easy_setopt(curl_handle, CURLOPT_VERBOSE, 1L); /* curl_easy_setopt() expects a long here */
break;
case 'h':
debughttp = 1;
break;
case 'X':
debugindex = 1;
break;
default:
printf("wgetls: -d: unknown option `%c'. see -h for help.\n", argv[i + 1][k]);
exit(0);
}
}
skip = 1;
break;
default:
checkcmd(argv[i] + 1);
}
}
} else {
switch(state) {
case 0:
url = argv[i];
break;
case 1:
listfile = argv[i];
break;
default:
puts("wgetls: too many options");
argerror();
}
state++;
}
}
if (append && overwrite) {
puts("wgetls: error: user cannot make up their mind as to whether");
puts("they want to append to or overwrite an existing file");
argerror();
}
if (url == NULL) {
puts("wgetls: missing URL");
argerror();
}
if (listfile == NULL) {
printf("wgetls: ");
printf((debuglog || tree) ? "notice" : "error");
printf(": listfile not specified\n");
if (!(debuglog || tree)) {
printf("wgetls: you cannot omit a listfile without using -t or -d X instead\n");
exit(1);
}
} else {
if (strcmp(listfile, "-") != 0) {
filestat = stat(listfile, &statbuf);
if (filestat != -1) {
printf("wgetls: ");
printf(overwrite ? "warning" : (append ? "notice" : "error"));
printf(": file `%s' already exists%s\n", listfile, overwrite ? "; overwriting file" : append ? "; appending to file" : "");
if (!overwrite && !append) exit(0);
}
if ((append && filestat == -1) || !append) {
file = creat(listfile, 0600);
}
if (append) {
file = open(listfile, O_RDWR | O_APPEND);
}
if (file == -1) {
printf("wgetls: error accessing file `%s': %s\n", listfile, strerror(errno));
exit(0);
}
} else {
file = STDOUT_FILENO; /* `-' means write the listing to stdout */
}
}
curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, storedata);
curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "wgetls/" VERSION); /* keep the UA in sync with the real version */
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
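/* level-0 page buffer; storedata() grows it as each chunk of data arrives */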
pagesize = calloc(1, sizeof(int));
pagedata = calloc(1, sizeof(char *));
pagedata[nestlevel] = calloc(1, 1); /* empty but NUL-terminated */
pagesize[nestlevel] = 0;
if (debuglog) fprintf(stderr, "starting page scan of \"%s\"\n", url);
if (tree) fprintf(stderr, "%s\n", url); /* url, not argv[1] - options may come first */
readpage(url);
if (debuglog) fprintf(stderr, "scan complete.\n");
close(file);
curl_global_cleanup();
return 0;
}
Bump so everyone knows this is at version 3.1 now.
EDIT: fogobogo helped me discover a typo on line 456; "alredy" is "already" now.
Update!
v3.1 has a major bug! Even though it may appear to work properly, strange characters may leak onto the end of URLs, and you may even get segfaults. Sorry I didn't properly test it before releasing it!
Technical details:
I calloc()ed the memory for the link href improperly, per:
link = calloc((noprepend ? 0 : urllen) + i - s, 1);
Once I fixed it with "+ 1", per:
link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);
it worked great.
I may use a method that doesn't use a calloc() in the future; this works for now, and I'm not gonna knock it...
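For anyone following along, here's the off-by-one in miniature (copyn, src and n are hypothetical names, just to show the shape of the fix):

#include <stdlib.h>
#include <string.h>

/* hypothetical helper: copy n bytes of src into a fresh C string */
static char *copyn(const char *src, size_t n) {
char *p = calloc(n + 1, 1); /* the "+ 1" leaves room for the terminator */
if (p) memcpy(p, src, n); /* calloc zeroed p[n], so the result is NUL-terminated */
return p;
}

Without the "+ 1", the terminator lands one byte past the end of the allocation - hence the stray characters and occasional segfaults.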