
#1 2008-12-18 03:35:48

dav7
Member
From: Australia
Registered: 2008-02-08
Posts: 674

wgetls: a URL retrieval tool for HTML pages [v3.2]

ChangeLog:
v3.2: Fixed a major memory allocation bug, where I forgot to allocate <amount I needed> + 1 bytes so the required NUL terminator would fit. Also tidied up the rather deep/complex/comprehensive/you-get-the-idea if/then block that controls the white/blacklist system. Also, in this version, -d X actually does something.
v3.1: A few new untested flags (-o, -a, -s, -e, -i, -j, and improved -d); user-definable whitelisting and blacklisting.
v3.0: Not released. Basically 3.1 but without some cleanup/bugfixes.
v2: New flag (-p); fixes a major bug with HTTP responses that arrive split across more than one write-callback chunk.
v1: Initial release.

Disclaimer: This program does not respect /robots.txt, the convention sites use to exclude automated search/scraping engines/systems from indexing them. Use it at your own discretion!

This kind of utility almost certainly already exists, but since I want to learn C, I'm writing as much stuff as I can in C, whenever I can. Sure, writing this in C made it take like 3-4 days to finish, one of which had me up until almost 4am, but it was really fun to make, so worth it IMHO. big_smile

What wgetls does is take any HTTP URL - such as a directory listing index page - and recursively discover every file under it. URLs found in the page that end with / are assumed to be directories and are followed; those that don't end with / are assumed to be files.

In the interest of brevity and cutting mostly useless information, only files are returned; directories are shown only when verbosity is enabled.

To use wgetls, you pass a minimum of two parameters: the first is a URL, and the second is a listfile - either '-' to dump everything to stdout, or a real filename. You can only omit the listfile if you pass one of the options that display the scan results some other way (-t or -d l).

So, as an example, if you pass http://example.com/ (no, that isn't a real listing page), it might return:

http://example.com/somedir/file1
http://example.com/anotherdir/hello
http://example.com/anotherdir/world

even though the structure of example.com might actually have had 16 other dirs in it, all of them empty.
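
The directory-or-file decision is nothing more than a check on the URL's last character. As a minimal sketch of the rule (the helper name here is mine, not something in wgetls):

#include <string.h>

/* hypothetical helper mirroring wgetls' default rule: a trailing `/'
   means directory (follow it), anything else means file (list it) */
static int looks_like_directory(const char *url) {
    size_t len = strlen(url);
    return len > 0 && url[len - 1] == '/';
}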

Here's the output of wgetls --help:

wgetls is a link retrieval tool for HTML pages.

usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]
              [-i str] [-j str] [-d lchX]

you can specify options anywhere (even together, like `-td'), but `URL' has
to come before `listfile', even if option(s) are interspersed between them.

the option parser is very lenient, and will consider stuff like `-dt X' valid.

general options:
  URL       the URL to look at. cURL supports many types of URLs, but
            this program only makes use of normal HTTP URLs.
  listfile  the file to output the listing to; may be `-' for stdout.
            if you omit this option, no listing is created; so the run
            isn't wasted, you must specify -t or -d l instead.
  -h        ...
  -t        print a tree of the listing as it's discovered to stderr.
            the tree code isn't perfect, but it does work.
  -p        by default, wgetls will prepend the URL you passed to all
            the subsequent URLs it finds. this will turn that feature off.
  -o        overwrite `listfile' if it exists
  -a        append to `listfile' if it exists

matching options (can be combined):
by default, wgetls will only follow a URL if it ends with `/'. however...
  -f        ...will cause wgetls to follow every link it finds. this is almost
               certainly not what you want. this invalidates the 4 other
               options below.
  -s str    ...will cause the URL to only be followed if it starts with `str'
  -e str    ...will cause the URL to only be followed if it ends with `str'
  -i str    ...will cause the URL to not be followed if it starts with `str'.
               regardless of what you use here, if a URL ends with "../" it
               will be skipped.
  -j str    ...will cause the URL to not be followed if it ends with `str'

debugging params that work with -d (which can, you guessed it, be combined):
   l       log everything as it happens - absolutely everything. this
           includes files/dirs found, parser info, and so on.
   c       set cURL's CURLOPT_VERBOSE value on.
   h       print HTTP responses received.
   X       print the value of `i' (the main parser index variable) to stdout
           constantly. only use this if the program crashes, so you have a
           way of knowing where it died, because this can make a terminal
           get very messy...

warning: this program is prone to segfaulting - don't worry if it crashes.
         just poke about with the various debug options to see what's going
         wrong.

Written by David Lindsay <dav7@dav7.net> 16-21st Dec 08.

>> This is public domain software, and you're using version 3.2. <<
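
Taken together, the matching options boil down to a single follow/skip predicate. Here's a rough standalone sketch of that decision; the function and parameter names are mine, and the real code inlines this logic in readpage() (note that -e effectively defaults to `/'):

#include <string.h>

/* sketch only: wl_* correspond to -s/-e, bl_* to -i/-j, always to -f */
static int should_follow(const char *url,
                         const char *wl_prefix, const char *wl_suffix,
                         const char *bl_prefix, const char *bl_suffix,
                         int always) {
    size_t len = strlen(url);
    int follow = 0;
    if (wl_prefix && !strncmp(url, wl_prefix, strlen(wl_prefix)))
        follow = 1;
    if (wl_suffix && len >= strlen(wl_suffix)
        && !strcmp(url + len - strlen(wl_suffix), wl_suffix))
        follow = 1;
    if (bl_prefix && !strncmp(url, bl_prefix, strlen(bl_prefix)))
        follow = 0;
    if (bl_suffix && len >= strlen(bl_suffix)
        && !strcmp(url + len - strlen(bl_suffix), bl_suffix))
        follow = 0;
    if (always)
        follow = 1; /* -f overrides both lists */
    if (len >= 3 && !strcmp(url + len - 3, "../"))
        follow = 0; /* "../" is skipped no matter what (below the top level) */
    return follow;
}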

The code

Notes
- You will need libcurl ('curl' package) installed for this to work, since wgetls uses libcurl for HTTP handling.
- Both gcc and tcc (a small, 32-bit-only compiler for x86) successfully compile wgetls.
- Under gcc, compiling with -Wall and -pedantic produces no warnings. Since wgetls uses the non-ANSI fsync() and snprintf() functions, compiling with -ansi will produce two warnings referencing those functions.

Compile with <gcc or tcc> -o wgetls wgetls.c -lcurl.

#include <curl/curl.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define VERSION "3.2"

CURL *curl_handle;

int tree = 0;
int debuglog = 0, debughttp = 0, debugindex = 0;
int noprepend = 0;
int alwaysfollow = 0;
int append = 0, overwrite = 0;
char *linkprefix = NULL, *linksuffix = "/";
int linkprefixlen = 0, linksuffixlen = 0;
char *blacklistprefix = NULL, *blacklistsuffix = NULL;
int blacklistprefixlen = 0, blacklistsuffixlen = 0;
char *listfile = NULL;

int file = STDOUT_FILENO, nestlevel = 0; /* listfile fd; defaults to stdout for `-' */

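/* per-nesting-level page buffers: pagedata[n] holds the HTML fetched at
   recursion depth n, pagesize[n] its length in bytes */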
char **pagedata;
int *pagesize;

void memerr() {
    
    fprintf(stderr, "wgetls: fatal memory error!\nQuitting, your data up to this point was hopefully saved to `%s'\n", listfile ? listfile : "(stdout)");
    exit(1);
    
}

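/* fetch `url' with libcurl, scan the response for quoted href targets,
   recurse into URLs ending in `/', and write the rest to the listing */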
void readpage(char *url) {
    
    int i;
    char *pos;
    int k;
    char tmp;
    char *link;
    int nonempty = 0;
    char **charptr;
    int *intptr;
    int s;
    int urllen = strlen(url);
    char quotestr[40];
    int follow;
    int linklen;
    
    int foundlink = 0, href = 0, inquote = 0;
    
    if (debuglog) fprintf(stderr, "entering readpage(), nestlevel = %d\n", nestlevel);
    
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    curl_easy_perform(curl_handle);
    
    if (debughttp) printf("-- start HTTP response --\n%s\n-- end HTTP response --\n", pagedata[nestlevel]);
    
    pos = strstr(pagedata[nestlevel], "Parent Directory");
    i = (pos == NULL ? 0 : pos - pagedata[nestlevel]);
    
    while(i < pagesize[nestlevel]) {
        if (debugindex) printf("\n\n>> parser index: %d\n\n", i);
        if (!strncmp(pagedata[nestlevel] + i, "<a", 2)) {
            foundlink = 1;
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found '<a' at offset %d: %s...\n", i, quotestr);
        }
        if (foundlink && !strncmp(pagedata[nestlevel] + i, "href", 4)) {
            href = 1;
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found 'href' attr at offset %d: %s...\n", i, quotestr);
        }
        if (foundlink && href && pagedata[nestlevel][i] == '"') {
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found quote (\") at offset %d: %s...\n", i, quotestr);
            if (!inquote) {
                if (debuglog) fprintf(stderr, "entering quoted string at offset %d\n", i);
                inquote = 1;
                s = i + 1;
            } else if (inquote) {
                if (debuglog) fprintf(stderr, "string end found at offset %d\n", i);
                
                follow = 0;
                
                if (linkprefix != NULL)
                    if (!strncmp(pagedata[nestlevel] + s, linkprefix, linkprefixlen))
                        follow = 1;
                
                /* the URL spans offsets [s, i - 1], so a suffix of length n starts at i - n */
                if (linksuffix != NULL && i - s >= linksuffixlen)
                    if (!strncmp(pagedata[nestlevel] + i - linksuffixlen, linksuffix, linksuffixlen))
                        follow = 1;
                
                if (blacklistprefix != NULL)
                    if (!strncmp(pagedata[nestlevel] + s, blacklistprefix, blacklistprefixlen))
                        follow = 0;
                
                if (blacklistsuffix != NULL && i - s >= blacklistsuffixlen)
                    if (!strncmp(pagedata[nestlevel] + i - blacklistsuffixlen, blacklistsuffix, blacklistsuffixlen))
                        follow = 0;
                
                if (alwaysfollow) follow = 1;
                
                if (follow) {
                    
                    nonempty = 1;
                    
                    tmp = *(pagedata[nestlevel] + i);
                    *(pagedata[nestlevel] + i) = '\0';
                    
                    if (debuglog) fprintf(stderr, "found a valid URL: \"%s\"\n", pagedata[nestlevel] + s);
                    
                    if (*(pagedata[nestlevel] + i - 1) == '/') {
                        
                        if (debuglog) fprintf(stderr, "this is a directory.\n");
                        
                        charptr = realloc(pagedata, (nestlevel + 2) * sizeof(char *));
                        if (!charptr) memerr();
                        pagedata = charptr;
                        
                        intptr = realloc(pagesize, (nestlevel + 2) * sizeof(int));
                        if (!intptr) memerr();
                        pagesize = intptr;
                        
                        pagedata[nestlevel + 1] = calloc(1, 1); /* valid empty string; grown later by realloc() */
                        if (!pagedata[nestlevel + 1]) memerr();
                        pagesize[nestlevel + 1] = 0;
                        
                        link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);
                        if (!link) memerr();
                        
                        if (!noprepend) memcpy(link, url, urllen);
                        memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
                        
                        linklen = strlen(link);
                        
                        if (nestlevel > 0 && linklen >= 3) {
                            if (!strncmp(link + (linklen - 3), "../", 3)) {
                                if (debuglog) fprintf(stderr, "...aaand it's the previous directory. no thanks, backing out.\n");
                                goto cont;
                            }
                        }
                        
                        if (debuglog) fprintf(stderr, "entering directory: '%s'\n", link);
                        
                        if (tree) {
                            fprintf(stderr, "  ");
                            for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
                            fprintf(stderr, "+- %s\n", pagedata[nestlevel] + s);
                        }
                        
                        nestlevel++;
                        
                        readpage(link);
                        
                        nestlevel--;
                        
                        if (debuglog) fprintf(stderr, "returned to nestlevel %d\n", nestlevel);
                        
                        cont:
                        
                        if (debuglog) fprintf(stderr, "leaving directory: '%s'\n", link);
                        
                        free(link);
                        
                    } else {
                        
                        link = calloc((noprepend ? 0 : urllen) + strlen(pagedata[nestlevel] + s) + 1, 1);
                        if (!link) memerr();
                        
                        if (!noprepend) memcpy(link, url, urllen);
                        memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
                        
                        if (debuglog) fprintf(stderr, "discovered file: %s\n", link);
                        
                        if (tree) {
                            fprintf(stderr, "  ");
                            for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
                            fprintf(stderr, "+- %s\n", pagedata[nestlevel] + s);
                        }
                        
                        if (debuglog) fprintf(stderr, "commiting URL to file \"%s\": %s\n", listfile, link);
                        
                        write(file, link, strlen(link));
                        write(file, "\n", 1);
                        
                        if (debuglog) fprintf(stderr, "syncing file \"%s\"\n", listfile);
                        
                        fsync(file);
                        
                        if (debuglog) fprintf(stderr, "sync OK.\n");
                        
                        free(link);
                        
                    }
                    
                    *(pagedata[nestlevel] + i) = tmp;
                    
                }
                
                foundlink = href = inquote = 0;
                
            }
            
        }
        
        i++;
        
    }
    
    
    if (!nonempty && tree) {
        
        fprintf(stderr, "  ");
        for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
        fprintf(stderr, "+- (%s)\n", !nestlevel ? "no URLs" : "empty");
        
    }
    
    if (debuglog) fprintf(stderr, "leaving readpage(), nestlevel %d\n", nestlevel);
    
}

void showsyntax() {
    
    puts("usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]");
    puts("              [-i str] [-j str] [-d lchX]");
    
}

void usage() {
    
    puts("wgetls is a link retrieval tool for HTML pages.\n");
    showsyntax();
    printf(
        "\nyou can specify options anywhere (even together, like `-td'), but `URL' has\n"
        "to come before `listfile', even if option(s) are interspersed between them.\n"
        "\nthe option parser is very lenient, and will consider stuff like `-dt X' valid.\n\n"
    );
    printf(
        "general options:\n"
        "  URL       the URL to look at. cURL supports many types of URLs, but\n"
        "            this program only makes use of normal HTTP URLs.\n"
        "  listfile  the file to output the listing to; may be `-' for stdout.\n"
        "            if you omit this option, a listing will not be created, but so\n"
        "            this program does not execute a run in vain, you can't omit this\n"
        "            option unless you specify -t or -d l instead.\n"
    );
    printf(
        "  -h        ...\n"
        "  -t        print a tree of the listing as it's discovered to stderr.\n"
        "            the tree code isn't perfect, but it does work.\n"
        "  -p        by default, wgetls will prepend the URL you passed to all\n"
        "            the subsequent URLs it finds. this will turn that feature off.\n"
        "  -o        overwrite `listfile' if it exists\n"
        "  -a        append to `listfile' if it exists\n\n"
    );
    printf(
        "matching options (can be combined):\n"
        "by default, wgetls will only follow a URL if it ends with `/'. however...\n"
        "  -f        ...will cause wgetls to follow every link it finds. this is almost\n"
        "               certainly not what you want. this invalidates the 4 other\n"
        "               options below.\n"
    );
    printf(
        "  -s str    ...will cause the URL to only be followed if it starts with `str'\n"
        "  -e str    ...will cause the URL to only be followed if it ends with `str'\n"
        "  -i str    ...will cause the URL to not be followed it starts with `str'.\n"
        "               regardless of what you use here, if a URL ends with \"../\" it\n"
        "               will be skipped.\n"
        "  -j str    ...will cause the URL to not be followed if it ends with `str'\n"
        "\ndebugging params that work with -d (which can, you guessed it, be combined):\n"
    );
    printf(
        "   l       log everything as it happens - absolutely everything. this\n"
        "           includes files/dirs found, parser info, and so on.\n"
        "   c       set cURL's CURLOPT_VERBOSE value on.\n"
        "   h       print HTTP responses recieved.\n"
        "   X       print the value of `i' (the main parser index variable) to stdout\n"
        "           constantly. only use this if the program crashes, so you have a\n"
        "           way of knowing where it died, because this can make a terminal\n"
        "           get very messy..."
    );
    printf(
        "\nwarning: this program is prone to segfaulting - don't worry if it crashes.\n"
        "         just poke about with the various debug options to see what's going\n"
        "         wrong.\n"
        "\nWritten by David Lindsay <dav7@dav7.net> 16-21st Dec 08.\n\n"
        ">> This is public domain software, and you're using version %s. <<\n\n", VERSION
    );
    
    exit(0);
    
}

/* libcurl write callback: append each received chunk to the buffer for the
   current nesting level, keeping it NUL-terminated for the string functions
   the parser uses. note the fourth (userdata) parameter libcurl passes. */
static size_t storedata(void *data, size_t size, size_t nmemb, void *userp) {
    size_t len = size * nmemb;
    char *tmp;
    (void) userp;
    tmp = realloc(pagedata[nestlevel], pagesize[nestlevel] + len + 1);
    if (!tmp) memerr();
    pagedata[nestlevel] = tmp;
    memcpy(pagedata[nestlevel] + pagesize[nestlevel], data, len);
    pagesize[nestlevel] += len;
    pagedata[nestlevel][pagesize[nestlevel]] = '\0';
    return len;
}

void argerror() {
    
    showsyntax();
    puts("try `wgetls --help' for info on what this program does.");
    exit(0);
    
}

void checkcmd(char *arg) {
    
    /* any option string containing `h' (so both -h and --help work) shows usage */
    if (!strstr(arg, "h")) {
        printf("wgetls: unknown option `%s'\n", arg);
        argerror();
    } else {
        usage();
    }
    
}

void neednextarg(int i, int argc, char c) {
    
    if (i == argc - 1) {
        printf("wgetls: option expected for '-%c'; see -h for help.\n", c);
        exit(0);
    }
    
}

int main(int argc, char *argv[]) {
    
    int i, j, k;
    int state = 0;
    char *url = NULL;
    int len, len2;
    struct stat statbuf;
    int skip = 0;
    int filestat;
    
    curl_global_init(CURL_GLOBAL_ALL);
    curl_handle = curl_easy_init();
    
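    /* argument parsing: bare words are taken as URL then listfile, in that
       order; anything starting with `-' may bundle several option letters,
       and `skip' swallows the extra argument consumed by -s/-e/-i/-j/-d */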
    for (i = 1; i < argc; i++) {
        len = strlen(argv[i]);
        if (skip) { skip = 0; continue; }
        if (argv[i][0] == '-' && len > 1) {
            for (j = 1; j < len; j++) {
                switch(argv[i][j]) {
                    case 't':
                        tree = 1;
                        break;
                    case 'p':
                        noprepend = 1;
                        break;
                    case 'o':
                        overwrite = 1;
                        break;
                    case 'a':
                        append = 1;
                        break;
                    case 'f':
                        alwaysfollow = 1;
                        break;
                    case 's':
                        neednextarg(i, argc, 's');
                        linkprefix = argv[i + 1];
                        linkprefixlen = strlen(linkprefix);
                        skip = 1;
                        break;
                    case 'e':
                        neednextarg(i, argc, 'e');
                        linksuffix = argv[i + 1];
                        linksuffixlen = strlen(linksuffix);
                        skip = 1;
                        break;
                    case 'i':
                        neednextarg(i, argc, 'i');
                        blacklistprefix = argv[i + 1];
                        blacklistprefixlen = strlen(blacklistprefix);
                        skip = 1;
                        break;
                    case 'j':
                        neednextarg(i, argc, 'j');
                        blacklistsuffix = argv[i + 1];
                        blacklistsuffixlen = strlen(blacklistsuffix);
                        skip = 1;
                        break;
                    case 'd':
                        neednextarg(i, argc, 'd');
                        len2 = strlen(argv[i + 1]);
                        for (k = 0; k < len2; k++) {
                            switch(argv[i + 1][k]) {
                                case 'l':
                                    debuglog = 1;
                                    break;
                                case 'c':
                                    curl_easy_setopt(curl_handle, CURLOPT_VERBOSE, 1L);
                                    break;
                                case 'h':
                                    debughttp = 1;
                                    break;
                                case 'X':
                                    debugindex = 1;
                                    break;
                                default:
                                    printf("wgetls: -d: unknown option `%c'. see -h for help.\n", argv[i + 1][k]);
                                    exit(0);
                            }
                        }
                        skip = 1;
                        break;
                    default:
                        checkcmd(argv[i] + 1);
                }
            }
        } else {
            switch(state) {
                case 0:
                    url = argv[i];
                    break;
                case 1:
                    listfile = argv[i];
                    break;
                default:
                    puts("wgetls: too many options");
                    argerror();
            }
            state++;
        }
    }
    
    if (append && overwrite) {
        
        puts("wgetls: error: user cannot make up their mind as to whether");
        puts("they want to append to or overwrite an existing file");
        argerror();
        
    }
    
    if (url == NULL) {
        puts("wgetls: missing URL");
        argerror();
    }
    
    if (listfile == NULL) {
        
        printf("wgetls: ");
        printf((debuglog || tree) ? "notice" : "error");
        printf(": listfile not specified\n");
        if (!(debuglog || tree)) {
            printf("wgetls: you cannot omit a listfile without using -t or -d X instead\n");
            exit(1);
        }
        
     } else {
         
        if (strcmp(listfile, "-") != 0) {
            
            filestat = stat(listfile, &statbuf);
            
            if (filestat != -1) {
                
                printf("wgetls: ");
                printf(overwrite ? "warning" : (append ? "notice" : "error"));
                printf(": file `%s' already exists%s\n", listfile, overwrite ? "; overwriting file" : append ? "; appending to file" : "");
                if (!overwrite && !append) exit(0);
                
            }
            
            if (append) {
                
                /* O_CREAT covers the case where the file doesn't exist yet */
                file = open(listfile, O_WRONLY | O_APPEND | O_CREAT, 0600);
                
            } else {
                
                file = creat(listfile, 0600);
                
            }
            
            if (file == -1) {
                
                printf("wgetls: error accessing file `%s': %s\n", listfile, strerror(errno));
                exit(0);
                
            }
            
        }
        
    }
    
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, storedata);
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "wgetls/" VERSION);
    curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
    
    pagesize = calloc(1, sizeof(int));
    pagedata = calloc(1, sizeof(char *));
    if (!pagesize || !pagedata) memerr();
    
    pagedata[nestlevel] = calloc(1, 1); /* valid empty string; storedata() grows it */
    if (!pagedata[nestlevel]) memerr();
    pagesize[nestlevel] = 0;
    
    if (debuglog) fprintf(stderr, "starting page scan of \"%s\"\n", url);
    
    if (tree) fprintf(stderr, "%s\n", url);
    
    readpage(url);
    
    if (debuglog) fprintf(stderr, "scan complete.\n");
    
    close(file);
    
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    
    return 0;
    
}

Last edited by dav7 (2009-01-23 12:37:11)


Windows was made for looking at success from a distance through a wall of oversimplicity. Linux removes the wall, so you can just walk up to success and make it your own.
--
Reinventing the wheel is fun. You get to redefine pi.


#2 2009-01-23 10:46:09

dav7

Re: wgetls: a URL retrieval tool for HTML pages [v3.2]

Bump so everyone knows this is at version 3.1 now.

EDIT: fogobogo helped me discover a typo on line 456; "alredy" is "already" now. tongue

Last edited by dav7 (2009-01-23 10:57:54)



#3 2009-01-23 12:24:59

dav7

Re: wgetls: a URL retrieval tool for HTML pages [v3.2]

Update!

v3.1 has a major bug! Even though it may appear to work properly, strange characters may leak onto the end of URLs, and you may even get segfaults. Sorry I didn't properly test it before releasing it!

Technical details:

I calloc()ed the memory for the link href improperly, per:

link = calloc((noprepend ? 0 : urllen) + i - s, 1);

Once I fixed it with "+ 1", per:

link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);

it worked great.

I may use a method that doesn't use a calloc() in the future; this works for now, and I'm not gonna knock it...
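
For what it's worth, a snprintf()-based variant keeps the "+ 1" in exactly one place. This is just a sketch of the idea, not actual wgetls code (build_link() is a made-up name):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* sketch: the terminator is accounted for once, in `need'. `prefix' may
   be NULL (the -p case); `href' is the quoted URL, already NUL-terminated. */
char *build_link(const char *prefix, const char *href) {
    size_t need = (prefix ? strlen(prefix) : 0) + strlen(href) + 1;
    char *link = malloc(need);
    if (link)
        snprintf(link, need, "%s%s", prefix ? prefix : "", href);
    return link;
}

A nice property: if the size calculation is ever wrong again, snprintf() truncates instead of writing past the buffer, so the bug shows up as a mangled URL rather than a crash.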

Last edited by dav7 (2009-01-23 12:38:51)


