
#1 2008-12-18 03:35:48

dav7
Member
From: Australia
Registered: 2008-02-08
Posts: 674

wgetls: a URL retrieval tool for HTML pages [v3.2]

ChangeLog:
v3.2: Fixed a major memory allocation bug, where I forgot to allocate <amount I needed> + 1 bytes so the required NUL terminator would fit. Also tidied up the rather deep/complex/comprehensive/you-get-the-idea if/then block that controls the white/blacklist system. Also, in this version, -d X actually does something.
v3.1: A few new untested flags (-o, -a, -s, -e, -i, -j, and improved -d); user-definable whitelisting and blacklisting.
v3.0: Not released. Basically 3.1 but without some cleanup/bugfixes.
v2: New flag (-p); fixes a major bug with HTTP responses that arrive split across more than one write-callback chunk.
v1: Initial release.

Disclaimer: This program does not respect /robots.txt, the convention sites use to exclude automated search/scraping engines/systems from indexing them. Use it at your own discretion!

This kind of utility almost certainly already exists, but since I want to learn C, I'm writing as much stuff as I can in C, whenever I can. Sure, writing this in C made it take like 3-4 days to finish, one of which had me up until almost 4am, but it was really fun to make, so worth it IMHO. big_smile

What wgetls does is take any HTTP URL - such as a directory listing index page - and recursively discover every file under it. URLs found in the page that end with / are assumed to be directories and are followed; those that don't end with / are assumed to be files.

In the interest of brevity and cutting mostly useless information, only files are returned; directories are shown only when verbosity is enabled.

To use wgetls, you pass a minimum of two parameters: the first is a URL, and the second is a listfile - either '-' to dump everything to stdout, or a real filename. You can only omit the listfile if you pass one of the options that display the scan results some other way (-t or -d l).

So, as an example, if you pass http://example.com/ (no, that isn't a real listing page), it might return:

http://example.com/somedir/file1
http://example.com/anotherdir/hello
http://example.com/anotherdir/world

even though the structure of example.com might actually have had 16 other dirs in it, all of them empty.
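
The directory-or-file decision is nothing more than a check on the URL's last character. As a minimal sketch of the rule (the helper name here is mine, not something in wgetls):

#include <string.h>

/* hypothetical helper mirroring wgetls' default rule: a trailing `/'
   means directory (follow it), anything else means file (list it) */
static int looks_like_directory(const char *url) {
    size_t len = strlen(url);
    return len > 0 && url[len - 1] == '/';
}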

Here's the output of wgetls --help:

wgetls is a link retrieval tool for HTML pages.

usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]
              [-i str] [-j str] [-d lchX]

you can specify options anywhere (even together, like `-td'), but `URL' has
to come before `listfile', even if option(s) are interspersed between them.

the option parser is very lenient, and will consider stuff like `-dt X' valid.

general options:
  URL       the URL to look at. cURL supports many types of URLs, but
            this program only makes use of normal HTTP URLs.
  listfile  the file to output the listing to; may be `-' for stdout.
            if you omit this option, no listing is created; so the run
            isn't wasted, you must specify -t or -d l instead.
  -h        ...
  -t        print a tree of the listing as it's discovered to stderr.
            the tree code isn't perfect, but it does work.
  -p        by default, wgetls will prepend the URL you passed to all
            the subsequent URLs it finds. this will turn that feature off.
  -o        overwrite `listfile' if it exists
  -a        append to `listfile' if it exists

matching options (can be combined):
by default, wgetls will only follow a URL if it ends with `/'. however...
  -f        ...will cause wgetls to follow every link it finds. this is almost
               certainly not what you want. this invalidates the 4 other
               options below.
  -s str    ...will cause the URL to only be followed if it starts with `str'
  -e str    ...will cause the URL to only be followed if it ends with `str'
  -i str    ...will cause the URL to not be followed if it starts with `str'.
               regardless of what you use here, if a URL ends with "../" it
               will be skipped.
  -j str    ...will cause the URL to not be followed if it ends with `str'

debugging params that work with -d (which can, you guessed it, be combined):
   l       log everything as it happens - absolutely everything. this
           includes files/dirs found, parser info, and so on.
   c       set cURL's CURLOPT_VERBOSE value on.
   h       print HTTP responses received.
   X       print the value of `i' (the main parser index variable) to stdout
           constantly. only use this if the program crashes, so you have a
           way of knowing where it died, because this can make a terminal
           get very messy...

warning: this program is prone to segfaulting - don't worry if it crashes.
         just poke about with the various debug options to see what's going
         wrong.

Written by David Lindsay <dav7@dav7.net> 16-21st Dec 08.

>> This is public domain software, and you're using version 3.2. <<
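
Taken together, the matching options boil down to a single follow/skip predicate. Here's a rough standalone sketch of that decision; the function and parameter names are mine, and the real code inlines this logic in readpage() (note that -e effectively defaults to `/'):

#include <string.h>

/* sketch only: wl_* correspond to -s/-e, bl_* to -i/-j, always to -f */
static int should_follow(const char *url,
                         const char *wl_prefix, const char *wl_suffix,
                         const char *bl_prefix, const char *bl_suffix,
                         int always) {
    size_t len = strlen(url);
    int follow = 0;
    if (wl_prefix && !strncmp(url, wl_prefix, strlen(wl_prefix)))
        follow = 1;
    if (wl_suffix && len >= strlen(wl_suffix)
        && !strcmp(url + len - strlen(wl_suffix), wl_suffix))
        follow = 1;
    if (bl_prefix && !strncmp(url, bl_prefix, strlen(bl_prefix)))
        follow = 0;
    if (bl_suffix && len >= strlen(bl_suffix)
        && !strcmp(url + len - strlen(bl_suffix), bl_suffix))
        follow = 0;
    if (always)
        follow = 1; /* -f overrides both lists */
    if (len >= 3 && !strcmp(url + len - 3, "../"))
        follow = 0; /* "../" is skipped no matter what (below the top level) */
    return follow;
}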

The code

Notes
- You will need libcurl ('curl' package) installed for this to work, since wgetls uses libcurl for HTTP handling.
- Both gcc and tcc (a small, 32-bit-only compiler for x86) successfully compile wgetls.
- Under gcc, compiling with -Wall and -pedantic produces no warnings. Since wgetls uses the non-ANSI fsync() and snprintf() functions, compiling with -ansi will produce two warnings referencing those functions.

Compile with <gcc or tcc> -o wgetls wgetls.c -lcurl.

#include <curl/curl.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define VERSION "3.2"

CURL *curl_handle;

int tree = 0;
int debuglog = 0, debughttp = 0, debugindex = 0;
int noprepend = 0;
int alwaysfollow = 0;
int append = 0, overwrite = 0;
char *linkprefix = NULL, *linksuffix = "/";
int linkprefixlen = 0, linksuffixlen = 0;
char *blacklistprefix = NULL, *blacklistsuffix = NULL;
int blacklistprefixlen = 0, blacklistsuffixlen = 0;
char *listfile = NULL;

int file = STDOUT_FILENO, nestlevel = 0; /* listfile fd; defaults to stdout for `-' */

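/* per-nesting-level page buffers: pagedata[n] holds the HTML fetched at
   recursion depth n, pagesize[n] its length in bytes */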
char **pagedata;
int *pagesize;

void memerr() {
    
    fprintf(stderr, "wgetls: fatal memory error!\nQuitting, your data up to this point was hopefully saved to `%s'\n", listfile ? listfile : "(stdout)");
    exit(1);
    
}

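/* fetch `url' with libcurl, scan the response for quoted href targets,
   recurse into URLs ending in `/', and write the rest to the listing */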
void readpage(char *url) {
    
    int i;
    char *pos;
    int k;
    char tmp;
    char *link;
    int nonempty = 0;
    char **charptr;
    int *intptr;
    int s;
    int urllen = strlen(url);
    char quotestr[40];
    int follow;
    int linklen;
    
    int foundlink = 0, href = 0, inquote = 0;
    
    if (debuglog) fprintf(stderr, "entering readpage(), nestlevel = %d\n", nestlevel);
    
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    curl_easy_perform(curl_handle);
    
    if (debughttp) printf("-- start HTTP response --\n%s\n-- end HTTP response --\n", pagedata[nestlevel]);
    
    pos = strstr(pagedata[nestlevel], "Parent Directory");
    i = (pos == NULL ? 0 : pos - pagedata[nestlevel]);
    
    while(i < pagesize[nestlevel]) {
        if (debugindex) printf("\n\n>> parser index: %d\n\n", i);
        if (!strncmp(pagedata[nestlevel] + i, "<a", 2)) {
            foundlink = 1;
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found '<a' at offset %d: %s...\n", i, quotestr);
        }
        if (foundlink && !strncmp(pagedata[nestlevel] + i, "href", 4)) {
            href = 1;
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found 'href' attr at offset %d: %s...\n", i, quotestr);
        }
        if (foundlink && href && pagedata[nestlevel][i] == '"') {
            snprintf(quotestr, 40, "%s", pagedata[nestlevel] + i);
            if (debuglog) fprintf(stderr, "found quote (\") at offset %d: %s...\n", i, quotestr);
            if (!inquote) {
                if (debuglog) fprintf(stderr, "entering quoted string at offset %d\n", i);
                inquote = 1;
                s = i + 1;
            } else if (inquote) {
                if (debuglog) fprintf(stderr, "string end found at offset %d\n", i);
                
                follow = 0;
                
                if (linkprefix != NULL)
                    if (!strncmp(pagedata[nestlevel] + s, linkprefix, linkprefixlen))
                        follow = 1;
                
                /* the URL spans offsets [s, i - 1], so a suffix of length n starts at i - n */
                if (linksuffix != NULL && i - s >= linksuffixlen)
                    if (!strncmp(pagedata[nestlevel] + i - linksuffixlen, linksuffix, linksuffixlen))
                        follow = 1;
                
                if (blacklistprefix != NULL)
                    if (!strncmp(pagedata[nestlevel] + s, blacklistprefix, blacklistprefixlen))
                        follow = 0;
                
                if (blacklistsuffix != NULL && i - s >= blacklistsuffixlen)
                    if (!strncmp(pagedata[nestlevel] + i - blacklistsuffixlen, blacklistsuffix, blacklistsuffixlen))
                        follow = 0;
                
                if (alwaysfollow) follow = 1;
                
                if (follow) {
                    
                    nonempty = 1;
                    
                    tmp = *(pagedata[nestlevel] + i);
                    *(pagedata[nestlevel] + i) = '\0';
                    
                    if (debuglog) fprintf(stderr, "found a valid URL: \"%s\"\n", pagedata[nestlevel] + s);
                    
                    if (*(pagedata[nestlevel] + i - 1) == '/') {
                        
                        if (debuglog) fprintf(stderr, "this is a directory.\n");
                        
                        charptr = realloc(pagedata, (nestlevel + 2) * sizeof(char *));
                        if (!charptr) memerr();
                        pagedata = charptr;
                        
                        intptr = realloc(pagesize, (nestlevel + 2) * sizeof(int));
                        if (!intptr) memerr();
                        pagesize = intptr;
                        
                        pagedata[nestlevel + 1] = calloc(1, 1); /* valid empty string; grown later by realloc() */
                        if (!pagedata[nestlevel + 1]) memerr();
                        pagesize[nestlevel + 1] = 0;
                        
                        link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);
                        if (!link) memerr();
                        
                        if (!noprepend) memcpy(link, url, urllen);
                        memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
                        
                        linklen = strlen(link);
                        
                        if (nestlevel > 0 && linklen >= 3) {
                            if (!strncmp(link + (linklen - 3), "../", 3)) {
                                if (debuglog) fprintf(stderr, "...aaand it's the previous directory. no thanks, backing out.\n");
                                goto cont;
                            }
                        }
                        
                        if (debuglog) fprintf(stderr, "entering directory: '%s'\n", link);
                        
                        if (tree) {
                            fprintf(stderr, "  ");
                            for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
                            fprintf(stderr, "+- %s\n", pagedata[nestlevel] + s);
                        }
                        
                        nestlevel++;
                        
                        readpage(link);
                        
                        nestlevel--;
                        
                        if (debuglog) fprintf(stderr, "returned to nestlevel %d\n", nestlevel);
                        
                        cont:
                        
                        if (debuglog) fprintf(stderr, "leaving directory: '%s'\n", link);
                        
                        free(link);
                        
                    } else {
                        
                        link = calloc((noprepend ? 0 : urllen) + strlen(pagedata[nestlevel] + s) + 1, 1);
                        if (!link) memerr();
                        
                        if (!noprepend) memcpy(link, url, urllen);
                        memcpy(link + (noprepend ? 0 : urllen), pagedata[nestlevel] + s, strlen(pagedata[nestlevel] + s));
                        
                        if (debuglog) fprintf(stderr, "discovered file: %s\n", link);
                        
                        if (tree) {
                            fprintf(stderr, "  ");
                            for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
                            fprintf(stderr, "+- %s\n", pagedata[nestlevel] + s);
                        }
                        
                        if (debuglog) fprintf(stderr, "commiting URL to file \"%s\": %s\n", listfile, link);
                        
                        write(file, link, strlen(link));
                        write(file, "\n", 1);
                        
                        if (debuglog) fprintf(stderr, "syncing file \"%s\"\n", listfile);
                        
                        fsync(file);
                        
                        if (debuglog) fprintf(stderr, "sync OK.\n");
                        
                        free(link);
                        
                    }
                    
                    *(pagedata[nestlevel] + i) = tmp;
                    
                }
                
                foundlink = href = inquote = 0;
                
            }
            
        }
        
        i++;
        
    }
    
    
    if (!nonempty && tree) {
        
        fprintf(stderr, "  ");
        for (k = 0; k < nestlevel; k++) fprintf(stderr, "   ");
        fprintf(stderr, "+- (%s)\n", !nestlevel ? "no URLs" : "empty");
        
    }
    
    if (debuglog) fprintf(stderr, "leaving readpage(), nestlevel %d\n", nestlevel);
    
}

void showsyntax() {
    
    puts("usage: wgetls <URL> [listfile] [-h] [-t] [-p] [-o] [-f] [-s str] [-e str]");
    puts("              [-i str] [-j str] [-d lchX]");
    
}

void usage() {
    
    puts("wgetls is a link retrieval tool for HTML pages.\n");
    showsyntax();
    printf(
        "\nyou can specify options anywhere (even together, like `-td'), but `URL' has\n"
        "to come before `listfile', even if option(s) are interspersed between them.\n"
        "\nthe option parser is very lenient, and will consider stuff like `-dt X' valid.\n\n"
    );
    printf(
        "general options:\n"
        "  URL       the URL to look at. cURL supports many types of URLs, but\n"
        "            this program only makes use of normal HTTP URLs.\n"
        "  listfile  the file to output the listing to; may be `-' for stdout.\n"
        "            if you omit this option, a listing will not be created, but so\n"
        "            this program does not execute a run in vain, you can't omit this\n"
        "            option unless you specify -t or -d l instead.\n"
    );
    printf(
        "  -h        ...\n"
        "  -t        print a tree of the listing as it's discovered to stderr.\n"
        "            the tree code isn't perfect, but it does work.\n"
        "  -p        by default, wgetls will prepend the URL you passed to all\n"
        "            the subsequent URLs it finds. this will turn that feature off.\n"
        "  -o        overwrite `listfile' if it exists\n"
        "  -a        append to `listfile' if it exists\n\n"
    );
    printf(
        "matching options (can be combined):\n"
        "by default, wgetls will only follow a URL if it ends with `/'. however...\n"
        "  -f        ...will cause wgetls to follow every link it finds. this is almost\n"
        "               certainly not what you want. this invalidates the 4 other\n"
        "               options below.\n"
    );
    printf(
        "  -s str    ...will cause the URL to only be followed if it starts with `str'\n"
        "  -e str    ...will cause the URL to only be followed if it ends with `str'\n"
        "  -i str    ...will cause the URL to not be followed it starts with `str'.\n"
        "               regardless of what you use here, if a URL ends with \"../\" it\n"
        "               will be skipped.\n"
        "  -j str    ...will cause the URL to not be followed if it ends with `str'\n"
        "\ndebugging params that work with -d (which can, you guessed it, be combined):\n"
    );
    printf(
        "   l       log everything as it happens - absolutely everything. this\n"
        "           includes files/dirs found, parser info, and so on.\n"
        "   c       set cURL's CURLOPT_VERBOSE value on.\n"
        "   h       print HTTP responses recieved.\n"
        "   X       print the value of `i' (the main parser index variable) to stdout\n"
        "           constantly. only use this if the program crashes, so you have a\n"
        "           way of knowing where it died, because this can make a terminal\n"
        "           get very messy..."
    );
    printf(
        "\nwarning: this program is prone to segfaulting - don't worry if it crashes.\n"
        "         just poke about with the various debug options to see what's going\n"
        "         wrong.\n"
        "\nWritten by David Lindsay <dav7@dav7.net> 16-21st Dec 08.\n\n"
        ">> This is public domain software, and you're using version %s. <<\n\n", VERSION
    );
    
    exit(0);
    
}

/* libcurl write callback: append each received chunk to the buffer for the
   current nesting level, keeping it NUL-terminated for the string functions
   the parser uses. note the fourth (userdata) parameter libcurl passes. */
static size_t storedata(void *data, size_t size, size_t nmemb, void *userp) {
    size_t len = size * nmemb;
    char *tmp;
    (void) userp;
    tmp = realloc(pagedata[nestlevel], pagesize[nestlevel] + len + 1);
    if (!tmp) memerr();
    pagedata[nestlevel] = tmp;
    memcpy(pagedata[nestlevel] + pagesize[nestlevel], data, len);
    pagesize[nestlevel] += len;
    pagedata[nestlevel][pagesize[nestlevel]] = '\0';
    return len;
}

void argerror() {
    
    showsyntax();
    puts("try `wgetls --help' for info on what this program does.");
    exit(0);
    
}

void checkcmd(char *arg) {
    
    /* any option string containing `h' (so both -h and --help work) shows usage */
    if (!strstr(arg, "h")) {
        printf("wgetls: unknown option `%s'\n", arg);
        argerror();
    } else {
        usage();
    }
    
}

void neednextarg(int i, int argc, char c) {
    
    if (i == argc - 1) {
        printf("wgetls: option expected for '-%c'; see -h for help.\n", c);
        exit(0);
    }
    
}

int main(int argc, char *argv[]) {
    
    int i, j, k;
    int state = 0;
    char *url = NULL;
    int len, len2;
    struct stat statbuf;
    int skip = 0;
    int filestat;
    
    curl_global_init(CURL_GLOBAL_ALL);
    curl_handle = curl_easy_init();
    
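    /* argument parsing: bare words are taken as URL then listfile, in that
       order; anything starting with `-' may bundle several option letters,
       and `skip' swallows the extra argument consumed by -s/-e/-i/-j/-d */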
    for (i = 1; i < argc; i++) {
        len = strlen(argv[i]);
        if (skip) { skip = 0; continue; }
        if (argv[i][0] == '-' && len > 1) {
            for (j = 1; j < len; j++) {
                switch(argv[i][j]) {
                    case 't':
                        tree = 1;
                        break;
                    case 'p':
                        noprepend = 1;
                        break;
                    case 'o':
                        overwrite = 1;
                        break;
                    case 'a':
                        append = 1;
                        break;
                    case 'f':
                        alwaysfollow = 1;
                        break;
                    case 's':
                        neednextarg(i, argc, 's');
                        linkprefix = argv[i + 1];
                        linkprefixlen = strlen(linkprefix);
                        skip = 1;
                        break;
                    case 'e':
                        neednextarg(i, argc, 'e');
                        linksuffix = argv[i + 1];
                        linksuffixlen = strlen(linksuffix);
                        skip = 1;
                        break;
                    case 'i':
                        neednextarg(i, argc, 'i');
                        blacklistprefix = argv[i + 1];
                        blacklistprefixlen = strlen(blacklistprefix);
                        skip = 1;
                        break;
                    case 'j':
                        neednextarg(i, argc, 'j');
                        blacklistsuffix = argv[i + 1];
                        blacklistsuffixlen = strlen(blacklistsuffix);
                        skip = 1;
                        break;
                    case 'd':
                        neednextarg(i, argc, 'd');
                        len2 = strlen(argv[i + 1]);
                        for (k = 0; k < len2; k++) {
                            switch(argv[i + 1][k]) {
                                case 'l':
                                    debuglog = 1;
                                    break;
                                case 'c':
                                    curl_easy_setopt(curl_handle, CURLOPT_VERBOSE, 1L);
                                    break;
                                case 'h':
                                    debughttp = 1;
                                    break;
                                case 'X':
                                    debugindex = 1;
                                    break;
                                default:
                                    printf("wgetls: -d: unknown option `%c'. see -h for help.\n", argv[i + 1][k]);
                                    exit(0);
                            }
                        }
                        skip = 1;
                        break;
                    default:
                        checkcmd(argv[i] + 1);
                }
            }
        } else {
            switch(state) {
                case 0:
                    url = argv[i];
                    break;
                case 1:
                    listfile = argv[i];
                    break;
                default:
                    puts("wgetls: too many options");
                    argerror();
            }
            state++;
        }
    }
    
    if (append && overwrite) {
        
        puts("wgetls: error: user cannot make up their mind as to whether");
        puts("they want to append to or overwrite an existing file");
        argerror();
        
    }
    
    if (url == NULL) {
        puts("wgetls: missing URL");
        argerror();
    }
    
    if (listfile == NULL) {
        
        printf("wgetls: ");
        printf((debuglog || tree) ? "notice" : "error");
        printf(": listfile not specified\n");
        if (!(debuglog || tree)) {
            printf("wgetls: you cannot omit a listfile without using -t or -d X instead\n");
            exit(1);
        }
        
     } else {
         
        if (strcmp(listfile, "-") != 0) {
            
            filestat = stat(listfile, &statbuf);
            
            if (filestat != -1) {
                
                printf("wgetls: ");
                printf(overwrite ? "warning" : (append ? "notice" : "error"));
                printf(": file `%s' already exists%s\n", listfile, overwrite ? "; overwriting file" : append ? "; appending to file" : "");
                if (!overwrite && !append) exit(0);
                
            }
            
            if (append) {
                
                /* O_CREAT covers the case where the file doesn't exist yet */
                file = open(listfile, O_WRONLY | O_APPEND | O_CREAT, 0600);
                
            } else {
                
                file = creat(listfile, 0600);
                
            }
            
            if (file == -1) {
                
                printf("wgetls: error accessing file `%s': %s\n", listfile, strerror(errno));
                exit(0);
                
            }
            
        }
        
    }
    
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, storedata);
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "wgetls/" VERSION);
    curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
    
    pagesize = calloc(1, sizeof(int));
    pagedata = calloc(1, sizeof(char *));
    if (!pagesize || !pagedata) memerr();
    
    pagedata[nestlevel] = calloc(1, 1); /* valid empty string; storedata() grows it */
    if (!pagedata[nestlevel]) memerr();
    pagesize[nestlevel] = 0;
    
    if (debuglog) fprintf(stderr, "starting page scan of \"%s\"\n", url);
    
    if (tree) fprintf(stderr, "%s\n", url);
    
    readpage(url);
    
    if (debuglog) fprintf(stderr, "scan complete.\n");
    
    close(file);
    
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    
    return 0;
    
}

Last edited by dav7 (2009-01-23 12:37:11)


Windows was made for looking at success from a distance through a wall of oversimplicity. Linux removes the wall, so you can just walk up to success and make it your own.
--
Reinventing the wheel is fun. You get to redefine pi.


#2 2009-01-23 10:46:09

dav7

Re: wgetls: a URL retrieval tool for HTML pages [v3.2]

Bump so everyone knows this is at version 3.1 now.

EDIT: fogobogo helped me discover a typo on line 456; "alredy" is "already" now. tongue

Last edited by dav7 (2009-01-23 10:57:54)



#3 2009-01-23 12:24:59

dav7

Re: wgetls: a URL retrieval tool for HTML pages [v3.2]

Update!

v3.1 has a major bug! Even though it may appear to work properly, strange characters may leak onto the end of URLs, and you may even get segfaults. Sorry I didn't properly test it before releasing it!

Technical details:

I calloc()ed the memory for the link href improperly, per:

link = calloc((noprepend ? 0 : urllen) + i - s, 1);

Once I fixed it with "+ 1", per:

link = calloc((noprepend ? 0 : urllen) + i - s + 1, 1);

it worked great.

I may use a method that doesn't use a calloc() in the future; this works for now, and I'm not gonna knock it...
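
For what it's worth, a snprintf()-based variant keeps the "+ 1" in exactly one place. This is just a sketch of the idea, not actual wgetls code (build_link() is a made-up name):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* sketch: the terminator is accounted for once, in `need'. `prefix' may
   be NULL (the -p case); `href' is the quoted URL, already NUL-terminated. */
char *build_link(const char *prefix, const char *href) {
    size_t need = (prefix ? strlen(prefix) : 0) + strlen(href) + 1;
    char *link = malloc(need);
    if (link)
        snprintf(link, need, "%s%s", prefix ? prefix : "", href);
    return link;
}

A nice property: if the size calculation is ever wrong again, snprintf() truncates instead of writing past the buffer, so the bug shows up as a mangled URL rather than a crash.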

Last edited by dav7 (2009-01-23 12:38:51)


