WEBGRAB(1)                                             WEBGRAB(1)

     NAME
          webgrab - fetch web page content as files

     SYNOPSIS
          webgrab [ -r ] [ -v ] [ -o stem ] [ -p body ] url

     DESCRIPTION
          Webgrab connects to the web server named in the url,
          fetches the page that the url identifies, and stores it
          locally in a file.  If the page is written in HTML, webgrab
          reads it to build a list of subcomponent pages (eg, frames)
          and images.  It fetches those, saving the content in
          separate files.  It adds a comment to the end of each HTML
          file giving the time and the file's origin.  It
          automatically follows redirections offered by the server.
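
          The scan for subcomponents described above can be
          illustrated with a short Python sketch.  It is not
          webgrab's own code (which is written in Limbo); the names
          SubcomponentLister and list_subcomponents are hypothetical,
          and the choice of the frame and img tags with their src
          attribute is an assumption about what counts as a
          subcomponent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubcomponentLister(HTMLParser):
    """Collect the URLs of frames and images referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Assumption: only <frame src=...> and <img src=...> are followed.
        if tag in ("frame", "img"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page's own URL.
                self.urls.append(urljoin(self.base_url, src))


def list_subcomponents(html, base_url):
    """Return absolute URLs of subcomponents, in document order."""
    parser = SubcomponentLister(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

          Each URL in the returned list would then be fetched and
          saved in its own file.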

          The stem of the names of the output files is normally
          derived from a component of the url. If the url contains a
          path name, the stem is the final component of that path,
          less any dot-separated suffix and prefix.  For example, given

               http://www.vitanuova.com/inferno/old.index.html

          the stem would be index.  If there is no path name, but the
          url contains a domain name, the stem is the penultimate
          component of the domain name (eg, excluding a trailing .com
          and an initial www).  For example, given

               www.innerhost.vitanuova.com

          the stem would be vitanuova.  If all else fails, webgrab
          uses the stem webgrab.
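
          The rules above can be sketched in Python as follows.  This
          is an illustration only, not webgrab's implementation; the
          function name derive_stem is hypothetical.  It assumes that
          a url given without a scheme is an http url, that a path
          component with several dots keeps only the piece just
          before the suffix, and that a host name with a single
          component counts as `all else fails'.

```python
from urllib.parse import urlsplit


def derive_stem(url):
    """Derive an output file stem from a URL (illustrative sketch)."""
    if "://" not in url:
        url = "http://" + url  # assumption: a bare name means http
    parts = urlsplit(url)

    # Rule 1: last path component, less any suffix and prefix.
    segments = [s for s in parts.path.split("/") if s]
    if segments:
        pieces = segments[-1].split(".")
        if len(pieces) > 1:
            pieces = pieces[:-1]  # drop the dot-separated suffix
        stem = pieces[-1]         # drop any dot-separated prefix
        if stem:
            return stem

    # Rule 2: penultimate component of the domain name.
    labels = [label for label in (parts.hostname or "").split(".") if label]
    if len(labels) >= 2:
        return labels[-2]

    # Rule 3: fall back to the fixed stem.
    return "webgrab"
```

          With the two examples from the text, the sketch yields
          index and vitanuova respectively.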

          Given a stem, the initial page is stored in stem.suffix
          where suffix is the suffix (eg, .html) of the name of the
          original page.  Subordinate pages are saved in a similar way
          in files named stem_1.suffix1, stem_2.suffix2, ... .
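
          The naming scheme can be illustrated with a small Python
          sketch.  The function name output_names is hypothetical,
          and the sketch assumes that subordinate pages are numbered
          from 1 in the order they are found and that a name with no
          suffix yields a file name with none.

```python
import posixpath


def output_names(stem, page_name, subordinate_names):
    """Return the output file names for a page and its subordinates."""
    # The initial page keeps the suffix of the original page's name.
    names = [stem + posixpath.splitext(page_name)[1]]
    # Subordinates get stem_1, stem_2, ... each with its own suffix.
    for number, name in enumerate(subordinate_names, start=1):
        names.append(f"{stem}_{number}{posixpath.splitext(name)[1]}")
    return names
```

          For the stem index, a page old.index.html with subordinates
          logo.gif and menu.html would be stored as index.html,
          index_1.gif and index_2.html.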

          The options are:

          -r   do not fetch subcomponents (just the `raw' source of
               url itself)

          -v   print a progress report

          -vv  print a chatty progress report

          -o stem
               use the stem as given

          -p body
               use HTTP POST instead of GET, posting body as the data
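
          The difference between the default GET and the POST
          selected by -p can be illustrated with Python's standard
          urllib.request module.  This is a sketch of the HTTP
          behaviour, not of webgrab itself; build_request is a
          hypothetical helper, and encoding the body as UTF-8 is an
          assumption.

```python
import urllib.request


def build_request(url, body=None):
    """Build a GET request, or a POST request when a body is given."""
    # Assumption: the body text is sent as UTF-8 bytes.
    data = body.encode("utf-8") if body is not None else None
    # urllib issues POST when data is present, and GET otherwise.
    return urllib.request.Request(url, data=data)
```

          Passing the result to urllib.request.urlopen would perform
          the transfer; the sketch only constructs the request.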

          Webgrab reads the configuration file /services/webget/config
          (if it exists), to look for the address of an optional HTTP
          proxy (in the `httpproxy' entry), and a list of domains for
          which a proxy should not be used (in the `noproxy' or
          `noproxydoms' entry). If symbolic network and service names
          might be involved, the connection server lib/cs must
          already be running.
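
          The decision whether to use the proxy for a given host
          might be sketched in Python as below.  The format of the
          configuration file is not described here, so the sketch
          takes the proxy address and the domain list as already
          parsed values.  The function name use_proxy is
          hypothetical, and matching a host against a listed domain
          by exact name or by trailing components is an assumption.

```python
def use_proxy(host, proxy, noproxy_domains):
    """Decide whether a request to host should go through the proxy."""
    if not proxy:
        return False  # no proxy configured: always connect directly
    host = host.lower()
    for domain in noproxy_domains:
        domain = domain.lower().lstrip(".")
        # Assumption: a listed domain covers itself and its subdomains.
        if domain and (host == domain or host.endswith("." + domain)):
            return False
    return True
```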

     FILES
          /services/webget/config

     SOURCE
          /appl/cmd/webgrab.b

     BUGS
          It should read the proxy name from the charon(1)
          configuration file and not the webget configuration file.
          It cannot do `secure' transfers (https).
          Its HTML parsing is naive, but on the other hand, it is less
          likely to trip over HTML novelties.

     SEE ALSO
          cs(8)

     Page 2                       Plan 9             (printed 4/16/24)