Download web pages recursively under a URL
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.com \
-nH --cut-dirs=1 \
-e robots=off \
--random-wait \
--wait 5 \
--no-parent \
www.example.com/subdirectory/
- Substitute example.com and www.example.com/subdirectory/ with the domain and starting URL for your own case; a filled-in example follows below.
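As a minimal sketch with hypothetical values (a made-up docs.example.org host, its /manual/ subtree, one directory level cut, and -P added to keep the mirror in its own directory), the filled-in command might look like this; adjust the domain, URL, and --cut-dirs count to your site:

wget \
  --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows \
  --domains docs.example.org -nH --cut-dirs=1 \
  -e robots=off --random-wait --wait 5 --no-parent \
  -P manual-mirror \
  https://docs.example.org/manual/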
- --recursive: download the site recursively, following links.
- --domains example.com: don't follow links outside example.com.
- --no-parent: don't ascend above the starting directory, so only pages under subdirectory/ are followed.
- --page-requisites: get all the elements needed to display each page (images, CSS, and so on).
- --html-extension: save files with the .html extension (newer wget releases call this option --adjust-extension).
- --convert-links: convert links so that they work locally, offline.
- --restrict-file-names=windows: modify filenames so that they also work on Windows.
- --no-clobber: don't overwrite existing files (used when an interrupted download is resumed).
- -e robots=off: crawl regardless of robots.txt rules.
- -nH --cut-dirs=1: -nH drops the hostname directory and --cut-dirs=NUMBER strips that many leading directory components from the saved paths (here the single subdirectory level); see the layout sketch after this list.
- --random-wait: randomize the time between requests to between 0.5 and 1.5 times the value given with --wait (here roughly 2.5 to 7.5 seconds).
- --wait 5: wait 5 seconds between requests (see --random-wait).
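For the -nH and --cut-dirs pair, here is a rough sketch of where a file would be saved, assuming the hypothetical page www.example.com/subdirectory/page.html:

  neither option      ->  www.example.com/subdirectory/page.html
  -nH                 ->  subdirectory/page.html
  -nH --cut-dirs=1    ->  page.html

With a deeper starting path such as /subdirectory/nested/, --cut-dirs=2 would strip the nested/ component as well.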