Download web pages recursively under a URL [1]

 wget \
 --recursive \
 --no-clobber \
 --page-requisites \
 --html-extension \
 --convert-links \
 --restrict-file-names=windows \
 --domains example.com \
 -nH --cut-dirs=1 \
 -e robots=off \
 --random-wait \
 --wait 5 \
 --no-parent \
     www.example.com/subdirectory/
  • Replace example.com and www.example.com/subdirectory/ with the domain and URL you want to download.
  • --recursive: follow links and download the pages they point to (up to five levels deep by default).
  • --domains example.com: don't follow links outside example.com.
  • --no-parent: never ascend to the parent directory, so nothing outside subdirectory/ is downloaded.
  • --page-requisites: get all the elements that compose the page (images, CSS and so on).
  • --html-extension: save HTML files with the .html extension if they lack one (newer versions of wget call this option --adjust-extension).
  • --convert-links: convert links so that they work locally, off-line.
  • --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
  • --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
  • -e robots=off: force crawling regardless of robots.txt setting.
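  • -nH --cut-dirs=N: -nH drops the hostname directory, and --cut-dirs drops the first N directory components from the saved paths. N must be a number (1 for the single subdirectory level here).
  • --random-wait: randomizes the time between requests to between 0.5 and 1.5 times the value given by the --wait option (2.5 to 7.5 seconds with --wait 5).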
  • --wait 5: number of seconds to wait between requests. (See --random-wait.)
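
Since --cut-dirs expects a number, it can help to derive that number from the URL instead of counting by hand. The following is only a sketch: count_dirs is a hypothetical helper, not part of wget, and it assumes the URL ends in a directory rather than a file name. It prints the resulting options as a dry run and does not start a download.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): count the directory components in
# a URL path, giving the number to pass to --cut-dirs.
# Assumption: the URL ends in a directory, not a file name.
count_dirs() {
    _rest=${1#*://}                  # drop the scheme, if any
    case $_rest in
        */*) _path=${_rest#*/} ;;    # drop the hostname
        *)   _path= ;;               # no path at all
    esac
    _n=0
    _old_ifs=$IFS
    IFS=/
    set -f                           # no glob expansion while splitting
    for _part in $_path; do
        if [ -n "$_part" ]; then
            _n=$((_n + 1))
        fi
    done
    set +f
    IFS=$_old_ifs
    echo "$_n"
}

url=www.example.com/subdirectory/
# Dry run: print the options instead of starting a download.
echo "wget -nH --cut-dirs=$(count_dirs "$url") --no-parent $url"
```

For the example URL this prints wget -nH --cut-dirs=1 --no-parent www.example.com/subdirectory/, so files land directly in the current directory rather than under www.example.com/subdirectory/.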

References


  1. linuxjournal.com. Downloading an Entire Web Site with wget. 2008. https://www.linuxjournal.com/content/downloading-entire-web-site-wget