Easily Download a File Listing with wget
We often run into situations where we need to batch-download every file in a directory listing, for example to build a local mirror or to grab a whole collection of resources. Our first instinct might be to reach for a third-party tool such as Thunder (Xunlei), but wget, which ships with most Linux distributions and is easy to install on macOS, is already powerful enough to get the job done.
A single command is all it takes, simple but effective:
wget --execute="robots = off" --mirror --convert-links --no-parent --tries=5 --wait=5 --page-requisites --limit-rate=300k https://example.com/file
Here is a detailed breakdown of that command:
--execute
This option works the same way as a .wgetrc setting. Here we pass robots = off so that wget ignores the robots exclusion standard: it neither downloads robots.txt (a file we usually have no use for) nor lets it restrict the crawl.
The full list of .wgetrc commands is documented at https://www.gnu.org/software/wget/manual/html_node/Wgetrc-Commands.html
What if you need to pass several commands through --execute? According to the wget manual, you simply specify --execute multiple times.
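As a quick sketch, the first line below repeats --execute to pass two wgetrc commands (timeout = 10 is just an extra illustrative setting, not part of the original command), and the lines after it show the equivalent settings placed in ~/.wgetrc instead:
wget --execute="robots = off" --execute="timeout = 10" https://example.com/file
# or, persistently, in ~/.wgetrc:
robots = off
timeout = 10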
--mirror
This option is the heart of the command: it recursively downloads every resource the links point to. Because its usual purpose is building a mirror site, the option is named mirror.
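For reference, the wget manual describes --mirror as shorthand for a set of recursion and time-stamping options, so the two sketches below (reusing the placeholder URL from above) should currently behave the same:
wget --mirror https://example.com/file
wget -r -N -l inf --no-remove-listing https://example.com/file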
--convert-links
This option rewrites every URL that points to an online resource into a local one, so that browsing the copy offline never reaches beyond the downloaded files. The conversion runs after all downloads have finished, turning the links into relative links (provided the target file was actually downloaded).
--no-parent
This option keeps wget from downloading resources above the given URL.
Directory listings usually contain . and .. hyperlinks that let us navigate between directories, but for batch downloading we do not want wget to index anything outside the target directory; in other words, it must not ascend into the parent directory.
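As a rough sketch, suppose the files sit under a hypothetical listing at https://example.com/files/2024/ (a made-up path for illustration). With --no-parent, wget stays inside that directory instead of following the .. link back up to https://example.com/files/:
wget -r --no-parent https://example.com/files/2024/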
--tries
This option lets wget retry automatically when the origin server is unreachable or congested; its value is the maximum number of attempts per file.
--wait
This option makes wget sleep the given number of seconds between downloads, which eases the load on the origin server and lowers the chance of being banned when fetching a large number of files.
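Both of these options also exist as .wgetrc commands (tries and wait), so a sketch of the same politeness settings made permanent in ~/.wgetrc might look like this:
tries = 5
wait = 5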
--page-requisites
This option downloads all the related static assets a page needs in order to display properly (images, audio, CSS, and so on), which matters especially during recursive downloads.
It plays more of a supporting role here; for a concrete example, see the official Wget documentation:
‘-p’
‘--page-requisites’
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using ‘-r’ together with ‘-l’ can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with “leaf documents” that are missing their requisites.
For instance, say document 1.html contains an <IMG> tag referencing 1.gif and an <A> tag pointing to external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to 3.html. Say this continues up to some arbitrarily high number.
If one executes the command:
wget -r -l 2 http://site/1.html
then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be downloaded. As you can see, 3.html is without its requisite 3.gif because Wget is simply counting the number of hops (up to 2) away from 1.html in order to determine where to stop the recursion. However, with this command:
wget -r -l 2 -p http://site/1.html
all the above files and 3.html’s requisite 3.gif will be downloaded. Similarly,
wget -r -l 1 -p http://site/1.html
will cause 1.html, 1.gif, 2.html, and 2.gif to be downloaded. One might think that:
wget -r -l 0 -p http://site/1.html
would download just 1.html and 1.gif, but unfortunately this is not the case, because ‘-l 0’ is equivalent to ‘-l inf’—that is, infinite recursion. To download a single HTML page (or a handful of them, all specified on the command-line or in a ‘-i’ URL input file) and its (or their) requisites, simply leave off ‘-r’ and ‘-l’:
wget -p http://site/1.html
Note that Wget will behave as if ‘-r’ had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:
wget -E -H -k -K -p http://site/document
To finish off this topic, it’s worth knowing that Wget’s idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.
--limit-rate
This option throttles the download speed. Its value can take either of the following forms:
– xxxk
– xxxm
Note: the unit suffixes are lowercase, and the actual rate is measured in kilobytes/megabytes per second (KB/s | MB/s), not kilobits/megabits per second (Kb/s | Mb/s).
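For example, either of the sketches below (reusing the placeholder URL from earlier) caps the transfer rate:
wget --limit-rate=300k https://example.com/file   # roughly 300 KB/s
wget --limit-rate=2m https://example.com/file     # roughly 2 MB/s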