When crawling a page, it fails with a "forbidden" error (code 403)
As well as when you try to access a resource you simply don't
have permission for, web servers can return a 403 (Forbidden)
status for other reasons. This article describes two such cases.
Incorrect handling of HEAD requests
This first case can occur when you try to crawl a website which
doesn't allow the HEAD method, but returns 403 rather than
405 (Method Not Allowed).
WebCopy 1.8 and above will try to mitigate this by testing whether
HEAD fails and automatically disabling header checking for
affected domains.
While the HTTP protocol defines a number of methods, WebCopy
makes use of only three of these.
By default, WebCopy issues a HEAD request before crawling any
URI. The response provides important information, such as the
content type and length, before WebCopy actually tries to
download any content. This speeds up crawls where you are
excluding content types that belong to large binary files.
However, if the web server doesn't support, or has disabled, the
HEAD method, then any crawl of that server will fail.
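The idea of checking headers before downloading can be sketched in
Python as follows. This is only an illustration of the general
technique, not WebCopy's actual implementation; the function names,
the excluded content types and the size limit are all assumptions
made for the example.

```python
import urllib.request

# Hypothetical exclusion rules, chosen only for this example.
EXCLUDED_TYPES = ("application/zip", "video/")
MAX_BYTES = 50 * 1024 * 1024


def should_download(content_type, content_length):
    """Decide from header values alone whether a URI is worth downloading."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if any(media_type.startswith(prefix) for prefix in EXCLUDED_TYPES):
        return False
    # Content-Length is optional; ignore it when missing or malformed.
    if content_length and content_length.strip().isdigit():
        if int(content_length) > MAX_BYTES:
            return False
    return True


def fetch_headers(url):
    """Issue a HEAD request and return (content type, content length)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        headers = response.headers
        return headers.get("Content-Type"), headers.get("Content-Length")
```

If the server rejects HEAD, urlopen raises urllib.error.HTTPError
inside fetch_headers, which corresponds to the crawl failure
described above.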
If this happens, you need to disable the use of the HEAD
command by WebCopy. To do this, display the Project
Properties dialogue, select the Advanced category, then
uncheck Use Header Checking. Click OK to save your
changes and close the dialogue, then retry the crawl.
How can I check up front if HEAD is supported?
You can use the Test URI feature of WebCopy to determine if the
URI you want to crawl supports the HEAD method. Simply click
Test URI from the toolbar, enter the URL of the site to test,
then click Test. WebCopy will try to retrieve the headers, and
will notify you of any problems. You can then use the same
window to switch to GET and see if this works.
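Outside of WebCopy, the same check can be approximated with a short
script. The sketch below assumes plain Python with urllib; the names
http_status and head_is_usable are made up for this example and do
not correspond to anything in WebCopy.

```python
import urllib.error
import urllib.request


def http_status(url, method):
    """Return the HTTP status code the server gives for one request."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return error.code


def head_is_usable(url, status_of=http_status):
    """Return True if HEAD works, False if only GET works.

    Raises RuntimeError when GET is refused as well, since disabling
    header checking would not help in that case.
    """
    if status_of(url, "HEAD") < 400:
        return True
    if status_of(url, "GET") < 400:
        return False
    raise RuntimeError("both HEAD and GET were refused for " + url)
```

A result of False suggests that turning off Use Header Checking is
worth trying. The status_of parameter exists only so the logic can
be exercised without a network connection.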
Servers rejecting custom user agents
Another common case is where a server checks the client user
agent and returns 403 if it doesn't match what it is
expecting. In this scenario, changing the user agent to mimic
one used by a traditional web browser may help.
WebCopy 1.9 and above will try to mitigate this by testing with
the default user agent and, in the event of a 403, will retest
with a generic agent and automatically use this if successful.
To change the user agent, display the Project Properties dialogue and select the User Agent category. Next select Use custom user agent and either enter a custom value or choose from a pre-defined list. Click OK to save your changes and close the dialogue, then retry the crawl.
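The retest-with-another-agent approach can be sketched as below. It
is a rough illustration under stated assumptions, not WebCopy's
code: both user agent strings are placeholders invented for the
example, as are the function names.

```python
import urllib.error
import urllib.request

# Both values are placeholders for this example, not WebCopy's real strings.
DEFAULT_AGENT = "ExampleCrawler/1.0"
BROWSER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0"


def status_with_agent(url, user_agent):
    """Return the HTTP status code received when sending this user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def choose_user_agent(url, status_of=status_with_agent):
    """Return the first agent not answered with 403, or None if both are."""
    for agent in (DEFAULT_AGENT, BROWSER_AGENT):
        if status_of(url, agent) != 403:
            return agent
    return None
```

If choose_user_agent returns None, the 403 is probably not caused by
the user agent, and a genuine permission problem is more likely.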
- 2013-04-28 - First published
- 2021-04-02 - Updated to include user agents