When crawling a page, the crawl fails with a "forbidden" error (code 403)
As well as the obvious case where you try to access a resource you simply don't have permission for, web servers can return a 403 (Forbidden) status for other reasons. This article describes two other common cases Cyotek has identified.
Incorrect handling of HEAD requests
The first case can occur when you try to crawl a website which doesn't allow the HEAD method, but returns 403 rather than 405 (Method Not Allowed).
WebCopy 1.8 and above will try to mitigate this by testing with GET if a HEAD request fails, and will automatically disable header checking for affected domains.
While the HTTP protocol defines a number of methods, WebCopy makes use of only three of these - HEAD, GET and POST.
By default, WebCopy issues a HEAD request before crawling any URI; this provides important information such as the content type and length before actually trying to download any content. This speeds up crawls where you are excluding content types that belong to large binary files. However, if the web server doesn't support, or has disabled, the HEAD method, then any crawl of that server will fail.
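To see why this header check matters, the sketch below illustrates the general idea; it is only a minimal example using Python's standard urllib module, not how WebCopy itself is implemented, and the URL and size limit are placeholders. It issues a HEAD request and uses the reported content type and length to decide whether a full download is worthwhile.

```python
import urllib.request

def should_download(url, max_bytes=10_000_000):
    """Issue a HEAD request and decide from the headers alone whether
    the resource is worth downloading in full."""
    request = urllib.request.Request(url, method="HEAD")
    # If the server rejects HEAD, urlopen raises an HTTPError - on some
    # servers with code 403 rather than 405, the failure described above.
    with urllib.request.urlopen(request) as response:
        content_type = response.headers.get("Content-Type", "")
        content_length = int(response.headers.get("Content-Length") or 0)
    # Skip large binary files without ever transferring their bodies.
    if not content_type.startswith("text/") and content_length > max_bytes:
        return False
    return True

print(should_download("https://example.com/"))
```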
If this happens, you need to disable WebCopy's use of the HEAD method. To do this, display the Project Properties dialogue, select the Advanced category, then uncheck Use Header Checking. Click OK to save your changes and close the dialogue, then retry the crawl.
How can I check up front if HEAD is supported?
You can use the Test URI feature of WebCopy to determine whether the URI you want to crawl supports the HEAD method. Simply click Test URI from the toolbar, enter the URL of the site to test, and click Test. WebCopy will try to retrieve the headers and will notify you of any problems. You can then use the same window to switch to GET and see if this works.
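If you prefer to check this outside of WebCopy, a short script can perform a similar test. The following is only a sketch, again assuming Python's standard urllib module and a placeholder example.com URL: it issues a HEAD request and, if the server rejects it, repeats the request with GET so you can compare the two responses.

```python
import urllib.request
import urllib.error

def test_uri(url):
    """Report how the server responds to HEAD and, if that is rejected, to GET."""
    for method in ("HEAD", "GET"):
        request = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(request) as response:
                print(f"{method}: {response.status} {response.reason}")
                return
        except urllib.error.HTTPError as error:
            # A 403 or 405 here suggests the method itself is being blocked.
            print(f"{method}: {error.code} {error.reason}")

test_uri("https://example.com/")
```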
Servers rejecting custom user agents
Another common case is where a server checks the client user agent and returns 403 if it doesn't match what it is expecting. In this scenario, changing the user agent to mimic one used by a traditional web browser may help.
WebCopy 1.9 and above will try to mitigate this by testing with the default user agent; in the event of a 403, it will retest with a generic agent and automatically use this if it is successful.
To change the user agent, display the Project Properties dialogue and select the User Agent category. Next select Use custom user agent and either enter a custom value or choose from a pre-defined list. Click OK to save your changes and close the dialogue, then retry the crawl.
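This retry logic is straightforward to sketch outside of WebCopy. The example below is only an illustration using Python's standard urllib module; the browser-style user-agent string and the example.com URL are placeholders, not values WebCopy itself uses.

```python
import urllib.request
import urllib.error

# A generic browser-style user agent; the exact string is only an example.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

def fetch(url, user_agent=None):
    """Request a URL, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.status

def fetch_with_fallback(url):
    """Try the default user agent first; on a 403, retry with a browser-like one."""
    try:
        return fetch(url), "default user agent"
    except urllib.error.HTTPError as error:
        if error.code == 403:
            return fetch(url, BROWSER_USER_AGENT), "browser user agent"
        raise

print(fetch_with_fallback("https://example.com/"))
```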
More Information
Update History
- 2013-04-28 - First published
- 2021-04-02 - Updated to include user agents