Question

I've recently started paying greater attention to my 404 errors in order to clean up what I can and improve my site's SEO and ranking, and have noticed something that I don't understand.

In my 404 error log, I'm seeing quite a few searches conducted by user agents that look like this:

python-requests/2.23.0

There are several others that are similar, but they are all requesting files that don't exist.

What are the python searches? Are they like bad bots? How do I block or prevent them?

I have a lot of bad bots too, and I found an older (2017) resource with some code to block them by User-Agent in my .htaccess file, which I implemented but it does not seem to be working - I still see logs of those bad bots also requesting mostly non-existent resources, as well as a lot of posts with /email or /print appended. Is there any truly effective way to block bad user-agents?

Solution

What are the python searches? Are they like bad bots?

Most probably just "bad bots" searching for potential vulnerabilities.

How do I block or prevent them?

Well, you are already serving a 404 by the sounds of it, so it's really a non-issue. However, you can prevent the request from going through WordPress by blocking it early in .htaccess, as you are probably already doing.

For example, at the top of your .htaccess file:

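# Match any user-agent containing "python" (NC = case-insensitive)
RewriteCond %{HTTP_USER_AGENT} python [NC]
# Leave the URL unchanged (-) and respond with 404 Not Found
RewriteRule ^ - [R=404]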

The above sends a 404 Not Found for any request from a user-agent that contains "python" (not case-sensitive).
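If you would rather make the block explicit, a variant of the same rule can return 403 Forbidden instead and cover several user-agents at once. This is only a minimal sketch, assuming mod_rewrite is enabled on your server; the extra substrings "curl" and "wget" are illustrative assumptions, not taken from your logs, so swap in whatever actually appears there:

```apache
# Harmless if RewriteEngine is already enabled further down in the WordPress block.
RewriteEngine On

# Hypothetical list of user-agent substrings to block (NC = case-insensitive).
RewriteCond %{HTTP_USER_AGENT} (python|curl|wget) [NC]
# F = respond with 403 Forbidden; processing of further rewrite rules stops here.
RewriteRule ^ - [F]
```

As with the 404 version, this needs to sit above the WordPress rules so the request is rejected before WordPress is loaded.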

However, blocking by user-agent isn't particularly reliable, since many "bad bots" pretend to be regular users by sending a browser's user-agent string.

I found an older (2017) resource with some code to block them by User-Agent in my .htaccess file, which I implemented but it does not seem to be working - I still see logs of those bad bots

If you block the "bad bot" in .htaccess you will still see the request in your server's access log. However, the log entry should show the HTTP status as 403 or 404 if it is blocked.

The only way to block the request entirely from hitting your server (and appearing in your server's logs) is if you have a frontend proxy-server / firewall that "screens" all your requests.

Other answers

User agents can be anything. It's the client that sets them, so I could make a curl request to your site and tell curl that my user agent is going to be "Tom is the best".

python-requests/2.23.0

This particular user agent implies that the Python requests library is making the request, but it gives no clue as to what is using the library or why ( https://pypi.org/project/requests/ ).

As for blocking them, this is something you would do at a deeper level than WordPress. You seem to already be familiar with Apache .htaccess; there may be lower levels at which they can be blocked, or your host or a proxy may be able to do it. That would be beyond the scope of this site.

As for why they're requesting non-existent resources, there could be lots of reasons:

  • A site elsewhere is referencing them and these bots are spidering over and hitting 404s
  • They're exploits. Malware will regularly fire and forget its entire arsenal in the hope that one attack will work, without even checking what comes back. My WP site regularly gets hit with Drupal exploits despite them being completely ineffective.
  • Broken sitemaps!
  • Those assets might have been available on older sites that were on the domain before site rebuilds

The only way to know for sure is to somehow find somebody doing it and ask them, which isn't normally possible.

Licensed under: CC-BY-SA with attribution
Not affiliated with wordpress.stackexchange