New on LowEndTalk? Please Register and read our Community Rules.
All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.
What do you do with a drunken Googlebot?
impossiblystupid
Member
in General
I made a post to my blog about this over the weekend, but l want to open it up to a larger discussion here. Is there anything you do to keep spiders from behaving badly on your web sites? Something less severe than just banning their subnet at the firewall, of course. :-)
For example, for the random 404's that Google normally insists on bothering me with:
66.249.64.235 - - [26/May/2016:11:23:55 -0400] "GET /yrjclqajwyshc.html HTTP/1.1"
66.249.64.10 - - [27/May/2016:11:15:02 -0400] "GET /ysveybimgdu.html HTTP/1.1"
66.249.64.3 - - [02/Jun/2016:10:20:53 -0400] "GET /iqswwijkbkk.html HTTP/1.1"
66.249.64.243 - - [03/Jun/2016:10:11:18 -0400] "GET /qfmtujzxykv.html HTTP/1.1"
I added the following to my site's .htaccess file so that it gives a 204 response (No Content) instead of logging an error:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^[a-z]{8,16}\.html$ http://www.google.com/ [R=204,L,CO=google:stop_your_404_probing:impossiblystupid.com]
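For anyone who'd rather reason about the rule outside Apache, here's a rough Python sketch of the same matching logic (the function name and the handling of existing files are my own assumptions, not part of the .htaccess rule):

```python
import re

# Pattern mirroring the .htaccess rule: 8-16 lowercase letters, then ".html"
PROBE_PATTERN = re.compile(r"^[a-z]{8,16}\.html$")

def response_status(path, file_exists):
    """Return the HTTP status this rule would produce for a request path."""
    name = path.lstrip("/")
    if not file_exists and PROBE_PATTERN.match(name):
        return 204  # swallow the probe instead of logging a 404
    return 200 if file_exists else 404

print(response_status("/yrjclqajwyshc.html", False))  # 204
print(response_status("/index.html", True))           # 200
print(response_status("/no-such-page.php", False))    # 404
```

The RewriteCond `!-f` check corresponds to the `file_exists` guard: a real file with an unlucky name is still served normally.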
Any other tips or tricks you use to keep the clutter in your log files to a minimum?
Comments
Googlebot is not drunk. Those requests are done on purpose.
Google is testing your site for a proper HTTP response code to non-existent files/documents.
You are complaining about one request per day?
To what end?
I'm not sure. Maybe testing for redirections?
This is nothing new; these odd filename requests have been going on for years from Googlebot.
This reminds me of people who claim that their VPS is generating extreme load because Google is crawling their site.
What globalregisters said.
Google for 'soft 404'. Much easier for Google to check whether you return a 404 when it's nearly certain you should, rather than attempting to guess by the wordage of your page and (non-404) HTTP response.
What it basically means is that Google has a lower confidence level about the content that is served, because it can't be sure it's some fancy (and possibly temporary) error page or a document that's useful to satisfy a user's query.
Never said they had no purpose . . . for Google. For me, they're just an annoying "error" that was getting logged, so I decided to change that and thought I'd share.
I'd argue that 204 is more proper than 404 here.
No, I'm opening a discussion about all kinds of log entries that get generated by poorly written spiders. If the error threshold that gets your attention is higher than mine, I still welcome you to share the techniques you use to stop them in their tracks.
For Googlebot and all other robots.txt obeying spiders
User-agent: *
Disallow: /
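The effect of that blanket rule can be checked with Python's standard urllib.robotparser (the example URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse the blanket disallow rule from the post above
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Any compliant spider is refused everywhere on the site
print(rp.can_fetch("Googlebot", "http://example.com/"))           # False
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))  # False
```

Of course, this stops all compliant crawling, not just the 404 probes, so it's the nuclear option.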
You can take whatever stand you want.
If you don't care about Google's opinion of your site, then serve whatever suits you.
The fact remains that Googlebot wants to see a 404 in this situation.
It shouldn't be about taking a stand, but deciding to do the correct thing. It is wrong for Google to be making up random URLs it has every reason to think will not lead to content. It is right to respond to them with a "no content" result.
Do you have a documented reference for that "fact"? I can see why they might not want a 200 response (i.e., a soft 404). But a 204 should be seen as an even better response than a 404 to a request for content that is known to not exist.
Isn't that the webmaster tools verification file?
They probe it to see if a particular account/token is still allowed access.
Except the 200 series of codes indicates an acceptable response, while the 400 series indicates that the client was somehow wrong to make the request it did.
The RFC would seem to disagree with you; regarding 204 it states:
The 204 (No Content) status code indicates that the server has successfully fulfilled the request and that there is no additional content to send in the response payload body.
Whereas 404 states:
The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists.
I get what you're saying... Google knows the request is not valid, so if your server knows that Google knows, 204 should be its response. But your server doesn't know, and Google (or anything, for that matter) is just picking a random URL, so your server should respond appropriately, with a 404.
I interpret 204 to be something like this: You have a WYSIWYG editor with a user-configurable toolbar, the configuration of said toolbar is stored server-side. When the user makes a change, a request is placed to the server to store that information; if the request succeeds that's 204--everything worked but the server has nothing else to say. If it fails, that's 4xx or 5xx, depending on why. (The above-linked RFC also provides a similar example for 204).
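That reading of 204 can be sketched with a toy server (the handler, endpoint path, and payload are all made up for illustration):

```python
import http.server
import threading
import urllib.request

class ToolbarHandler(http.server.BaseHTTPRequestHandler):
    """Toy endpoint that stores a toolbar config and answers 204 on success."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # pretend to persist the user's toolbar settings
        self.send_response(204)  # success, but nothing further to say
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), ToolbarHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = urllib.request.Request(
    f"http://127.0.0.1:{port}/toolbar",
    data=b'{"bold": true}',
    method="POST",
)
resp = urllib.request.urlopen(req)
body = resp.read()
print(resp.status, len(body))  # 204 0
server.shutdown()
```

The request succeeded and there is genuinely nothing to return, which is the 204 case the RFC's own example describes.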
No it's not. What you are referring to is a file to verify you are in control of the website content for a domain.
This is a request for a web page that should not exist on the site and thus is expecting a 404.
I'd respond with a 404 not a 204.
I was aware of the RFC definitions before I picked that response code. It doesn't really disagree with me, either; its wording is, at best, poorly chosen. I mean, I am successfully fulfilling the request without additional content, therefore a 204. I did find the "current representation" for Google's request (a lot of nothing :-), and I'm fully disclosing that it is nothing, so it's not a 404.
But it does now. I put the directive there myself! I get that the fallback position should not be a 200 (in most cases, anyway), but there are many other completely valid ways to handle such a URL than just kicking out a 404.
It certainly can be, and maybe should be. But I'll wager that for modern web services, you actually get a lot more 200 responses to that than 204's.
To me it comes down to this: is Google broken or operating as designed when it sends these bad requests? If it is broken, it should of course get a 404 back so they can fix their spider (and/or I can fix my server). By all accounts, though, it is operating as intended when it intentionally spiders non-existent URLs, so it really should be getting a 2XX response of some kind on a well-run server.
By extension, if we look at "common" files that are expected to exist on most servers (like robots.txt or favicon.ico or any of the newer types that Apple is assuming everyone should start using), what is the "proper" response for them? They're nothing special, so they result in a 404 like anything else that isn't found. But not having them isn't really an error. And the solution for people who don't want them (and don't want the pseudo-error logged) should not be to create empty files all over the place and then return a 200. If I know what they want and I know I don't have it, what is a better response than a 204?
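The position in the paragraph above could be expressed as a small policy function (the set of "expected" paths is illustrative, not a complete list):

```python
# Well-known paths that clients routinely request but a site may simply not have
EXPECTED_OPTIONAL = {"/robots.txt", "/favicon.ico", "/apple-touch-icon.png"}

def status_for(path, exists):
    """One way to encode the poster's stance on expected-but-absent files."""
    if exists:
        return 200  # serve the real file
    if path in EXPECTED_OPTIONAL:
        return 204  # "I know what you want, and I have nothing" - no error logged
    return 404      # genuinely not found

print(status_for("/favicon.ico", False))   # 204
print(status_for("/robots.txt", True))     # 200
print(status_for("/missing.html", False))  # 404
```

Whether 204 is actually the right code here is exactly what the rest of the thread disputes; this just makes the proposal concrete.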
I see where you're coming from. At the same time I think the answer most people would respond to your question with is 404.
I just let the Googlebots do their thing because I get a lot of good traffic from Google and don't really want to mess it up.
The intention of a 204 status code is generally to indicate that you've successfully fulfilled a non-idempotent request (think eg. adding an item through an API), but there's nothing more to send back other than "yeah, that worked".
A 204 is explicitly not meant to indicate that "I don't have anything for this URL" - that's what a 404 is for. A 204 is meant when you do have something for the URL, it just doesn't have a response payload associated with it. Hence its use in non-idempotent requests, where that "something" is a request handler.
I don't see that as being the clear intent of the RFC, as I noted in my reply to JustAMacUser. You're welcome to quote and parse the specific wording you think supports your case. I maintain that if the client isn't making a request in good faith, it is perfectly reasonable for my server to respond with not an error, but "I see what you're trying to do, and . . ."
That is exactly what I have.
Sadly there isn't a response code for "I know what you're doing but disagree with your methodology."
Totally happens. Load up WordPress with a crappy theme filled with more AJAX than anything ever should have, install 150 plugins, 2 or more of which should be for ecommerce and 3 for security. That should blow a couple of CPU cores on every page load.
I wish I couldn't say that I've worked on this theoretical site many many times.
A 418 might be an appropriate response for an impasse.
OK, I don't understand your logic: if a page doesn't exist, it's a 404, not a 204, as 2xx codes are for success.
You don't have to be an arse with your Willy Wonka meme, as @joepie91 was only trying to point out your flawed logic.
That's because the RFCs were 1) not very detailed to begin with (in part because HTTP is meant to be generically usable), 2) not correctly implemented everywhere, and 3) the de facto implementations mostly decided the "real" meaning of HTTP status codes.
The specifications are quite clear about the intent of the 404 status code:
Logically speaking, and according to the usual rule of specifications that "more specific rules trump less specific rules", that means that for a non-existent resource, you are to use 404 unless it should be a 410 instead - and the 204 status code is meant for cases where ...
... but excluding the cases where the resource is not found, since that is already covered by 404. Further, "fulfilled" here means "processing the request by serving the requested resource". If a resource is not found, you cannot do that.
Yes, it is. The client requested a resource that does not exist. That's a client error.
It's not your job to try and determine the intentions of a client. The whole point of how HTTP is designed, is that it's to be both server-neutral and client-neutral - stateless messages that provide only objective information, and it's up to the receiving party to interpret it according to its own expectations.
Hence, you return a 404 for "not found", and let Googlebot worry about what a 404 means for its purpose.
Look, you can try to redefine status codes and "well, technically" your way through it, but the reality is that you are going to break shit, because you're violating the expectations the clients have. Just stick with the spec, and when the spec is unclear, work on understanding what the spirit of the specification is, and how it is commonly implemented. There's really no discussion to be had here.
Since Googlebot is drunk, drink with it!
There is! Well, sort of:
204 says no content. If you serve any HTML, you're doing it wrong; a 204 should be a blank page only. 404 is the correct usage for not finding a client-specified page.