New on LowEndTalk? Please Register and read our Community Rules.
All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.
What do you do with a drunken Googlebot?
impossiblystupid
Member
in General
I made a post to my blog about this over the weekend, but l want to open it up to a larger discussion here. Is there anything you do to keep spiders from behaving badly on your web sites? Something less severe than just banning their subnet at the firewall, of course. :-)
For example, for the random 404's that Google normally insists on bothering me with:
66.249.64.235 - - [26/May/2016:11:23:55 -0400] "GET /yrjclqajwyshc.html HTTP/1.1"
66.249.64.10 - - [27/May/2016:11:15:02 -0400] "GET /ysveybimgdu.html HTTP/1.1"
66.249.64.3 - - [02/Jun/2016:10:20:53 -0400] "GET /iqswwijkbkk.html HTTP/1.1"
66.249.64.243 - - [03/Jun/2016:10:11:18 -0400] "GET /qfmtujzxykv.html HTTP/1.1"
I added the following to my site's .htaccess file so that it gives a 204 response (No Content) instead of logging an error:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^[a-z]{8,16}\.html$ http://www.google.com/ [R=204,L,CO=google:stop_your_404_probing:impossiblystupid.com]
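For anyone who'd rather reason about the rule outside Apache, here's a rough Python sketch of the same matching logic (the function name and the handling of existing files are my own assumptions, not part of the .htaccess rule):

```python
import re

# Pattern mirroring the .htaccess rule: 8-16 lowercase letters, then ".html"
PROBE_PATTERN = re.compile(r"^[a-z]{8,16}\.html$")

def response_status(path, file_exists):
    """Return the HTTP status this rule would produce for a request path."""
    name = path.lstrip("/")
    if not file_exists and PROBE_PATTERN.match(name):
        return 204  # swallow the probe instead of logging a 404
    return 200 if file_exists else 404

print(response_status("/yrjclqajwyshc.html", False))  # 204
print(response_status("/index.html", True))           # 200
print(response_status("/no-such-page.php", False))    # 404
```

The RewriteCond `!-f` check corresponds to the `file_exists` guard: a real file with an unlucky name is still served normally.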
Any other tips or tricks you use to keep the clutter in your log files to a minimum?
Comments
Googlebot is not drunk. Those requests are done on purpose.
Google is testing your site for a proper HTTP response code to non-existent files/documents.
You are complaining about one request per day?
To what end?
I'm not sure. Maybe testing for redirections?
This is nothing new; these odd filename requests have been going on for years from Googlebot.
This reminds me of people who claim that their VPS is generating extreme load because Google is crawling their site.
What globalregisters said.
Google for 'soft 404'. Much easier for Google to check whether you return a 404 when it's nearly certain you should, rather than attempting to guess by the wordage of your page and (non-404) HTTP response.
What it basically means is that Google has a lower confidence level about the content that is served, because it can't be sure it's some fancy (and possibly temporary) error page or a document that's useful to satisfy a user's query.
Never said they had no purpose . . . for Google. For me, they're just an annoying "error" that was getting logged, so I decided to change that and thought I'd share.
I'd argue that 204 is more proper than 404 here.
No, I'm opening a discussion about all kinds of log entries that get generated by poorly written spiders. If the error threshold that gets your attention is higher than mine, I still welcome you to share the techniques you use to stop them in their tracks.
For Googlebot and all other robots.txt obeying spiders
User-agent: *
Disallow: /
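The effect of that blanket rule can be checked with Python's standard urllib.robotparser (the example URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse the blanket disallow rule from the post above
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Any compliant spider is refused everywhere on the site
print(rp.can_fetch("Googlebot", "http://example.com/"))           # False
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))  # False
```

Of course, this stops all compliant crawling, not just the 404 probes, so it's the nuclear option.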
You can take whatever stand you want.
If you don't care about Google's opinion of your site, then serve whatever suits you.
The fact remains that Googlebot wants to see a 404 in this situation.
It shouldn't be about taking a stand, but deciding to do the correct thing. It is wrong for Google to be making up random URLs it has every reason to think will not lead to content. It is right to respond to them with a "no content" result.
Do you have a documented reference for that "fact"? I can see why they might not want a 200 response (i.e., a soft 404). But a 204 should be seen as an even better response than a 404 to a request for content that is known to not exist.
Isn't that the webmaster tools verification file?
They probe it to see if a particular account/token is still allowed access.
Except the 200 series of codes indicates an acceptable response, while the 400 series indicates that the client was somehow wrong to make the request it did.
The RFC would seem to disagree with you; regarding 204 it states:
The 204 (No Content) status code indicates that the server has successfully fulfilled the request and that there is no additional content to send in the response payload body.
Whereas 404 states:
The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists.
I get what you're saying... Google knows the request is not valid, so if your server knows that Google knows, 204 should be its response. But your server doesn't know, and Google (or anything, for that matter) is just picking a random URL, so your server should respond appropriately, with a 404.
I interpret 204 to be something like this: You have a WYSIWYG editor with a user-configurable toolbar, the configuration of said toolbar is stored server-side. When the user makes a change, a request is placed to the server to store that information; if the request succeeds that's 204--everything worked but the server has nothing else to say. If it fails, that's 4xx or 5xx, depending on why. (The above-linked RFC also provides a similar example for 204).
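That reading of 204 can be sketched with a toy server (the handler, endpoint path, and payload are all made up for illustration):

```python
import http.server
import threading
import urllib.request

class ToolbarHandler(http.server.BaseHTTPRequestHandler):
    """Toy endpoint that stores a toolbar config and answers 204 on success."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # pretend to persist the user's toolbar settings
        self.send_response(204)  # success, but nothing further to say
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), ToolbarHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = urllib.request.Request(
    f"http://127.0.0.1:{port}/toolbar",
    data=b'{"bold": true}',
    method="POST",
)
resp = urllib.request.urlopen(req)
body = resp.read()
print(resp.status, len(body))  # 204 0
server.shutdown()
```

The request succeeded and there is genuinely nothing to return, which is the 204 case the RFC's own example describes.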
No it's not. What you are referring to is a file to verify you are in control of the website content for a domain.
This is a request for a web page that should not exist on the site and thus is expecting a 404.
I'd respond with a 404 not a 204.
I was aware of the RFC definitions before I picked that response code. It doesn't really disagree with me, either; its wording is, at best, poorly chosen. I mean, I am successfully fulfilling the request without additional content, therefore a 204. I did find the "current representation" for Google's request (a lot of nothing :-), and I'm fully disclosing that it is nothing, so it's not a 404.
But it does now. I put the directive there myself! I get that the fallback position should not be a 200 (in most cases, anyway), but there are many other completely valid ways to handle such a URL than just kicking out a 404.
It certainly can be, and maybe should be. But I'll wager that for modern web services, you actually get a lot more 200 responses to that than 204's.
To me it comes down to this: is Google broken or operating as designed when it sends these bad requests? If it is broken, it should of course get a 404 back so they can fix their spider (and/or I can fix my server). By all accounts, though, it is operating as intended when it intentionally spiders non-existent URLs, so it really should be getting a 2XX response of some kind on a well-run server.
By extension, if we look at "common" files that are expected to exist on most servers (like robots.txt or favicon.ico or any of the newer types that Apple is assuming everyone should start using), what is the "proper" response for them? They're nothing special, so they result in a 404 like anything else that isn't found. But not having them isn't really an error. And the solution for people who don't want them (and don't want the pseudo-error logged) should not be to create empty files all over the place and then return a 200. If I know what they want and I know I don't have it, what is a better response than a 204?
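The position in the paragraph above could be expressed as a small policy function (the set of "expected" paths is illustrative, not a complete list):

```python
# Well-known paths that clients routinely request but a site may simply not have
EXPECTED_OPTIONAL = {"/robots.txt", "/favicon.ico", "/apple-touch-icon.png"}

def status_for(path, exists):
    """One way to encode the poster's stance on expected-but-absent files."""
    if exists:
        return 200  # serve the real file
    if path in EXPECTED_OPTIONAL:
        return 204  # "I know what you want, and I have nothing" - no error logged
    return 404      # genuinely not found

print(status_for("/favicon.ico", False))   # 204
print(status_for("/robots.txt", True))     # 200
print(status_for("/missing.html", False))  # 404
```

Whether 204 is actually the right code here is exactly what the rest of the thread disputes; this just makes the proposal concrete.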
I see where you're coming from. At the same time I think the answer most people would respond to your question with is 404.
I just let the Googlebots do their thing because I get a lot of good traffic from Google and don't really want to mess it up.
The intention of a 204 status code is generally to indicate that you've successfully fulfilled a non-idempotent request (think eg. adding an item through an API), but there's nothing more to send back other than "yeah, that worked".
A 204 is explicitly not meant to indicate that "I don't have anything for this URL" - that's what a 404 is for. A 204 is meant when you do have something for the URL, it just doesn't have a response payload associated with it. Hence its use in non-idempotent requests, where that "something" is a request handler.
I don't see that as being the clear intent of the RFC, as I noted in my reply to JustAMacUser. You're welcome to quote and parse the specific wording you think supports your case. I maintain that if the client isn't making a request in good faith, it is perfectly reasonable for my server to respond with not an error, but "I see what you're trying to do, and . . ."
That is exactly what I have.
Sadly there isn't a response code for "I know what you're doing but disagree with your methodology."
Totally happens. Load up WordPress with a crappy theme filled with more AJAX than anything ever should have, install 150 plugins, 2 or more of which should be for ecommerce and 3 for security. That should blow a couple of CPU cores on every page load.
I wish I couldn't say that I've worked on this theoretical site many many times.
A 418 might be an appropriate response for an impasse.
OK, I don't understand your logic: if a page doesn't exist, it's a 404, not a 204, as 2xx codes are for success.
You don't have to be an arse with your Willy Wonka meme, as @joepie91 was only trying to point out your flawed logic.
That's because the RFCs were 1) not very detailed to begin with (in part because HTTP is meant to be generically usable), 2) not correctly implemented everywhere, and 3) the de facto implementations mostly decided the "real" meaning of HTTP status codes.
The specifications are quite clear about the intent of the 404 status code:
Logically speaking, and according to the usual rule of specifications that "more specific rules trump less specific rules", that means that for a non-existent resource, you are to use 404 unless it should be a 410 instead - and the 204 status code is meant for cases where ...
... but excluding the cases where the resource is not found, since that is already covered by 404. Further, "fulfilled" here means "processing the request by serving the requested resource". If a resource is not found, you cannot do that.
Yes, it is. The client requested a resource that does not exist. That's a client error.
It's not your job to try and determine the intentions of a client. The whole point of how HTTP is designed, is that it's to be both server-neutral and client-neutral - stateless messages that provide only objective information, and it's up to the receiving party to interpret it according to its own expectations.
Hence, you return a 404 for "not found", and let Googlebot worry about what a 404 means for its purpose.
Look, you can try to redefine status codes and "well, technically" your way through it, but the reality is that you are going to break shit, because you're violating the expectations the clients have. Just stick with the spec, and when the spec is unclear, work on understanding what the spirit of the specification is, and how it is commonly implemented. There's really no discussion to be had here.
Since Googlebot is drunk, drink with it!
There is! Well, sort of:
204 says no content. If you serve any HTML, you're doing it wrong; a 204 should be a blank page only. 404 is the correct usage for not finding a client-specified page.