What do you do with a drunken Googlebot? - Page 2

Comments

  • @JustAMacUser said:
    Sadly there isn't a response code for "I know what you're doing but disagree with your methodology."

    There must be, unless you're suggesting that the server just hang and wait for the timeout.

    @ricardo said:
    A 418 might be an appropriate response for an impasse.

    That's still an error that gets logged on my end. The only error I should see is one I can fix, or one that is legitimately a problem for the end user.

    @Razza said:
    OK, I don't understand your logic: if a page doesn't exist, it's a 404, not a 204, as 2xx codes are for success.

    Exactly. It is by all measures a success if a request comes in for a URL where no content is expected, and I return just that. There are all sorts of things you can reasonably return other than a 404 when a "bad" request is received, the most common probably being a redirect.

    You don't have to be an arse with your Willy Wonka meme, as @joepie91 was only trying to point out your flawed logic.

    It's just a graphical depiction of what my server is telling Google, not any sort of attack on joepie.

    @joepie91 said:
    The specifications are quite clear about the intent of the 404 status code:

    The server has not found anything matching the Request-URI.

    Exactly why 404 doesn't apply here. I did find a match for Google's fabricated URLs: no content.

    and the 204 status code is meant for cases where ...

    The server has fulfilled the request but does not need to return an entity-body, and might want to return updated metainformation.

    Exactly the situation outlined in my original post (here's your nothing, and have a cookie). So either you're agreeing that 204 is the correct response, or you're suggesting that some other "more specific" 2xx code is appropriate. I'm all ears regarding what you think is the correct non-error result to give.

    ... but excluding the cases where the resource is not found, since that is already covered by 404. Further, "fulfilled" here means "processing the request by serving the requested resource". If a resource is not found, you cannot do that.

    Again, I did find the resource. It was nothing. I returned that nothing.

    Let me ask you to respond to what you think is the correct way to deal with robots.txt like I mentioned previously. If I have nothing to tell the spiders, I can either leave it returning a 404 error, create an otherwise empty file returning a 200, or give a 204 like I do for these other requests. What do you think is the best practice for all parties involved?

    Yes, it is. The client requested a resource that does not exist. That's a client error.

    But Google isn't fixing their error. In fact, they intentionally wrote their spider to behave badly like that. I have my RFC-compliant workaround that keeps Google's problems out of my log file. What do you suggest is the better way to deal with the drunken Googlebot?

    It's not your job to try and determine the intentions of a client.

    Yes, it is. If I have a client coming in (here's another fabrication Google loves to do) looking for a /m/ or /mobile/ site, I absolutely can determine what their intention is in relation to my server, and consequently redirect them back to / if I have a responsive site (or a subdomain, or whatever else might reasonably fulfill their request). Everything a web server does beyond serving up static files is about trying to determine the intentions of a client.

    Hence, you return a 404 for "not found", and let Googlebot worry about what a 404 means for its purpose.

    But I did find the resource they requested! It was nothing. I returned it with the proper 204 code. They're welcome to figure out what that means for their purpose, too.

    Look, you can try to redefine status codes and "well, technically" your way through it, but the reality is that you are going to break shit, because you're violating the expectations the clients have. Just stick with the spec, and when the spec is unclear, work on understanding what the spirit of the specification is, and how it is commonly implemented. There's really no discussion to be had here.

    I thought so, too, when it ended months ago. Yet here we are again. :-) I again refer you to the robots.txt example if you need something more concrete to consider; I'm not trying to "redefine" anything.

  • joepie91 Member, Patron Provider

    If you want to pretend that you've found a resource that doesn't exist, then there's no discussion to be had here. This is precisely the "well, technically"ing that I was referring to. I have no intention of discussing here (since I already know what the correct answer is, and your arguments are nonsense), I'm just trying to explain it to you.

    If you want to ignore that, that's your call.

    impossiblystupid said: Let me ask you to respond to what you think is the correct way to deal with robots.txt like I mentioned previously. If I have nothing to tell the spiders, I can either leave it returning a 404 error, create an otherwise empty file returning a 200, or give a 204 like I do for these other requests. What do you think is the best practice for all parties involved?

    A 404.

    Thanked by 3: mycosys, Clouvider, Razza
  • @impossiblystupid said:

    @JustAMacUser said:
    Sadly there isn't a response code for "I know what you're doing but disagree with your methodology."

    There must be, unless you're suggesting that the server just hang and wait for the timeout.

    I was actually making a joke, but in this particular case the correct response to a request for a resource that is not in existence is 404.

  • rincewind Member
    edited September 2016

    Lol. Waiting for the day when AI bots start trolling each other #FutureofLET

  • @rincewind said:
    Lol. Waiting for the day when AI bots start trolling each other #FutureofLET

    Oh god, please tell me you haven't let the luggage into cyberspace?

    Thanked by 1: rincewind
  • @joepie91 said:
    If you want to pretend that you've found a resource that doesn't exist, then there's no discussion to be had here. This is precisely the "well, technically"ing that I was referring to. I have no intention of discussing here (since I already know what the correct answer is, and your arguments are nonsense), I'm just trying to explain it to you.

    And from my point of view, it is you who is pulling the "well, technically" card. I'm not in any way "pretending" to find a resource that doesn't exist. Whether it's an empty robots.txt file or any of Google's fabricated URLs (or all sorts of other similar files that clients just assume will exist), I have specifically implemented the process of finding them, and it turns out they lack any useful content, so I respond appropriately.

    impossiblystupid said: Let me ask you to respond to what you think is the correct way to deal with robots.txt like I mentioned previously. If I have nothing to tell the spiders, I can either leave it returning a 404 error, create an otherwise empty file returning a 200, or give a 204 like I do for these other requests. What do you think is the best practice for all parties involved?

    A 404.

    We simply have different standards for how a web site should be properly run. If I can fix an error, I do. I don't know why that approach seems to rub people the wrong way.

  • joepie91 Member, Patron Provider

    impossiblystupid said: And from my point of view, it is you who is pulling the "well, technically" card.

    No. I'm telling you what clients expect.

    impossiblystupid said: I'm not in any way "pretending" to find a resource that doesn't exist. Whether it's an empty robots.txt file or any of Google's fabricated URLs (or all sorts of other similar files that clients just assume will exist), I have specifically implemented the process of finding them, and it turns out they lack any useful content, so I respond appropriately.

    I don't think you understand what "resource" means here...

    impossiblystupid said: We simply have different standards for how a web site should be properly run.

    Again - I'm telling you what clients expect. This isn't really a point of discussion. You can either implement it, or choose to break shit instead.

  • SplitIce Member, Host Rep

    Redirect your site to 127.0.0.1 and give Googlebot a ride home.

  • @impossiblystupid said:

    @joepie91 said:
    If you want to pretend that you've found a resource that doesn't exist, then there's no discussion to be had here. This is precisely the "well, technically"ing that I was referring to. I have no intention of discussing here (since I already know what the correct answer is, and your arguments are nonsense), I'm just trying to explain it to you.

    And from my point of view, it is you who is pulling the "well, technically" card. I'm not in any way "pretending" to find a resource that doesn't exist. Whether it's an empty robots.txt file or any of Google's fabricated URLs (or all sorts of other similar files that clients just assume will exist), I have specifically implemented the process of finding them, and it turns out they lack any useful content, so I respond appropriately.

    Mate - very simple. There is a difference between an empty set and an undefined set. 2xx indicates an empty set: a valid file that has been processed and served. 404 indicates there is no set defined: not just no content, but no content container. If someone tells you to go get a box in the corner, would you return after not finding it and tell them it was empty? Only if it was a practical joke.
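
The empty-set versus undefined-set distinction can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical in-memory `RESOURCES` table and a made-up `respond` helper in place of a real web server's document root; it is not the configuration discussed in the original post.

```python
from http import HTTPStatus

# Hypothetical resource table (an assumption for illustration only).
# A key that is present but maps to b"" is an "empty box";
# a key that is absent is "no box at all".
RESOURCES = {
    "/": b"<h1>Home</h1>",
    "/robots.txt": b"",
}


def respond(path, resources=RESOURCES):
    """Return (status, body) for a request path."""
    if path not in resources:
        # Undefined set: nothing is registered under this path.
        return HTTPStatus.NOT_FOUND, b"Not Found"
    body = resources[path]
    if not body:
        # Empty set: the resource exists but has nothing to send.
        return HTTPStatus.NO_CONTENT, b""
    return HTTPStatus.OK, body
```

Seen this way, the disagreement in the thread is about whether a URL the site never published belongs in the table at all, not about what to return once it is there.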

  • 204 means there is no content to return, or nothing served apart from the headers.

    https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

    The 204 response MUST NOT include a message-body, and thus is always terminated by the first empty line after the header fields

    ^^^^^

    If you serve anything other than a blank page with that 204, you're using it wrong. You should be using a 404. The client requested something that doesn't exist.

  • @impossiblystupid said:
    Any other tips or tricks you use to keep the clutter in your log files to a minimum?

    I would disable the logging altogether if I can't deal with the log itself.

  • Jeez...what a thick ****.

  • @joepie91 said:

    impossiblystupid said: And from my point of view, it is you who is pulling the "well, technically" card.

    No. I'm telling you what clients expect.

    But you had just previously said "It's not your job to try and determine the intentions of a client." Are you changing your tune on that point?

    I don't think you understand what "resource" means here...

    And I don't think you understand the usefulness of a null return value. Your JavaScript coding advice must be horrible! :-)

    Again - I'm telling you what clients expect. This isn't really a point of discussion. You can either implement it, or choose to break shit instead.

    Absolutely nothing should break if a client gets back a 204 response for a resource instead of a 404 or a 200 or a 301. That is especially true if the URL was completely fabricated in the first place, so the client should have no expectation on what gets returned.

    @mycosys said:
    If someone tells you to go get a box in the corner would you return after not finding it and tell them it was empty? Only if it was a practical joke.

    Or if it was a crazy drunk who was constantly asking me to get their box. Or a child. I would absolutely pantomime giving them an invisible box. My mind is still flexible to that kind of lateral thinking.

    @exussum said:
    If you serve anything other than a blank page with that 204.

    The original post makes it clear that I'm doing just that, so I don't know what you're going on about here. You could have even tested it directly with one of the example URLs. You'll find you get back a big fat Content-Length: 0.
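
For anyone who wants to run that kind of check themselves, here is a rough Python sketch using the standard library's `http.client`. The helper names `fetch` and `is_bodiless_204` are made up for illustration, and the host and path are placeholders to be filled in with a site and URL of your own.

```python
import http.client


def is_bodiless_204(status, body):
    """True when a response is a 204 that carried no message body."""
    return status == 204 and len(body) == 0


def fetch(host, path, port=80, timeout=10):
    """GET a path and return (status, Content-Length header or None, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read()
        return resp.status, resp.getheader("Content-Length"), body
    finally:
        conn.close()
```

Calling `fetch("example.com", "/some/fabricated/url")` (both arguments are placeholders) and passing the returned status and body to `is_bodiless_204` shows whether a server is answering with a 204 and an empty body, as opposed to a 404 or a 204 that wrongly includes a page.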

  • Google is pretty lenient with status codes either way. It's either going to see content that it will potentially index from a 200 status code, or a 30* redirect pointing to another potentially indexable resource; otherwise it's not going to do much else with it. If it continually sees errors of some kind, it may affect the crawl budget of your site.

    One way to 'end the argument' is to verify your site in Webmaster Tools and see if there are errors being reported wrt your 204 responses. I suspect not. TBH, the consensus would be to serve a 404, but in practical terms I don't think serving a 204 is going to make a difference.

    Thanked by 1: impossiblystupid
  • daily Member
    edited September 2016

    @exussum said:
    204 says no content. If you serve any HTML, you're doing it wrong. A 204 should be a white page only. 404 is the correct usage for not finding a client-specified page.

    This. Strange this was ignored by the OP. (edit: It wasn't, my bad.)

    If you are sending even a page that notes "this page or content does not exist", then it isn't a 204, it is a 404.

    impossiblystupid said: You could have even tested it directly with one of the example URLs. You'll find you get back a big fat Content-Length: 0.

    Can't really when we don't have the website you're talking about.

  • @daily said:

    impossiblystupid said: You could have even tested it directly with one of the example URLs. You'll find you get back a big fat Content-Length: 0.

    Can't really when we don't have the website you're talking about.

    It's the one in my signature. Or the directives given in the original post could just be dropped into your own site config to test the results.
