Google spidering based on security certificates?

I decided to use Let's Encrypt to generate a certificate for a number of my subdomains, wrapping them all into a single SAN certificate. I included a subdomain that is used essentially only for direct client testing; it's not linked to anywhere else on the web. Soon afterwards, Google began to spider it, giving itself away with the same annoying errors I mentioned in this discussion.

Is this a known/documented practice by Google, and presumably other spiders? I get that the information is in the cert, I just didn't figure spiders went digging that deep in their quest to hoover up everything. In the future, I guess I'm going to have to generate dedicated certificates for my "secret" servers. Word to the wise.

Thanked by: geekalot, WiredBlade

Comments

  • hzr Member
    edited September 2016

    Any certs are nearly instantly indexed.

    All certs from all major providers (especially LE) are automatically posted to CT logs like http://crt.sh/ to help catch misissued/hacked/stolen certs, etc.

    See here: https://www.certificate-transparency.org/what-is-ct
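
    For example, you can pull the logged hostnames yourself. A minimal sketch against crt.sh's JSON interface (the domain is a placeholder, and the name_value field is what crt.sh returns today, so treat that as an assumption):

        import json
        import urllib.request

        # Ask crt.sh for every logged cert covering a domain and its
        # subdomains (%25 is a URL-encoded % wildcard).
        domain = "example.com"  # placeholder; use your own domain
        url = "https://crt.sh/?q=%25." + domain + "&output=json"

        with urllib.request.urlopen(url) as resp:
            entries = json.load(resp)

        # name_value holds the cert's SAN hostnames, newline-separated.
        names = set()
        for entry in entries:
            names.update(entry["name_value"].splitlines())

        for name in sorted(names):
            print(name)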

  • Interesting if there's a correlation between your certs and Google crawling. Are you 100% sure there wasn't another way Google could have discovered your domain (gTLD zone files, Chrome/Android phoning home, etc.)?

  • If a person visits that subdomain using Chrome I assume Google's gonna know.

  • Google is also a domain registrar in their own right so they find out about new registrations the moment they happen. Maybe they react to new registrations by adding the domain to their queue of sites to index?

  • Clouvider Member, Patron Provider
    edited September 2016

    Wouldn't it be easier to just create a robots.txt and disallow what you don't want indexed?
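
    For reference, the catch-all robots.txt is just two lines (note it only asks well-behaved crawlers to stay out; it hides nothing):

        User-agent: *
        Disallow: /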

  • joepie91 Member, Patron Provider

    impossiblystupid said: Is this a known/documented practice by Google, and presumably other spiders?

    Google's business is to index as much data as possible. While they don't explicitly document all the ways they collect data, I've certainly seen odd things before - like unlisted Pastebin.com pastes (without a noindex tag) getting indexed once sent through GMail. It wouldn't surprise me at all if they used certificates to discover other subdomains.

    Thanked by: raindog308, Pwner
  • Neoon Community Contributor, Veteran
    edited September 2016

    @joepie91 said:
    While they don't explicitly document all the ways they collect data, I've certainly seen odd things before - like unlisted Pastebin.com pastes (without a noindex tag) getting indexed once sent through GMail.

    Skype also had bots that would visit any links you posted in a chat.

  • @hzr said:
    All certs by all major providers (especially LE) is posted automatically to CT logs like http://crt.sh/ to prevent misissued/hacked/stolen certs/etc

    This is interesting to know. It's not clear that my LE certificate was posted there, though, because a search there for my domain turns up zilch. I also didn't see anything on the LE site that indicates they purposely do such a thing. I get the desire for transparency for widely used public sites, but I really don't like the idea of every cert I issue directing traffic to that site. Sometimes I want secure traffic without broadcasting the existence of the server, and it'd be nice if I could do that with something other than a self-signed certificate.

    @ricardo said:
    Interesting if there's a correlation between your certs and Google crawling. Are you 100% sure there wasn't another way Google could have discovered your domain (gTLD zone files, Chrome/Android phoning home, etc.)?

    Can't be 100% sure just yet, what with a sample set of 1. When I get the time, I'm going to create another completely new subdomain that isn't in use anywhere except in a SAN certificate. If the new site gets spidered after the cert is read for another site on the list, then it'll be pretty clear.

    @Abdussamad said:
    Google is also a domain registrar in their own right so they find out about new registrations the moment they happen. Maybe they react to new registrations by adding the domain to their queue of sites to index?

    It wasn't a top-level domain, but a subdomain. And, in this case, it wasn't new at all, but from 2015. If Google had seen it any other way, I expect they would have been showing up in the logs before now.

    @Clouvider said:
    Wouldn't it be easier to just create a robots.txt and disallow what you don't want indexed?

    It's not about the indexing, it's about broadcasting the existence of the server to the world at large. I didn't expect that, and I don't want it. It's not that big a deal for the site in question this time; I'm just thinking about future sites that I want to secure without broadcasting the fact that I've got a new site for people to come mess with.

    Thanked by: geekalot
  • impossiblystupid said: Can't be 100% sure just yet, what with a sample set of 1. When I get the time, I'm going to create another completely new subdomain that isn't in use anywhere except in a SAN certificate. If the new site gets spidered after the cert is read for another site on the list, then it'll be pretty clear.

    Do let us know. There are people eternally interested in how to get Googlebot to come along quicker.

  • jar Patron Provider, Top Host, Veteran
    edited September 2016

    To each his own, I suppose. These days people are scanning IP ranges for web apps; I just don't think it's reasonable to expect to keep anything private while public, short of firewalling it off.

  • joepie91 Member, Patron Provider
    edited September 2016

    impossiblystupid said: it's about broadcasting the existence of the server to the world at large. I didn't expect that, and I don't want it.

    Right, you did that by exposing it in the TLS certificate. Any user could've discovered it the same way. The solution there is to not create infoleaks.

    EDIT: Why does this matter anyway? Security through obscurity isn't real security, and even if your server is public, that shouldn't be a problem.

  • "Information wants to be free"

  • hzr Member
    edited September 2016

    impossiblystupid said: This is interesting to know. It's not clear that my LE certificate was posted there, though, because a search there for my domain turns up zilch. I also didn't see anything on the LE site that indicates they purposely do such a thing. I get the desire for transparency for widely used public sites, but I really don't like the idea of every cert I issue directing traffic to that site. Sometimes I want secure traffic without broadcasting the existence of the server, and it'd be nice if I could do that with something other than a self-signed certificate.

    Almost all major CAs do this, especially any respectable ones. Any hostnames issued on a cert will be auto published on multiple sites within seconds to minutes for public audit purposes.

    If you want secure traffic, use your own CA and sign your own certs, and install your root on the devices you intend to use it on (as corporate networks do).
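
    A minimal sketch of that with Python's cryptography package (names and lifetimes are placeholders; a real private CA also needs key protection and revocation):

        from datetime import datetime, timedelta

        from cryptography import x509
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa
        from cryptography.x509.oid import NameOID

        def make_cert(subject, issuer, pub, signing_key, *, is_ca, days, san=None):
            builder = (
                x509.CertificateBuilder()
                .subject_name(subject)
                .issuer_name(issuer)
                .public_key(pub)
                .serial_number(x509.random_serial_number())
                .not_valid_before(datetime.utcnow())
                .not_valid_after(datetime.utcnow() + timedelta(days=days))
                .add_extension(x509.BasicConstraints(ca=is_ca, path_length=None),
                               critical=True)
            )
            if san:
                builder = builder.add_extension(
                    x509.SubjectAlternativeName([x509.DNSName(san)]), critical=False)
            return builder.sign(signing_key, hashes.SHA256())

        # Private root CA: it never leaves your machines, so it never hits a CT log.
        ca_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
        ca_name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "My Private CA")])
        ca_cert = make_cert(ca_name, ca_name, ca_key.public_key(), ca_key,
                            is_ca=True, days=3650)

        # Leaf cert for the "secret" host, signed by the private CA.
        leaf_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        leaf_name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME,
                                                  "secret.example.com")])
        leaf_cert = make_cert(leaf_name, ca_name, leaf_key.public_key(), ca_key,
                              is_ca=False, days=825, san="secret.example.com")

        # Install ca_cert in the trust store of each client device.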

    Thanked by: ricardo
  • impossiblystupid said: I also didn't see anything on the LE site that indicates they purposely do such a thing

    They have to. It's in the CA contracts.

  • @jarland said:
    To each his own, I suppose. These days people are scanning IP ranges for web apps; I just don't think it's reasonable to expect to keep anything private while public, short of firewalling it off.

    Well, so long as I don't broadcast the creation of a subdomain, it remains more-or-less part of the dark web; just having an IP only gets a nosy person the default site on my server, not any particular virtual host.
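
    (To illustrate with a sketch, using placeholder names and an IP: without the hostname a visitor gets the default site; supplying the name via SNI selects the virtual host.)

        # Bare IP: the server can only hand back its default/fallback site.
        curl -k https://203.0.113.7/

        # Same IP, but with the secret name supplied via SNI (no DNS involved).
        curl -k --resolve secret.example.com:443:203.0.113.7 https://secret.example.com/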

    @joepie91 said:
    EDIT: Why does this matter anyway? Security through obscurity isn't real security, and even if your server is public, that shouldn't be a problem.

    All online security is via obscurity. Your password or private key only keeps you safe because you don't show it to everyone. Your accounts are easier to keep safe if an attacker doesn't know your username or server name/IP, either.

    For example, I can freely give you a password to one of my online accounts: 778195d22. What good does that do you? Hell, I'll even tell you what server it's for: it's my Hulu account! Still, good luck making use of it without knowing my account name.

    Obscurity is fantastic security. From an infosec standpoint, all bits are fungible.
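
    To put rough numbers on that (a sketch; it assumes the attacker knows each field's alphabet and length):

        import math

        # Brute-force search space, in bits, for a secret drawn from a
        # known alphabet at a known length.
        def bits(alphabet_size, length):
            return length * math.log2(alphabet_size)

        print(bits(16, 9))    # the 9-hex-char password above: ~36 bits
        print(bits(36, 12))   # an unknown 12-char alphanumeric username: ~62 bits
        # Publishing the password burns ~36 bits, but the unknown account
        # name still leaves a bigger space to search than many passwords.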

    @hzr said:
    Almost all major CAs do this, especially any respectable ones. Any hostnames issued on a cert will be auto published on multiple sites within seconds to minutes for public audit purposes.

    Right you are. Turns out my search had a typo. And, for everyone's reference, the URL where Let's Encrypt discusses their transparency is:

    https://letsencrypt.org/certificates/

    If you want secure traffic, use your own CA and sign your own certs, and install your root on the devices you intend to use it on (as corporate networks do).

    I do, to the extent that I have that level of access. For clients I consult, or for Internet-exposed test sites, though, sometimes the best you're allowed is to tell them to accept the "insecure" cert.

  • trewq Administrator, Patron Provider

    @impossiblystupid I don't understand why you're so passionate about Google not discovering a site you host. It's literally what the business was first created for. If you don't want a site accessed then set up a firewall; it's as simple as that.

  • joepie91 Member, Patron Provider

    impossiblystupid said: All online security is via obscurity. Your password or private key only keeps you safe because you don't show it to everyone.

    You seem to have a habit of redefining things to fit your view on the world. No, passwords and private keys are explicitly exempt from security-through-obscurity.

    This is starting to get really tiring. Please at least make an effort at understanding the technologies and concepts you're talking about, rather than interpreting everything literally word-for-word.

    impossiblystupid said: Obscurity is fantastic security.

    No, it isn't.

    Thanked by: Clouvider, vimalware
  • @impossiblystupid said:

    Well, so long as I don't broadcast the creation of a subdomain, it remains more-or-less part of the dark web; just having an IP only gets a nosy person the default site on my server, not any particular virtual host.

    lmao. ANYONE can find it with a basic crawler/scanner, and anyone accessing your site will show it to everyone on the data path, publicly. No it is not

    @joepie91 said:
    EDIT: Why does this matter anyway? Security through obscurity isn't real security, and even if your server is public, that shouldn't be a problem.

    All online security is via obscurity. Your password or private key only keeps you safe because you don't show it to everyone. Your accounts are easier to keep safe if an attacker doesn't know your username or server name/IP, either.

    For example, I can freely give you a password to one of my online accounts: 778195d22. What good does that do you? Hell, I'll even tell you what server it's for: it's my Hulu account! Still, good luck making use of it without knowing my account name.

    Obscurity is fantastic security. From an infosec standpoint, all bits are fungible.

    You have GROSSLY confused physical security with security through obscurity. Security through obscurity is no security at all - it is leaving your front door unlocked and hoping nobody will walk down the alley it is in. Physical security is locking it and taking the key or code with you, and the physical protection of those keys.
    The two are NOT related. Hiding your key is not security through obscurity, it is physical security.

    Thanked by: k0nsl
  • @trewq said:
    I don't understand why you're so passionate about Google not discovering a site you host. It's literally what the business was first created for. If you don't want a site accessed then set up a firewall; it's as simple as that.

    I'm not particularly "passionate" about it, I just noted it as something I wasn't expecting to happen. And it's not about Google specifically, but just the general "surface area" of attack that gets larger as a result of these certificate transparency practices. And, as far as I know, there is no way to firewall off port 443 traffic based on virtual hosts. I can certainly deny access in the Apache config files, making it yet another thing I have to babysit because someone else has a hammer.
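
    (That babysitting looks something like this, sketched with placeholder names; Apache 2.4's Require directive turns everyone else away with a 403:)

        <VirtualHost *:443>
            ServerName secret.example.com
            # Only the test network gets in; everyone else is refused.
            <Location "/">
                Require ip 203.0.113.0/24
            </Location>
        </VirtualHost>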

    @joepie91 said:

    impossiblystupid said: All online security is via obscurity. Your password or private key only keeps you safe because you don't show it to everyone.

    You seem to have a habit of redefining things to fit your view on the world. No, passwords and private keys are explicitly exempt from security-through-obscurity.

    Why? It seems to be you who is conveniently redefining the meaning of things to allow you to be "exempt" from really thinking about the underlying issues. Bits are bits when it comes to cryptographic strength: the more you give away, the weaker your protection is.

    This is starting to get really tiring. Please at least make an effort at understanding the technologies and concepts you're talking about, rather than interpreting everything literally word-for-word.

    Please consider that my understanding might actually be greater than yours.

    Obscurity is fantastic security.

    No, it isn't.

    It is when you don't conveniently declare certain things exempt from the definition.

    @mycosys said:
    lmao. ANYONE can find it with a basic crawler/scanner, and anyone accessing your site will show it to everyone on the data path, publicly. No it is not

    How? Please explain or point to the software that will do what you claim.

    Obscurity is fantastic security. From an infosec standpoint, all bits are fungible.

    You have GROSSLY confused physical security with security through obscurity. Security through obscurity is no security at all - it is leaving your front door unlocked and hoping nobody will walk down the alley it is in.

    And you are secure if nobody walks down that alley. Just like a password on a sticky is secure from non-physical attacks. I mean, hell, I have Bitcoin paper wallets that are the equivalent of exactly that. I trust them to be more secure than any of my online wallets.

    Physical security is locking it and taking the key or code with you, and the physical protection of those keys.

    Only to the extent that those security measures have obscured attack vectors. Every junior level security wannabe I know can make a bump key.

    The two are NOT related. Hiding your key is not security through obscurity, it is physical security.

    Hardly. If someone knows the hiding place, the lock provides you with no security. If someone knows just the shape of the key, you have no security.

    It's all obscurity, people. Wake up to that fact and you'll approach security from a more useful perspective.

    You really don't get the difference between having to know a secret actively acquired beforehand, and being able to just stumble over it? FFS

    Please tell me you are trolling

  • impossiblystupid said:

    @mycosys said: lmao. ANYONE can find it with a basic crawler/scanner, and anyone accessing your site will show it to everyone on the data path, publicly. No it is not

    How? Please explain or point to the software that will do what you claim.

    Google's public DNS servers (8.8.8.8) keep logs. Their CDN can infer domains from Referer headers if you load jQuery or Google APIs from it.

    Passive listeners can capture DNS or HTTP requests. DNSSEC is also open to enumeration attacks - listing all hosts in a zone. Don't expect privacy for anything with a DNS entry.
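
    (For zones signed with plain NSEC, the chain can literally be walked to list every name, e.g. with the ldns-walk tool; NSEC3 hashes the names, but those hashes can still be brute-forced offline:)

        # Enumerate all names in a signed zone by following the NSEC chain.
        ldns-walk example.com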

    The only secure server is the one that you alone can access. Preferably switched off :)

    If you really want privacy, run your services on loopback and connect to them over a VPN or onion something something.
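
    (A sketch with placeholder names: bind the service to loopback, then reach it through an SSH tunnel:)

        # On the server: listen on 127.0.0.1 only, nothing exposed publicly.
        python3 -m http.server 8080 --bind 127.0.0.1

        # On the client: forward a local port, then browse http://localhost:8080/
        ssh -N -L 8080:127.0.0.1:8080 user@server.example.com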

  • @mycosys said:
    You really don't get the difference between having to know a secret actively acquired beforehand, and being able to just stumble over it? FFS

    Please tell me you are trolling

    I would rather the other people posting tell me they're actually thinking about security rather than parroting back some naive chatter they've heard. The deciding factor is not how active or random the attack is, but how likely either method is to break the particular form of obscurity that is used to maintain the security in question.

    Anybody coming into this thread has "stumbled" across my Hulu password. So what? What's the material difference between posting that alone vs. posting just my account name alone? If you don't have the right answer to that question, you don't really understand security.

    @rincewind said:

    impossiblystupid said:

    @mycosys said: lmao. ANYONE can find it with a basic crawler/scanner, and anyone accessing your site will show it to everyone on the data path, publicly. No it is not

    How? Please explain or point to the software that will do what you claim.

    Google's public DNS servers (8.8.8.8) keep logs. Their CDN can infer domains from Referer headers if you load jQuery or Google APIs from it.

    I don't use those things. Even if I did, they would be only in Google's logs, not broadcast to the public. That's my interest here: how anyone with my IP address can know what names resolve to it (sans an explicit rDNS entry).

    Passive listeners can capture DNS or HTTP requests.

    By what mechanism can this be done for an arbitrary server? Again, the issue is not how already insecure networks are leaking information, but how someone armed with just an IP address can do something like find all the virtual web hosts on it (or virtual email hosts, or any other similar service).

    DNSSEC is also open to enumeration attacks - listing all hosts in a zone. Don't expect privacy for anything with a DNS entry.

    Again, please reference the mechanisms that are in use to do this. I want to test and secure my servers as much as possible. The simple possibility of a vulnerability is not enough to worry me; I need to see the active exploit.

  • @ricardo said:

    impossiblystupid said: Can't be 100% sure just yet, what with a sample set of 1. When I get the time, I'm going to create another completely new subdomain that isn't in use anywhere except in a SAN certificate. If the new site gets spidered after the cert is read for another site on the list, then it'll be pretty clear.

    Do let us know. There are people eternally interested in how to get Googlebot to come along quicker.

    As a followup to this, I did create a random subdomain (a UUID) and included it in my main site certificate. I never even did so much as a DNS lookup on it, but Google did still find it, now some 2 weeks later. It's clear they aren't doing it based on the real-time certificate transparency logs (yet!), so maybe they pull it straight from the site based on their regular spidering schedule. They might be quicker to process the certificate of a more popular domain, and/or one that already has https links going to it.

    Thanked by: ricardo, geekalot
  • ricardo Member
    edited October 2016

    Have you checked Google to see if there are any links pointing to the subdomain?

    Perhaps the log files to your web server would make the test unequivocal, particularly so if nothing has visited.

  • @ricardo said:
    Have you checked Google to see if there are any links pointing to the subdomain?

    Nothing referencing the UUID at all.

    Perhaps the log files to your web server would make the test unequivocal, particularly so if nothing has visited.

    The server wasn't set up to do detailed logging. It was just a trial experiment to verify that cert info is indeed on their radar. If you want to do some more comprehensive data gathering on their behavior, you'll have to set up a server to your own specifications and let us know how it goes.

  • ricardo Member
    edited October 2016

    impossiblystupid said: you'll have to set up a server to your own specifications and let us know how it goes.

    Indeed, but I'm just trying to verify the statement you made, which is the entire point of the thread: that Googlebot is spidering new hostnames that Google finds in security certificates.

    The fact that you can't find the UUID in Google's index at least indicates that Google didn't find the domain via [something else that may have found your site first], or has at the very least chosen not to index it.

    The log files can verify these kinds of things more strongly.

  • @ricardo said:
    Indeed, but I'm just trying to verify the statement you made, which is the entire point of the thread: that Googlebot is spidering new hostnames that Google finds in security certificates.

    That's why I set up an otherwise "invisible" new server. It is verified by way of a second sample. But it is still just data from a single source; scientific replication needs to be more rigorous.

    The log files can verify these kinds of things more strongly.

    And that's why I say you, and anyone else who is still curious, need to run the test on your own servers to see for yourselves, and add to the data pool. Like I said, I wasn't setting it up to watch the watchmen, just to see if the watchmen were watching. They are; you're welcome to use even stronger tests, and it would be interesting to hear how/if their behavior changes when you want them to find your new server.

  • You mentioned in your OP that you discovered it the first time via log files. How did you know the second time?

  • @ricardo said:
    You mentioned in your OP that you discovered it the first time via log files. How did you know the second time?

    Log files, of course. As I've already said, I haven't done any special logging to get anyone any extra details about the process. I simply set up a completely fresh and new domain that nobody knew existed except for the process of giving it a certificate. And now it appears to be well known, to the point that I'm just going to delete it soon because I'm tired of seeing a long UUID in my logs.

    If you want further details, set up that logging to your specification. I think people would be much more interested in your angle of having a new domain you want to be found by Google.

  • I'm not going to set anything up; I'm just trying to establish that you know what you're doing, since it's vague on details. Thanks for replying.
