Hello all,

Let's say you have a web site that uses some resources in special folders.

PHP code accesses those resources, but they are not linked directly from any of the pages...

Of course, if I use the robots.txt file to tell search engines not to crawl those folders, I'm making the private folder names/paths public.
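To illustrate the concern: robots.txt is a plain-text file anyone can fetch at the site root, so every Disallow line advertises exactly the path it is meant to hide. The folder names below are hypothetical:

```
# robots.txt — publicly readable by anyone at https://example.com/robots.txt
# Each Disallow line reveals the very path you wanted to keep private:
User-agent: *
Disallow: /private-resources/   # hypothetical folder name
Disallow: /internal-scripts/    # hypothetical folder name
```

Note that robots.txt is only a request to well-behaved crawlers; it provides no access control at all.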

Should I simply leave those pages out of the robots file?

Now that I'm writing this, I'm starting to think that listing them there makes it much easier for everyone to find the weak points of the web site...

but... how do you ensure those pages won't be crawled?

The underlying question is:

- How are sites crawled? Do crawlers only follow the links that appear on the pages themselves, without reading the real folder structure or the PHP (or other server-side) code, and seeing only the final rendered page?

- And if a page is not linked anywhere, will it be crawled even if it is not in robots.txt?

Thank you very much!

What I have tried:

Just reading the help from the Google webmaster pages...
Comments
Bernhard Hiller 8-May-17 5:35am    
Note that the Wayback Machine announced a few days ago that it will no longer observe robots.txt and will crawl everything...
Joan M 8-May-17 11:48am    
But how? Can they read the folder contents, or do they simply start reading the web page documents, and if a file inside a "hidden" folder is found, crawl it too?
Richard Deeming 8-May-17 12:38pm    
It depends. If you're never going to serve up a file to the client from one of those folders, then you should probably prevent them from being accessed at all. If you're on Apache, you'd use the .htaccess file to do that.
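As a sketch of that suggestion (the folder name is hypothetical): on Apache 2.4+, a .htaccess file placed inside the protected folder can deny all direct HTTP access, while PHP running on the server can still read the files through the filesystem:

```
# .htaccess inside e.g. /private-resources/ (hypothetical folder, Apache 2.4+).
# Blocks every direct HTTP request to this folder with a 403 Forbidden;
# server-side PHP can still open these files via the filesystem.
Require all denied
```

On Apache 2.2 the equivalent directives are `Order deny,allow` followed by `Deny from all`. Unlike robots.txt, this is real access control, so there is nothing left for a crawler to fetch and no public file listing the path.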

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
