## Table of Contents

- [[#How Google Works|How Google Works]]
    - [[#How Google Works#Web Crawling|Web Crawling]]
    - [[#How Google Works#Indexing|Indexing]]
    - [[#How Google Works#Search Algorithms|Search Algorithms]]
- [[#Boolean Operators and Other Operators|Boolean Operators and Other Operators]]
- [[#Search Operators|Search Operators]]
- [[#How to Protect Against Dorking|How to Protect Against Dorking]]
    - [[#How to Protect Against Dorking#robots.txt|robots.txt]]
    - [[#How to Protect Against Dorking#noindex / nofollow|noindex / nofollow]]
    - [[#How to Protect Against Dorking#Auditing Your Exposure|Auditing Your Exposure]]
    - [[#How to Protect Against Dorking#Vulnerability Scanning|Vulnerability Scanning]]
    - [[#How to Protect Against Dorking#File/Directory Permissions|File/Directory Permissions]]
- [[#Examples Queries|Examples Queries]]
    - [[#Examples Queries#Person Searches|Person Searches]]
    - [[#Examples Queries#Documents Searching|Documents Searching]]
    - [[#Examples Queries#Exposed Assets|Exposed Assets]]
    - [[#Examples Queries#API Assets|API Assets]]
    - [[#Examples Queries#Cloud Storage Bucket Discovery|Cloud Storage Bucket Discovery]]

---

## What is Google Dorking?

- Google Dorking is the practice of using advanced Google search operators to find information that’s publicly accessible but often unintentionally exposed.
    - E.g.:
        - Backup files
        - Login portals
        - Config files
        - Sensitive documents
- It is widely used in OSINT and recon because Google’s search index often sees more than most people would expect.

## How Google Works

### Web Crawling

- Googlebot is constantly scanning the internet by:
    - Following all available links
    - Reading the contents of pages
    - Further exploring down directories
- If a file or folder is publicly reachable, even without an obvious link from your own pages, there’s a good chance that Google's web crawler will find it (for example through an open directory listing, a sitemap, or a link on another site).

### Indexing

- Once a page is found, Google analyses the content and stores it in its index.
- This includes:
    - Text
    - Metadata
    - File Contents
        - i.e. PDFs, spreadsheets, etc.
    - Titles
    - Directory Structures
- This indexed content then becomes searchable, even if it wasn’t initially meant to be.

### Search Algorithms

- Google ranks search results based on relevance and quality. Google Dorking does not replace that ranking; instead, search operators like `filetype:`, `site:`, `intitle:`, or `inurl:` filter the index by where and how a term appears, which surfaces pages that an ordinary keyword search would bury.

---

## Boolean Operators and Other Operators

| Operator | Description                                                                     | Example                                 |
| -------- | ------------------------------------------------------------------------------- | --------------------------------------- |
| `AND`    | Requires both terms to exist in results. Google usually applies AND implicitly. | `site:example.com` `admin AND password` |
| `OR`     | Returns results containing either term.                                         | `"login" OR "signin"`                   |
| `\|`     | A shorthand for OR.                                                             | `"pdf" \| "docx"`                       |
| `""`     | Searches for the exact phrase.                                                  | `"internal use only"`                   |
| `()`     | Groups expressions for complex queries.                                         | `(password OR passcode) "admin"`        |
| `-`      | Excludes specific words or terms.                                               | `intitle:index -php`                    |
| `*`      | Wildcard placeholder for any word or phrase.                                    | `"confidential * report"`               |

**Notes:**

- `AND` is applied by default between search terms.
    - i.e. these two queries produce the same results:
        - `intext:"confidential"` `AND` `site:example.com`
        - `intext:"confidential"` `site:example.com`
- `OR` and `|` can be used interchangeably.
- Google does not officially support `*` wildcards in `site:`.
    - To surface subdomains other than `www`, use `site:example.com` `-site:www.example.com` instead.

## Search Operators

| Operator       | Description                                                                  | Example                                             |
| -------------- | ---------------------------------------------------------------------------- | --------------------------------------------------- |
| `after:`       | Returns results published after a specific date in the format `YYYY-MM-DD`.  | `site:gov.au` `after:2023-01-01` `"cyber security"` |
| `allintext:`   | Searches for pages where all of the specified words appear in the body text. | `allintext:password username login`                 |
| `allintitle:`  | Searches for pages where all specified words appear in the page title.       | `allintitle:index of backup`                        |
| `before:`      | Returns results published before a specific date.                            | `breach database` `before:2020-01-01`               |
| `cache:`       | Shows the Google-cached version of a webpage.                                | `cache:example.com`                                 |
| `filetype:`    | Restricts results to a specific file format (e.g., pdf, xls, docx).          | `site:edu.au` `filetype:pdf` `"exam answers"`       |
| `info:`        | Displays Google’s information page about a URL (cache, similar sites, etc.). | `info:example.com`                                  |
| `inposttitle:` | Searches for words appearing in blog post titles.                            | `inposttitle:"guest post"`                          |
| `intext:`      | Searches for results where the term is in the body text.                     | `intext:"confidential"` `site:example.com`          |
| `intitle:`     | Searches for pages with a specific word in the title.                        | `intitle:"index of"`                                |
| `inurl:`       | Searches for pages with the term in the URL.                                 | `inurl:admin login`                                 |
| `link:`        | Shows pages that link to a specific URL.                                     | `link:example.com`                                  |
| `related:`     | Shows sites similar to the provided URL.                                     | `related:github.com`                                |
| `site:`        | Restricts results to a specific domain or TLD.                               | `site:gov.au "confidential"`                        |

**Notes:**

- `cache:`
    - May be useful for viewing a page that is currently down, or for seeing the previous version of a page that has recently been updated and not yet re-indexed.
- Google has deprecated or retired some of these operators over time (notably `cache:`, `info:`, and `link:`), so they may return limited or no results.

## How to Protect Against Dorking

### robots.txt

- This is a file that tells legitimate web crawlers which paths not to crawl.
- Useful for search engine hygiene, but not really for security.
- It controls crawling rather than indexing: a disallowed URL can still show up in results if other pages link to it.
- Attackers can just ignore it, and by listing sensitive paths, you can actually highlight what you are trying to hide.

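
To make the last point concrete, here is a minimal Python sketch of how little effort it takes to turn a `robots.txt` file into a list of "interesting" paths. The sample file contents and the `disallowed_paths` helper are hypothetical and exist only for illustration; the parsing is deliberately simplified and ignores `User-agent` grouping.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty path named in a Disallow rule."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # An empty "Disallow:" means "nothing is disallowed", so skip it.
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backups/   # nightly dumps
Disallow:
Allow: /public/
"""

print(disallowed_paths(SAMPLE))  # ['/admin/', '/backups/']
```

Anyone can fetch the file and run the same kind of extraction, which is why sensitive paths need real access controls rather than a `Disallow` entry.
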
### noindex / nofollow

- `noindex` and `nofollow` are directives set in a robots `<meta>` tag or an `X-Robots-Tag` HTTP header.
    - `noindex` tells search engines not to index a page.
    - `nofollow` tells them not to follow the links on that page.
- Can be good for staging, non-production environments, or non-public pages, but they are not an appropriate substitute for actual access controls.

### Auditing Your Exposure

- Regularly dork your own domain.
- You can also check what Google has indexed through Google Search Console.
- Look for:
    - Exposed Directories
    - Staging Sites
    - Cloud Buckets
    - Forgotten Assets

### Vulnerability Scanning

- Example tools for web vulnerability scanning:
    - Nikto
    - Zed Attack Proxy (ZAP)
    - Burp Suite
    - Gobuster
- Can help detect files and directories that Google might index without your knowledge.
- Useful for identifying:
    - Exposed admin panels
    - Misconfigurations
    - Forgotten endpoints

### File/Directory Permissions

- Use proper user access controls.
- Disable directory listing if possible.
- Make sure to keep backups outside the web root folder.
- Restrict all sensitive directories with authentication.
- Ensure that temp files or logs aren’t world-readable (i.e. readable by unprivileged users).

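
The world-readable check in the last bullet can be automated. The following is a rough Python sketch, assuming a POSIX system where the "others" read bit is meaningful (it says little on Windows). The function name `world_readable_files` and the example path in the usage note are hypothetical.

```python
import os
import stat
from pathlib import Path


def world_readable_files(root: str) -> list[Path]:
    """List regular files under root whose mode grants read access to 'others'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat() inspects the entry itself and does not follow symlinks.
                mode = path.lstat().st_mode
            except OSError:
                continue  # file vanished or cannot be inspected; skip it
            if stat.S_ISREG(mode) and mode & stat.S_IROTH:
                found.append(path)
    return sorted(found)
```

A call such as `world_readable_files("/var/www/html")` (an assumed web root) returns the files worth reviewing. The sketch only inspects permission bits; it does not account for ACLs or for what the web server itself chooses to serve.
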
---

## Examples Queries

### Person Searches

```sh
# LinkedIn accounts for a name
"John Doe" site:linkedin.com

# Contact information
"John Doe" "contact information"
intext:"John Doe" intext:"phone number"
```

### Documents Searching

```sh
# PDFs on a domain
site:domain.tld filetype:pdf

# Sensitive files
intitle:"Index of" ".env" site:example.com
"parent directory" ".log" site:example.com
```

### Exposed Assets

```sh
# Remote access gateways
inurl:/remote/login/ intitle:"RDP"

# Login pages
inurl:login site:example.com
inurl:signin site:example.com
```

### API Assets

```sh
# Endpoints
site:example.com inurl:api

# Schemas
site:example.com inurl:schema
```

### Cloud Storage Bucket Discovery

```bash
# AWS Buckets
site:.s3.amazonaws.com "<target>"

# Cloudflare R2 Buckets
(site:.r2.cloudflarestorage.com OR site:.r2.dev) "<target>"

# Azure Blob Storage
site:blob.core.windows.net "<target>"

# Firebase Storage Buckets
(site:.firebasestorage.app OR site:.appspot.com OR site:.firebaseio.com) "<target>"
```

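
When auditing your own domain, it can be convenient to generate a batch of these queries rather than typing each one. The Python sketch below fills a few of the example queries from this note with a domain and builds a Google search URL for each. The template list and the helper names `build_queries` and `search_url` are illustrative assumptions, not a fixed toolkit, and the URLs are meant to be opened manually in a browser.

```python
from urllib.parse import urlencode

# Query templates taken from the example queries in this note.
# "{domain}" is a placeholder filled in by build_queries().
TEMPLATES = [
    'site:{domain} filetype:pdf',
    'site:{domain} intitle:"Index of" ".env"',
    'site:{domain} inurl:login',
    'site:{domain} inurl:api',
]


def build_queries(domain: str) -> list[str]:
    """Fill every template with the given domain."""
    return [template.format(domain=domain) for template in TEMPLATES]


def search_url(query: str) -> str:
    """Percent-encode a query into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


for query in build_queries("example.com"):
    print(query)
    print("  " + search_url(query))
```

Keeping the templates in a plain list makes it easy to add or remove checks as your own exposure review changes.
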