## Table of Contents
- [[#How Google Works|How Google Works]]
    - [[#How Google Works#Web Crawling|Web Crawling]]
    - [[#How Google Works#Indexing|Indexing]]
    - [[#How Google Works#Search Algorithms|Search Algorithms]]
- [[#Boolean Operators and Other Operators|Boolean Operators and Other Operators]]
- [[#Search Operators|Search Operators]]
- [[#How to Protect Against Dorking|How to Protect Against Dorking]]
    - [[#How to Protect Against Dorking#robots.txt|robots.txt]]
    - [[#How to Protect Against Dorking#noindex / nofollow|noindex / nofollow]]
    - [[#How to Protect Against Dorking#Auditing Your Exposure|Auditing Your Exposure]]
    - [[#How to Protect Against Dorking#Vulnerability Scanning|Vulnerability Scanning]]
    - [[#How to Protect Against Dorking#File/Directory Permissions|File/Directory Permissions]]
- [[#Examples Queries|Examples Queries]]
    - [[#Examples Queries#Person Searches|Person Searches]]
    - [[#Examples Queries#Documents Searching|Documents Searching]]
    - [[#Examples Queries#Exposed Assets|Exposed Assets]]
    - [[#Examples Queries#API Assets|API Assets]]
    - [[#Examples Queries#Cloud Storage Bucket Discovery|Cloud Storage Bucket Discovery]]
---
## What is Google Dorking?
- Google Dorking is the practice of using advanced Google search operators to find information that’s publicly accessible but often unintentionally exposed, e.g.:
    - Backup files
    - Login portals
    - Config files
    - Sensitive documents
- It is a widely used technique in OSINT and reconnaissance because Google’s search index often contains more than site owners expect.
## How Google Works
### Web Crawling
- Googlebot constantly scans the internet by:
    - Following all available links
    - Reading the contents of pages
    - Exploring further down directories
- If a file or folder is publicly reachable, Googlebot may still find it even when your own site never links to it, for example through a directory listing, a sitemap, or a link on another site.
### Indexing
- Once a page is found, Google analyses its content and stores it in its index.
- This includes:
    - Text
    - Metadata
    - File Contents (e.g. PDFs, spreadsheets)
    - Titles
    - Directory Structures
- This indexed content then becomes searchable, even if it was never meant to be.
### Search Algorithms
- Google ranks search results by relevance and quality. Google dorking does not bypass that ranking; search operators like `filetype:`, `site:`, `intitle:`, or `inurl:` filter the index so narrowly that pages which would never rank for an ordinary query come to the top.
---
## Boolean Operators and Other Operators
| Operator | Description | Example |
| -------- | ------------------------------------------------------------------------------- | --------------------------------------- |
| `AND` | Requires both terms to exist in results. Google usually applies AND implicitly. | `site:example.com` `admin AND password` |
| `OR` | Returns results containing either term. | `"login" OR "signin"` |
| `\|` | A shorthand for OR. | `"pdf" \| "docx"` |
| `""` | Searches for the exact phrase. | `"internal use only"` |
| `()` | Groups expressions for complex queries. | `(password OR passcode) "admin"` |
| `-` | Excludes specific words or terms. | `intitle:index -php` |
| `*` | Wildcard placeholder for any word or phrase. | `"confidential * report"` |
**Notes:**
- `AND` is applied by default between search terms, so these two queries return the same results:
    - `intext:"confidential"` `AND` `site:example.com`
    - `intext:"confidential"` `site:example.com`
- `OR` and `|` can be used interchangeably.
- Google does not officially support `*` wildcards in `site:`.
    - To list subdomains other than `www`, use `site:example.com` `-site:www.example.com` instead.
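
Because the operators above combine by plain concatenation, a query can be assembled programmatically. The following is a minimal sketch, assuming Python; `build_query` and `search_url` are hypothetical helper names, and the `https://www.google.com/search?q=` form is the ordinary browser search URL, not an official API.

```python
from urllib.parse import urlencode


def build_query(*terms, any_of=(), exclude=()):
    """Join terms with implicit AND, then add an OR group and exclusions."""
    parts = list(terms)
    if any_of:
        # Parentheses group the alternatives into one expression.
        parts.append("(" + " OR ".join(any_of) + ")")
    # A leading minus excludes a term or an operator expression.
    parts.extend("-" + term for term in exclude)
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so it can be pasted into a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query(
    "site:example.com",
    '"admin"',
    any_of=("password", "passcode"),
    exclude=("site:www.example.com",),
)
print(query)
print(search_url(query))
```

The first line printed is `site:example.com "admin" (password OR passcode) -site:www.example.com`, which mirrors the grouping and exclusion examples in the table.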
## Search Operators
| Operator | Description | Example |
| ------------- | ---------------------------------------------------------------------------- | --------------------------------------------------- |
| `after:` | Returns results published after a specific date in the format `YYYY-MM-DD`. | `site:gov.au` `after:2023-01-01` `"cyber security"` |
| `allintext:` | Searches for pages where all of the specified words appear in the body text. | `allintext:password username login` |
| `allintitle:` | Searches for pages where all specified words appear in the page title. | `allintitle:index of backup` |
| `before:` | Returns results published before a specific date. | `breach database` `before:2020-01-01` |
| `cache:` | Shows the Google-cached version of a webpage. | `cache:example.com` |
| `filetype:` | Restricts results to a specific file format (e.g., pdf, xls, docx). | `site:edu.au` `filetype:pdf` `"exam answers"` |
| `info:` | Displays Google’s information page about a URL (cache, similar sites, etc.). | `info:example.com` |
| `inposttitle:` | Searches for words appearing in blog post titles. | `inposttitle:"guest post"` |
| `intext:` | Searches for results where the term is in the body text. | `intext:"confidential"` `site:example.com` |
| `intitle:` | Searches for pages with a specific word in the title. | `intitle:"index of"` |
| `inurl:` | Searches for pages with the term in the URL. | `inurl:admin login` |
| `link:` | Shows pages that link to a specific URL. | `link:example.com` |
| `related:` | Shows sites similar to the provided URL. | `related:github.com` |
| `site:` | Restricts results to a specific domain or TLD. | `site:gov.au "confidential"` |
**Notes:**
- `cache:` - May be useful for viewing a page that is currently down, or a version of a page from before a recent update that has not yet been re-indexed. Google retired its cached pages in 2024, so this operator may no longer return results.
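
The `after:` and `before:` operators expect `YYYY-MM-DD` dates, which is the format `datetime.date.isoformat()` produces. A small sketch, assuming Python; `dated_query` is a hypothetical helper name, not part of any Google tooling.

```python
from datetime import date


def dated_query(base, after=None, before=None):
    """Append after:/before: filters in the YYYY-MM-DD format."""
    parts = [base]
    if after is not None:
        parts.append("after:" + after.isoformat())
    if before is not None:
        parts.append("before:" + before.isoformat())
    return " ".join(parts)


print(dated_query('site:gov.au "cyber security"', after=date(2023, 1, 1)))
print(dated_query("breach database", before=date(2020, 1, 1)))
```

Passing both arguments bounds the results to a date window.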
## How to Protect Against Dorking
### robots.txt
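- A file that tells well-behaved web crawlers which paths not to crawl. It does not guarantee exclusion from the index: a blocked URL can still be indexed if other pages link to it.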
- Useful for search engine hygiene, but not really for security.
- Attackers can just ignore it, and by listing sensitive paths, you can actually highlight what you are trying to hide.
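
To illustrate both halves of that point, the sketch below uses Python's standard `urllib.robotparser` to show how a compliant crawler honours the file, and then how trivially the same file can be read as a list of interesting paths. The robots.txt content and its paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /backups/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching and skips disallowed paths.
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))

# The file is public, so its Disallow lines double as a list of paths
# the owner would rather keep quiet.
hinted = [
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("disallow:")
]
print(hinted)
```

Nothing in this mechanism stops a client that never consults the file, which is why access controls are still needed.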
### noindex / nofollow
- `noindex` and `nofollow` are directives set in a `<meta name="robots">` tag or an `X-Robots-Tag` HTTP header. `noindex` tells Google to keep the page out of its index; `nofollow` tells it not to follow the links on the page.
- Good for staging, other non-production environments, or non-public pages, but not a substitute for real access controls.
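
One way to confirm that a page actually carries these directives is to inspect both the response headers and the HTML. The sketch below assumes Python and works on data you have already fetched; `RobotsMetaParser` and `robots_directives` are hypothetical names, and the parsing is simplified (it ignores user-agent-specific forms such as `googlebot: noindex`).

```python
from html.parser import HTMLParser


def _split(value):
    """Split a comma-separated directive list into lower-case tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives |= _split(attr.get("content") or "")


def robots_directives(headers, html):
    """Merge directives from the X-Robots-Tag header and the meta tag."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= _split(value)
    parser = RobotsMetaParser()
    parser.feed(html)
    return found | parser.directives


# Sample response data, invented for the example.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
found = robots_directives({"X-Robots-Tag": "noarchive"}, page)
print(sorted(found))
print("noindex" in found)
```

A staging page that is meant to stay out of search results should report `noindex` here, in addition to being protected by authentication.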
### Auditing Your Exposure
- Regularly dork your own domain.
- Google Search Console can also show which of your pages Google has indexed.
- Look for:
    - Exposed Directories
    - Staging Sites
    - Cloud Buckets
    - Forgotten Assets
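
A self-audit can start from a fixed checklist of queries scoped to your own domain. The sketch below assumes Python; `audit_queries` is a hypothetical helper, and the specific keywords (`staging`, `dev`, `backup`, and so on) are illustrative starting points, not an exhaustive or authoritative list.

```python
def audit_queries(domain):
    """Return dorks that look for common accidental exposures on a domain."""
    scope = "site:" + domain
    return {
        "exposed directories": scope + ' intitle:"index of"',
        "staging sites": scope + " (inurl:staging OR inurl:dev OR inurl:test)",
        "cloud buckets": 'site:s3.amazonaws.com "' + domain + '"',
        "forgotten assets": scope + " (inurl:backup OR inurl:old)",
        "non-www subdomains": scope + " -site:www." + domain,
    }


for label, query in audit_queries("example.com").items():
    print(label + ": " + query)
```

Running the same checklist on a schedule makes it easier to notice when something new appears in the index.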
### Vulnerability Scanning
- Example tools for web vulnerability scanning and content discovery:
    - Nikto
    - Zed Attack Proxy (ZAP)
    - Burp Suite
    - Gobuster
- These can help detect files and directories that Google might index without your knowledge.
- Useful for identifying:
    - exposed admin panels
    - misconfigurations
    - forgotten endpoints
### File/Directory Permissions
- Use proper user access controls.
- Disable directory listing if possible.
- Make sure to keep backups outside the web root folder.
- Restrict all sensitive directories with authentication.
- Ensure that temp files and logs aren’t world-readable, i.e. readable by any unprivileged user on the system.
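
The world-readable check in the last point can be automated. This is a minimal sketch, assuming Python on a POSIX system; `world_readable` is a hypothetical helper and `/var/www/html` is an assumed web root that you would replace with your own.

```python
import os
import stat


def world_readable(root):
    """Yield paths of regular files under root that any user can read."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            # S_IROTH is the "other users may read" permission bit.
            if stat.S_ISREG(mode) and mode & stat.S_IROTH:
                yield path


if __name__ == "__main__":
    # Assumed web root; substitute the directory you want to audit.
    for path in world_readable("/var/www/html"):
        print(path)
```

Files served to the public legitimately need to be readable by the web server, so the output is a review list, not a list of confirmed problems; logs, backups, and temp files are the entries to look at first.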
---
## Examples Queries
### Person Searches
```sh
# LinkedIn accounts for a name
intext:"John Doe" site:linkedin.com
# Contact information
"John Doe" "contact information"
intext:"John Doe" intext:"phone number"
```
### Documents Searching
```sh
# PDFs on a domain
site:domain.tld filetype:pdf
# Sensitive Files
intitle:"Index of" ".env" site:example.com
"parent directory" ".log" site:example.com
```
### Exposed Assets
```sh
# Remote access gateways
inurl:/remote/login/ intitle:"RDP"
# Login Pages
inurl:login site:example.com
inurl:signin site:example.com
```
### API Assets
```sh
# Endpoints
site:example.com inurl:api
# Schemas
site:example.com inurl:schema
```
### Cloud Storage Bucket Discovery
```bash
# AWS Buckets
site:.s3.amazonaws.com "<target>"
# Cloudflare R2 Buckets
(site:.r2.cloudflarestorage.com OR site:.r2.dev) "<target>"
# Azure Blob Storage
site:blob.core.windows.net "<target>"
# Firebase Storage Buckets
(site:.firebasestorage.app OR site:.appspot.com OR site:.firebaseio.com) "<target>"
```