Reconnaissance is the most important phase of any web application penetration test. The more you understand about a target before you send a single malicious request, the more targeted and effective your testing will be. This guide covers a structured approach to web recon, from fully passive OSINT to active enumeration with dedicated tools.
The Recon Mindset
Reconnaissance has two modes:
Passive recon — Gather information without touching the target’s servers. Uses third-party sources, public databases, and search engines. Leaves no trace in the target’s logs. This is always appropriate, even before authorization is confirmed.
Active recon — Directly query or probe the target’s infrastructure. Generates traffic that appears in server logs and may trigger security alerts. Only conduct active recon within an authorized scope.
Passive Reconnaissance
Google Dorks
Google’s search operators surface content that was indexed but never meant to be public. These queries are completely passive: you’re querying Google, not the target.
Key operators:
site:target.com — Limit results to this domain
site:target.com filetype:pdf — Find indexed PDF files
site:target.com inurl:admin — URLs containing "admin"
site:target.com intitle:"index of" — Open directory listings
site:target.com ext:sql — SQL dump files
site:target.com "password" — Pages containing the word "password"
"@target.com" site:linkedin.com — Employee emails on LinkedIn
site:target.com -www — Subdomains (exclude www)
cache:target.com/admin - Google’s cached version (Google retired this operator in 2024; use the Wayback Machine for old copies)
Combine operators for more targeted results:
site:target.com filetype:log | filetype:sql | filetype:conf
site:target.com inurl:config | inurl:backup | inurl:env
The GHDB (Google Hacking Database) at exploit-db.com/google-hacking-database maintains thousands of tested dorks organized by category.
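When you run the same combinations against several domains, it can help to generate the query strings instead of typing them. The following is a minimal sketch; `build_dork` is a hypothetical helper name, and it only assembles the text of a query in the same shape as the combined examples above.

```python
def build_dork(domain, operator, values):
    """Combine a site: restriction with OR-ed alternatives of one operator."""
    if not values:
        return f"site:{domain}"
    alternatives = " | ".join(f"{operator}:{value}" for value in values)
    return f"site:{domain} {alternatives}"


print(build_dork("target.com", "filetype", ["log", "sql", "conf"]))
print(build_dork("target.com", "inurl", ["config", "backup", "env"]))
```

The output is meant to be pasted into the search box; nothing here sends a request.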
Shodan
Shodan is a search engine for internet-connected devices. It indexes banners, certificates, and service fingerprints from regular scans.
Useful Shodan queries:
hostname:target.com — All indexed hosts for the domain
org:"Target Company" — Hosts by organization name
ssl:"target.com" — Hosts with certs mentioning target.com
http.title:"Target App" — Find specific page titles
port:8080 hostname:target.com — Non-standard ports
vuln:CVE-2023-44487 hostname:target.com — Hosts with specific CVE
Shodan reveals:
- Forgotten test/staging servers
- Internal services accidentally exposed to the internet
- Outdated software versions on public-facing services
- Non-standard ports running sensitive services (Jenkins on 8080, etc.)
Use the Shodan CLI for scripting:
pip3 install shodan
shodan init YOUR_API_KEY
shodan search "hostname:target.com" --fields ip_str,port,transport,product
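Once the CLI output is saved to a file, a few lines of Python can turn it into a per-host port list. This sketch assumes the output is tab-separated with `ip_str` and `port` as the first two fields, matching the `--fields` order used above; `ports_by_host` is a hypothetical helper, not part of the Shodan library.

```python
from collections import defaultdict


def ports_by_host(lines, separator="\t"):
    """Group open ports by IP from saved `shodan search --fields ip_str,port,...` output."""
    hosts = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split(separator)
        # Skip blank or malformed rows rather than failing on them
        if len(fields) < 2 or not fields[1].strip().isdigit():
            continue
        hosts[fields[0].strip()].add(int(fields[1]))
    return {ip: sorted(ports) for ip, ports in hosts.items()}


sample = [
    "203.0.113.10\t443\ttcp\tnginx\t\n",
    "203.0.113.10\t8080\ttcp\tJenkins\t\n",
    "203.0.113.20\t22\ttcp\tOpenSSH\t\n",
]
print(ports_by_host(sample))
```

Hosts with several unusual ports open are usually the first ones worth a closer look.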
Wayback Machine
The Internet Archive’s Wayback Machine preserves historical snapshots of web pages. Useful for finding:
- Old, removed pages that might still be functional
- JavaScript files with API endpoints or keys
- Former employee pages with email patterns
- Previous content disclosures
# waybackurls — extract all URLs the Wayback Machine knows about
go install github.com/tomnomnom/waybackurls@latest
echo "target.com" | waybackurls | tee wayback_urls.txt
# Filter for interesting file types (with or without a query string)
grep -E '\.(json|xml|env|sql|log|bak|old|config|key|pem)(\?|$)' wayback_urls.txt
# Find JS files for endpoint discovery
grep -E '\.js(\?|$)' wayback_urls.txt | sort -u
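Archived URLs often carry query strings or fragments, which makes extension matching with a plain pattern brittle. As an alternative, here is a small sketch that parses each URL and checks only its path; the function name and the extension list are illustrative and simply mirror the file types filtered above.

```python
from urllib.parse import urlsplit

INTERESTING_EXTENSIONS = (".json", ".xml", ".env", ".sql", ".log", ".bak",
                          ".old", ".config", ".key", ".pem")


def interesting_urls(urls, extensions=INTERESTING_EXTENSIONS):
    """Return sorted unique URLs whose path ends with one of the extensions."""
    hits = set()
    for url in urls:
        url = url.strip()
        if not url:
            continue
        try:
            # Only the path is checked, so ?query and #fragment are ignored
            path = urlsplit(url).path.lower()
        except ValueError:
            continue  # archived data can contain malformed URLs; skip them
        if path.endswith(tuple(extensions)):
            hits.add(url)
    return sorted(hits)


with_query = ["https://target.com/backup.sql?v=1", "https://target.com/index.html"]
print(interesting_urls(with_query))
```

To use it on real data, pass the lines of wayback_urls.txt, for example `interesting_urls(open("wayback_urls.txt"))`.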
Certificate Transparency Logs
When a publicly trusted CA issues a TLS certificate, the certificate is recorded in public certificate transparency (CT) logs. Searching those logs reveals subdomains, including internal or forgotten ones, as long as a publicly trusted certificate was ever issued for them.
# crt.sh: free web interface with JSON output (%25 is the URL-encoded % wildcard)
curl -s "https://crt.sh/?q=%25.target.com&output=json" | jq -r '.[].name_value' | sort -u
# Same query parsed with Python, with wildcard entries dropped
curl -s "https://crt.sh/?q=%25.target.com&output=json" | \
python3 -c "import sys,json; [print(x['name_value']) for x in json.load(sys.stdin)]" | \
sort -u | grep -vF '*'
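A single `name_value` field can hold several names separated by newlines, and wildcard entries still tell you that a parent zone exists. The sketch below works on a saved copy of the JSON response and, unlike the pipeline above, keeps the parent of a wildcard entry instead of dropping it. `names_from_crtsh` is a hypothetical helper name.

```python
import json


def names_from_crtsh(raw_json, root):
    """Extract unique hostnames under `root` from saved crt.sh JSON output."""
    names = set()
    for entry in json.loads(raw_json):
        # name_value may contain several SAN entries separated by newlines
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # keep the parent zone of a wildcard entry
            # Keep only names inside the root domain (drops emails and lookalikes)
            if name == root or name.endswith("." + root):
                names.add(name)
    return sorted(names)


saved = json.dumps([{"name_value": "www.target.com\napi.target.com"},
                    {"name_value": "*.dev.target.com"}])
print(names_from_crtsh(saved, "target.com"))
```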
theHarvester
theHarvester aggregates OSINT from multiple sources to find emails, subdomains, IPs, and URLs:
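theHarvester -d target.com -b bing,certspotter,crtsh,duckduckgo -l 200
Depending on the release, its sources include search engines such as Bing and DuckDuckGo, GitHub, Shodan, and certificate transparency logs; older releases also queried Google and LinkedIn. Run theHarvester -h to see which sources your version supports. It is a good early step for gathering email patterns (which inform phishing test parameters).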
Subdomain Enumeration
Subfinder — Passive Subdomain Discovery
Subfinder queries passive DNS sources (VirusTotal, Shodan, crt.sh, etc.) without touching the target:
# Install
go install -v github.com/projectdiscovery/subfinder/v2/cmd/subfinder@latest
# Basic scan
subfinder -d target.com -o subdomains.txt
# With all sources and verbose output
subfinder -d target.com -all -v -o subdomains.txt
Configure API keys in ~/.config/subfinder/provider-config.yaml for significantly more results from VirusTotal, Shodan, SecurityTrails, and others.
Amass — Comprehensive Enumeration
Amass combines passive and active enumeration techniques and builds a detailed attack graph:
# Install
go install -v github.com/owasp-amass/amass/v4/...@master
# Passive enumeration only
amass enum -passive -d target.com -o amass_passive.txt
# Active + passive
amass enum -d target.com -o amass_full.txt
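# Visualize the attack surface graph (-o takes an output directory, not a file name)
# Note: this is Amass v3 syntax; v4 releases moved visualization to oam_viz in owasp-amass/oam-tools
amass viz -d3 -d target.com -o amass_viz/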
Amass is slower than Subfinder but more thorough, especially with API keys configured under ~/.config/amass/ (in v4, config.yaml references a separate datasources.yaml that holds the credentials).
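Because Subfinder and Amass overlap heavily, it is worth merging their results while remembering which tool reported each name. This is a minimal sketch that assumes each tool's output has been reduced to one hostname per line; `merge_with_provenance` is a hypothetical helper.

```python
def merge_with_provenance(results, root):
    """Map each in-scope hostname to the sorted list of tools that reported it.

    `results` maps a tool name to an iterable of hostnames (one per line).
    """
    merged = {}
    for tool, lines in results.items():
        for line in lines:
            name = line.strip().lower().rstrip(".")
            # Discard anything outside the root domain (CDN hosts, noise)
            if name == root or name.endswith("." + root):
                merged.setdefault(name, set()).add(tool)
    return {name: sorted(tools) for name, tools in sorted(merged.items())}


found = merge_with_provenance(
    {"subfinder": ["api.target.com", "www.target.com"],
     "amass": ["API.target.com.", "vpn.target.com", "cdn.example.net"]},
    "target.com",
)
for host, tools in found.items():
    print(host, ",".join(tools))
```

Names reported by only one source are good candidates for manual verification, since they are more likely to be stale.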
Resolving and Validating Subdomains
Raw subdomain lists contain many false positives. Resolve them to confirm which are live:
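# dnsx: fast DNS resolver
go install -v github.com/projectdiscovery/dnsx/cmd/dnsx@latest
# Hostnames only, so the output can feed httpx directly
dnsx -l subdomains.txt -silent -o live_subdomains.txt
# Optional: also record the resolved A records (-resp appends them to each line)
dnsx -l subdomains.txt -silent -a -resp -o resolved_records.txt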
Then check which resolved hosts have active web services:
# httpx — HTTP probing
go install -v github.com/projectdiscovery/httpx/cmd/httpx@latest
cat live_subdomains.txt | httpx -status-code -title -tech-detect -o live_web.txt
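If you add httpx's `-json` flag (not used in the command above), each result is written as one JSON object per line, which is easier to post-process than the text format. The sketch below reads such lines; the field names `url`, `status_code`, `title`, and `tech` are assumptions based on recent httpx releases and may differ in yours, so missing keys are tolerated.

```python
import json


def summarize_httpx(jsonl_lines):
    """Reduce httpx JSON-lines output to the fields used in a target profile."""
    rows = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore partial or non-JSON lines
        rows.append({
            "url": record.get("url", ""),
            "status": record.get("status_code"),  # assumed key name
            "title": record.get("title", ""),
            "tech": record.get("tech", []),       # assumed key name
        })
    return rows


line = '{"url": "https://admin.target.com", "status_code": 403, "title": "Forbidden"}'
print(summarize_httpx([line]))
```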
Active Reconnaissance
WhatWeb — Technology Fingerprinting
WhatWeb identifies web technologies, frameworks, and server software:
whatweb https://target.com
whatweb -a 3 https://target.com # Aggressive mode
Output identifies CMS (WordPress, Drupal), JavaScript frameworks (React, jQuery versions), server software (nginx 1.18.0, Apache 2.4.41), and security headers. Version information feeds directly into vulnerability searching.
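To feed version information into vulnerability searching, the banner first has to be split into a product and a version that can be compared numerically. Here is a minimal sketch of that step; the regular expression covers simple banners such as `nginx/1.18.0` and is an illustrative assumption, not how WhatWeb itself parses output.

```python
import re

# product name, then an optional "/" or space followed by a dotted version
BANNER = re.compile(r"^(?P<product>[A-Za-z][\w-]*)(?:[/ ](?P<version>\d+(?:\.\d+)*))?")


def parse_banner(banner):
    """Return (product, version) from a banner; version is None if absent."""
    match = BANNER.match(banner.strip())
    if match is None:
        return None
    return match.group("product").lower(), match.group("version")


def version_tuple(version):
    """Turn '1.18.0' into (1, 18, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


product, version = parse_banner("Apache/2.4.41 (Ubuntu)")
print(product, version, version_tuple(version) < version_tuple("2.4.54"))
```

Comparing tuples rather than strings matters: as strings, "1.9.0" sorts after "1.18.0", which would give the wrong answer.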
Nikto — Web Vulnerability Scanner
Nikto scans for common misconfigurations and known vulnerabilities:
nikto -h https://target.com
nikto -h https://target.com -o nikto_results.html -Format html
nikto -h https://target.com -ssl -p 443
Nikto checks for:
- Missing security headers (X-Frame-Options, Content-Security-Policy, HSTS)
- Default files and directories
- Outdated software versions
- Common file exposures (.git, .svn, robots.txt disclosures)
- Known CVEs for identified software
Nikto is not stealthy — it generates significant log noise. Only run it during authorized active testing.
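The security header check is simple enough to reproduce on responses you have already captured, without rerunning a scanner. The sketch below takes a dictionary of response headers and reports which of the three headers listed above are absent. It is an illustration of the idea, not Nikto's implementation.

```python
EXPECTED_HEADERS = (
    "X-Frame-Options",
    "Content-Security-Policy",
    "Strict-Transport-Security",  # HSTS
)


def missing_security_headers(response_headers, expected=EXPECTED_HEADERS):
    """Return the expected headers that are absent (names compared case-insensitively)."""
    present = {name.lower() for name in response_headers}
    return [header for header in expected if header.lower() not in present]


captured = {"Server": "nginx", "x-frame-options": "DENY"}
print(missing_security_headers(captured))
```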
Directory Enumeration
# Gobuster
gobuster dir -u https://target.com -w /usr/share/seclists/Discovery/Web-Content/common.txt -o dirs.txt
# FFUF with auto-calibration
ffuf -u https://target.com/FUZZ -w /usr/share/seclists/Discovery/Web-Content/directory-list-2.3-medium.txt -ac -o ffuf_results.json -of json
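The JSON file written by the FFUF command can be summarized by status code, which makes redirects and 403 responses easy to spot. This sketch assumes the report has a top-level `results` list whose entries carry `url` and `status`, which matches FFUF's JSON output as far as I know; `group_by_status` is a hypothetical helper.

```python
import json
from collections import defaultdict


def group_by_status(ffuf_json_text):
    """Group discovered URLs by HTTP status code from an ffuf JSON report."""
    report = json.loads(ffuf_json_text)
    grouped = defaultdict(list)
    for result in report.get("results", []):
        grouped[result.get("status")].append(result.get("url"))
    return {status: sorted(urls) for status, urls in grouped.items()}


example = json.dumps({"results": [
    {"url": "https://target.com/admin", "status": 301},
    {"url": "https://target.com/backup", "status": 403},
]})
print(group_by_status(example))
```

For a real run, read the report with `open("ffuf_results.json").read()` and pass the text in.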
Building a Target Profile
After gathering all recon data, synthesize it into a structured profile:
Target Profile Template
# Target: target.com
## Scope
- IP ranges: 203.0.113.0/24
- Domains: target.com, *.target.com
## Subdomains (live)
- www.target.com — Main site, WordPress 6.4, nginx
- admin.target.com — Admin portal, Apache 2.4.54, PHP 8.1
- api.target.com — REST API, returns JSON
- staging.target.com — Staging environment, same stack
- dev.target.com — Development server, self-signed cert
## Technology Stack
- Frontend: React 18, jQuery 3.6
- Backend: PHP 8.1, Laravel
- Database: MySQL (inferred from error messages)
- Server: nginx/1.22.1
- CDN: Cloudflare
## Interesting Findings
- /.git/ directory exposed on dev.target.com
- /backup/ returns 403 on admin.target.com
- robots.txt disallows /api/internal/ and /admin/
- Old login page at /login-old.php (found via Wayback Machine)
- SSL cert issued 2024-03-15, SAN includes internal.target.com
## Email Pattern
- firstname.lastname@target.com (confirmed via LinkedIn + theHarvester)
- Employees identified: 12 via LinkedIn scraping
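If you maintain profiles for several engagements, generating the template from structured data keeps them consistent. The following is a small sketch covering three of the sections above; the `TargetProfile` class and its field names are hypothetical and can be extended with the remaining sections in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class TargetProfile:
    domain: str
    scope: list = field(default_factory=list)
    subdomains: list = field(default_factory=list)
    findings: list = field(default_factory=list)

    def render(self):
        """Render the profile using the same headings as the template."""
        sections = [
            ("Scope", self.scope),
            ("Subdomains (live)", self.subdomains),
            ("Interesting Findings", self.findings),
        ]
        lines = [f"# Target: {self.domain}"]
        for title, items in sections:
            if not items:
                continue  # omit empty sections
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


profile = TargetProfile(
    "target.com",
    scope=["Domains: target.com, *.target.com"],
    findings=["/.git/ directory exposed on dev.target.com"],
)
print(profile.render())
```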
| Tool | Type | Purpose |
|---|---|---|
| Google Dorks | Passive | File discovery, subdomain clues |
| Shodan | Passive | Exposed services and ports |
| Wayback Machine / waybackurls | Passive | Historical URLs and JS files |
| crt.sh | Passive | Certificate transparency subdomains |
| theHarvester | Passive | Emails, subdomains, IPs |
| Subfinder | Passive | Subdomain enumeration |
| Amass | Passive + Active | Comprehensive subdomain graph |
| WhatWeb | Active | Technology fingerprinting |
| Nikto | Active | Vulnerability scanning |
| Gobuster / FFUF | Active | Directory and endpoint brute-force |
| httpx | Active | HTTP probing of live hosts |
Thorough reconnaissance is what separates a professional penetration test from a noisy automated scan. The time invested in understanding the target’s attack surface pays dividends in the quality and precision of your findings — and demonstrates genuine expertise to your client.