Ethical Hacking #reconnaissance#osint#subdomain-enumeration

Web Application Reconnaissance Methodology

A complete web app recon methodology covering Google dorks, Shodan, Subfinder, Amass, Nikto, WhatWeb, certificate transparency, and building a target profile.

Reconnaissance is the most important phase of any web application penetration test. The more you understand about a target before you send a single malicious request, the more targeted and effective your testing will be. This guide covers a structured approach to web recon, from fully passive OSINT to active enumeration with dedicated tools.

The Recon Mindset

Reconnaissance has two modes:

Passive recon: Gather information without touching the target’s servers. It relies on third-party sources, public databases, and search engines, so it leaves no trace in the target’s logs. Because only public sources are queried, it is usually acceptable before authorization is finalized, although the engagement rules and local law still apply.

Active recon: Directly query or probe the target’s infrastructure. This generates traffic that appears in server logs and may trigger security alerts. Only conduct active recon within an authorized scope.

Passive Reconnaissance

Google Dorks

Google’s search operators expose information that isn’t meant to be indexed. These are completely passive — you’re just querying Google, not the target.

Key operators:

site:target.com                       - Limit results to this domain
site:target.com filetype:pdf          - Find indexed PDF files
site:target.com inurl:admin           - URLs containing "admin"
site:target.com intitle:"index of"    - Open directory listings
site:target.com ext:sql               - SQL dump files
site:target.com "password"            - Pages containing the word "password"
"@target.com" site:linkedin.com       - Employee email addresses mentioned on LinkedIn
site:target.com -site:www.target.com  - Subdomains other than www
cache:target.com/admin                - Google's cached copy (Google retired this operator in 2024; use the Wayback Machine instead)

Combine operators for more targeted results:

site:target.com (filetype:log OR filetype:sql OR filetype:conf)
site:target.com (inurl:config OR inurl:backup OR inurl:env)

The GHDB (Google Hacking Database) at exploit-db.com/google-hacking-database maintains thousands of tested dorks organized by category.
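
If you test several domains, it can help to generate the combined dorks instead of typing them by hand. The following is a minimal sketch, not an established tool: the function name build_dorks and the chosen extension and keyword lists are assumptions made for illustration. It wraps each OR group in parentheses so the site: restriction applies to every alternative.

```python
from typing import List


def build_dorks(domain: str) -> List[str]:
    """Return a few Google dork strings scoped to one domain (hypothetical helper)."""
    file_types = ["log", "sql", "conf"]
    url_keywords = ["config", "backup", "env"]
    scope = f"site:{domain}"
    # Parenthesized OR groups keep the site: scope attached to every alternative.
    files = " OR ".join(f"filetype:{ext}" for ext in file_types)
    urls = " OR ".join(f"inurl:{kw}" for kw in url_keywords)
    return [
        f"{scope} ({files})",
        f"{scope} ({urls})",
        f'{scope} intitle:"index of"',
    ]


if __name__ == "__main__":
    for dork in build_dorks("target.com"):
        print(dork)
```

Paste the printed queries into the search box one at a time; automating the searches themselves tends to run into rate limiting.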

Shodan

Shodan is a search engine for internet-connected devices. It indexes banners, certificates, and service fingerprints from regular scans.

Useful Shodan queries:

hostname:target.com                      - All indexed hosts for the domain
org:"Target Company"                     - Hosts by organization name
ssl:"target.com"                         - Hosts with certs mentioning target.com
http.title:"Target App"                  - Find specific page titles
port:8080 hostname:target.com            - Non-standard ports
vuln:CVE-2023-44487 hostname:target.com  - Hosts flagged for a specific CVE (the vuln filter requires a higher-tier Shodan plan)

Shodan reveals:

  • Forgotten test/staging servers
  • Internal services accidentally exposed to the internet
  • Outdated software versions on public-facing services
  • Non-standard ports running sensitive services (Jenkins on 8080, etc.)

Use the Shodan CLI for scripting:

pip3 install shodan
shodan init YOUR_API_KEY
shodan search "hostname:target.com" --fields ip_str,port,transport,product
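
To turn that CLI output into something you can scan quickly, you can group the rows by IP address. This sketch assumes the output was saved as one row per line with tab-separated fields in the order ip_str, port, transport, product; check the separator your CLI version actually emits. The function name group_services is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Service = Tuple[int, str, str]  # (port, transport, product)


def group_services(lines: List[str], sep: str = "\t") -> Dict[str, List[Service]]:
    """Group 'ip, port, transport, product' rows by IP (assumed field order)."""
    hosts = defaultdict(list)
    for line in lines:
        parts = [p.strip() for p in line.rstrip("\n").split(sep)]
        # Skip rows that do not look like ip / numeric port / transport.
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        ip, port, transport = parts[0], int(parts[1]), parts[2]
        product = parts[3] if len(parts) > 3 and parts[3] else "unknown"
        hosts[ip].append((port, transport, product))
    # Sort each host's services by port for stable, readable output.
    return {ip: sorted(services) for ip, services in hosts.items()}


if __name__ == "__main__":
    sample = ["203.0.113.10\t8080\ttcp\tJenkins\n", "203.0.113.10\t443\ttcp\tnginx\n"]
    for ip, services in group_services(sample).items():
        print(ip, services)
```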

Wayback Machine

The Internet Archive’s Wayback Machine preserves historical snapshots of web pages. Useful for finding:

  • Old, removed pages that might still be functional
  • JavaScript files with API endpoints or keys
  • Former employee pages with email patterns
  • Previous content disclosures

# waybackurls: extract all URLs the Wayback Machine knows about
go install github.com/tomnomnom/waybackurls@latest
echo "target.com" | waybackurls | tee wayback_urls.txt

# Filter for interesting file types (also match URLs that carry a query string)
grep -E '\.(json|xml|env|sql|log|bak|old|config|key|pem)(\?|$)' wayback_urls.txt

# Find JS files for endpoint discovery
grep -E '\.js(\?|$)' wayback_urls.txt | sort -u
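
A regular expression anchored to the end of the line misses URLs with query strings unless you account for them. As an alternative sketch, the Python below parses each URL and tests only its path component. The function name filter_by_extension and the extension set are illustrative assumptions, not part of waybackurls.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit

INTERESTING: Tuple[str, ...] = (
    ".json", ".xml", ".env", ".sql", ".log", ".bak", ".old", ".config", ".key", ".pem",
)


def filter_by_extension(urls: Iterable[str], extensions: Iterable[str] = INTERESTING) -> List[str]:
    """Return unique URLs whose path (ignoring the query string) ends in one of the extensions."""
    suffixes = tuple(ext.lower() for ext in extensions)
    hits = set()
    for raw in urls:
        url = raw.strip()
        # urlsplit separates the path from ?query and #fragment.
        if urlsplit(url).path.lower().endswith(suffixes):
            hits.add(url)
    return sorted(hits)


if __name__ == "__main__":
    with open("wayback_urls.txt", encoding="utf-8") as handle:
        for hit in filter_by_extension(handle):
            print(hit)
```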

Certificate Transparency Logs

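When a publicly trusted certificate authority issues a TLS certificate, the certificate is recorded in public certificate transparency logs. This reveals subdomains, including internal or forgotten ones, as long as a public CA has issued a certificate for them. Hosts that use self-signed or private-CA certificates do not appear.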

# crt.sh: free web interface and JSON endpoint (%25 is the URL-encoded % wildcard)
curl -s "https://crt.sh/?q=%25.target.com&output=json" | jq -r '.[].name_value' | sort -u

# Same query, dropping wildcard entries
curl -s "https://crt.sh/?q=%25.target.com&output=json" | \
  python3 -c "import sys,json; [print(x['name_value']) for x in json.load(sys.stdin)]" | \
  sort -u | grep -vF '*'
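
A single name_value field in the crt.sh JSON can hold several names separated by newlines, and wildcard entries still tell you which parent label exists. The sketch below is one way to normalize a saved response. It strips the wildcard label rather than dropping the entry, which is a different choice from the grep filter above, and the function name names_from_crtsh is hypothetical.

```python
import json
from typing import List


def names_from_crtsh(payload: str, domain: str) -> List[str]:
    """Extract unique in-scope hostnames from a crt.sh JSON response body."""
    names = set()
    for entry in json.loads(payload):
        # name_value may contain several SAN entries, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # keep the parent of a wildcard entry
            # Keep only the domain itself or true subdomains of it.
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    with open("crtsh.json", encoding="utf-8") as handle:  # saved curl output (assumed filename)
        print("\n".join(names_from_crtsh(handle.read(), "target.com")))
```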

theHarvester

theHarvester aggregates OSINT from multiple sources to find emails, subdomains, IPs, and URLs:

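theHarvester -d target.com -b bing,certspotter,crtsh,duckduckgo -l 200

Supported sources vary by release. Depending on the version they include search engines such as Bing and DuckDuckGo, GitHub code search, Shodan, and certificate transparency logs, while older releases also scraped Google and LinkedIn. Run theHarvester -h to see what your installed version supports. It is a good early step for gathering email patterns, which inform phishing test parameters.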

Subdomain Enumeration

Subfinder — Passive Subdomain Discovery

Subfinder queries passive DNS sources (VirusTotal, Shodan, crt.sh, etc.) without touching the target:

# Install
go install -v github.com/projectdiscovery/subfinder/v2/cmd/subfinder@latest

# Basic scan
subfinder -d target.com -o subdomains.txt

# With all sources and verbose output
subfinder -d target.com -all -v -o subdomains.txt

Configure API keys in ~/.config/subfinder/provider-config.yaml for significantly more results from VirusTotal, Shodan, SecurityTrails, and others.

Amass — Comprehensive Enumeration

Amass combines passive and active enumeration techniques and builds a detailed attack graph:

# Install
go install -v github.com/owasp-amass/amass/v4/...@master

# Passive enumeration only
amass enum -passive -d target.com -o amass_passive.txt

# Active + passive (-active adds techniques such as zone transfers and certificate grabbing)
amass enum -active -d target.com -o amass_full.txt

# Visualize the attack surface graph (-o names an existing directory that receives the HTML file)
amass viz -d3 -d target.com -o .

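Amass is slower than Subfinder but more thorough, especially with API keys configured. In Amass v4 the configuration lives under ~/.config/amass/, where config.yaml points to a datasources.yaml file that holds the credentials.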
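
Since Subfinder and Amass overlap heavily, it is worth merging their results into one deduplicated list before resolving anything. This is a minimal sketch that assumes each tool wrote one hostname per line; the function name merge_subdomains and the output filename are assumptions.

```python
from typing import Iterable, List


def merge_subdomains(*sources: Iterable[str]) -> List[str]:
    """Merge several hostname lists into one sorted, deduplicated list."""
    merged = set()
    for source in sources:
        for line in source:
            # Normalize case and drop a trailing root dot so duplicates collapse.
            host = line.strip().lower().rstrip(".")
            if host and not host.startswith("#"):
                merged.add(host)
    return sorted(merged)


if __name__ == "__main__":
    with open("subdomains.txt") as subfinder, open("amass_passive.txt") as amass:
        combined = merge_subdomains(subfinder, amass)
    with open("all_subdomains.txt", "w") as out:  # hypothetical output file
        out.write("\n".join(combined) + "\n")
```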

Resolving and Validating Subdomains

Raw subdomain lists contain many false positives. Resolve them to confirm which are live:

# dnsx — fast DNS resolver
go install -v github.com/projectdiscovery/dnsx/cmd/dnsx@latest

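# Leave out -resp here: it appends record data to each line, which breaks tools that expect bare hostnames
dnsx -l subdomains.txt -silent -o live_subdomains.txt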
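
Conceptually, this step keeps only the names that resolve to at least one address. The sketch below shows that idea with the standard library and an injectable resolver so it can be tried offline. It is far slower than dnsx and does no wildcard detection, and the names system_resolver and resolve_live are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, List


def system_resolver(host: str) -> List[str]:
    """Resolve a hostname with the OS resolver; return [] when it does not resolve."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr tuple starts with the IP address.
    return sorted({info[4][0] for info in infos})


def resolve_live(
    hosts: Iterable[str],
    resolver: Callable[[str], List[str]] = system_resolver,
) -> Dict[str, List[str]]:
    """Return only the hosts that resolve, mapped to their addresses."""
    live = {}
    for host in hosts:
        addresses = resolver(host)
        if addresses:
            live[host] = addresses
    return live


if __name__ == "__main__":
    print(resolve_live(["www.target.com", "dev.target.com"]))
```

Keep in mind that resolving names sends queries to DNS servers, so treat it as the boundary between passive and active work.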

Then check which resolved hosts have active web services:

# httpx — HTTP probing
go install -v github.com/projectdiscovery/httpx/cmd/httpx@latest

cat live_subdomains.txt | httpx -status-code -title -tech-detect -o live_web.txt

Active Reconnaissance

WhatWeb — Technology Fingerprinting

WhatWeb identifies web technologies, frameworks, and server software:

whatweb https://target.com
whatweb -a 3 https://target.com  # Aggressive mode

Output identifies CMS (WordPress, Drupal), JavaScript frameworks (React, jQuery versions), server software (nginx 1.18.0, Apache 2.4.41), and security headers. Version information feeds directly into vulnerability searching.
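
Part of what a fingerprinting tool does is read product and version strings out of response headers. The sketch below illustrates only that one step on headers you have already captured. It is not how WhatWeb is implemented, and the function names parse_banner and fingerprint are made up for this example.

```python
from typing import Dict, Optional, Tuple

Banner = Tuple[str, Optional[str]]  # (product, version or None)


def parse_banner(value: str) -> Banner:
    """Split a banner such as 'nginx/1.18.0 (Ubuntu)' into product and version."""
    token = value.strip().split(" ", 1)[0]
    product, _, version = token.partition("/")
    return product, (version or None)


def fingerprint(headers: Dict[str, str]) -> Dict[str, Banner]:
    """Pull product/version pairs from the Server and X-Powered-By headers."""
    lowered = {key.lower(): val for key, val in headers.items()}
    found = {}
    for name in ("server", "x-powered-by"):
        if lowered.get(name):
            found[name] = parse_banner(lowered[name])
    return found


if __name__ == "__main__":
    print(fingerprint({"Server": "Apache/2.4.41 (Ubuntu)", "X-Powered-By": "PHP/8.1.2"}))
```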

Nikto — Web Vulnerability Scanner

Nikto scans for common misconfigurations and known vulnerabilities:

nikto -h https://target.com
nikto -h https://target.com -o nikto_results.html -Format html
nikto -h https://target.com -ssl -p 443

Nikto checks for:

  • Missing security headers (X-Frame-Options, Content-Security-Policy, HSTS)
  • Default files and directories
  • Outdated software versions
  • Common file exposures (.git, .svn, robots.txt disclosures)
  • Known CVEs for identified software

Nikto is not stealthy — it generates significant log noise. Only run it during authorized active testing.
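
The missing-header check is simple enough to reproduce on a response you already have, which avoids a full scan when that is all you need. This is a rough sketch under the assumption that the three headers named above are the ones you care about. The function name missing_security_headers is hypothetical, and this is not Nikto's own logic.

```python
from typing import Dict, List

# HSTS is delivered through the Strict-Transport-Security header.
EXPECTED_HEADERS = (
    "X-Frame-Options",
    "Content-Security-Policy",
    "Strict-Transport-Security",
)


def missing_security_headers(headers: Dict[str, str]) -> List[str]:
    """Return the expected security headers that are absent (case-insensitive match)."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]


if __name__ == "__main__":
    captured = {"Server": "nginx", "X-Frame-Options": "DENY"}
    print(missing_security_headers(captured))
```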

Directory Enumeration

# Gobuster
gobuster dir -u https://target.com -w /usr/share/seclists/Discovery/Web-Content/common.txt -o dirs.txt

# FFUF with auto-calibration
ffuf -u https://target.com/FUZZ -w /usr/share/seclists/Discovery/Web-Content/directory-list-2.3-medium.txt -ac -o ffuf_results.json -of json
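
Auto-calibration matters because many sites answer every unknown path with the same "soft 404" page. The general idea is to request a few random paths first, record what the responses look like, and then discard any result that looks the same. The sketch below is a simplified illustration of that idea on already-collected (status, length) pairs. It is not FFUF's actual algorithm, and the names calibrate and filter_hits are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Response = Tuple[int, int]  # (status code, body length in bytes)


def calibrate(baseline: Iterable[Response]) -> Set[Response]:
    """Collect the response signatures seen for random, nonexistent paths."""
    return set(baseline)


def filter_hits(results: Dict[str, Response], noise: Set[Response]) -> List[str]:
    """Keep paths whose response is neither a plain 404 nor a baseline signature."""
    return sorted(
        path
        for path, response in results.items()
        if response not in noise and response[0] != 404
    )


if __name__ == "__main__":
    noise = calibrate([(200, 1534), (200, 1534)])
    results = {"admin": (200, 8120), "nothing": (200, 1534), "backup": (403, 277)}
    print(filter_hits(results, noise))
```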

Building a Target Profile

After gathering all recon data, synthesize it into a structured profile:

Target Profile Template

# Target: target.com
## Scope
- IP ranges: 203.0.113.0/24
- Domains: target.com, *.target.com

## Subdomains (live)
- www.target.com — Main site, WordPress 6.4, nginx
- admin.target.com — Admin portal, Apache 2.4.54, PHP 8.1
- api.target.com — REST API, returns JSON
- staging.target.com — Staging environment, same stack
- dev.target.com — Development server, self-signed cert

## Technology Stack
- Frontend: React 18, jQuery 3.6
- Backend: PHP 8.1, Laravel
- Database: MySQL (inferred from error messages)
- Server: nginx/1.22.1
- CDN: Cloudflare

## Interesting Findings
- /.git/ directory exposed on dev.target.com
- /backup/ returns 403 on admin.target.com
- robots.txt disallows /api/internal/ and /admin/
- Old login page at /login-old.php (found via Wayback Machine)
- SSL cert issued 2024-03-15, SAN includes internal.target.com

## Email Pattern
- firstname.lastname@target.com (confirmed via LinkedIn + theHarvester)
- Employees identified: 12 via LinkedIn scraping
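
If you keep recon data in scripts, you can render part of this template from structured records rather than editing it by hand. The following sketch covers only the subdomain and findings sections. The class names Host and TargetProfile and their fields are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str
    notes: str


@dataclass
class TargetProfile:
    domain: str
    hosts: List[Host] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the profile using the same headings as the template."""
        lines = [f"# Target: {self.domain}", "## Subdomains (live)"]
        lines += [f"- {host.name}: {host.notes}" for host in self.hosts]
        lines.append("## Interesting Findings")
        lines += [f"- {finding}" for finding in self.findings]
        return "\n".join(lines)


if __name__ == "__main__":
    profile = TargetProfile(
        "target.com",
        [Host("dev.target.com", "Development server, self-signed cert")],
        ["/.git/ directory exposed on dev.target.com"],
    )
    print(profile.render())
```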

Tools Summary Table

Tool                           Type              Purpose
Google Dorks                   Passive           File discovery, subdomain clues
Shodan                         Passive           Exposed services and ports
Wayback Machine / waybackurls  Passive           Historical URLs and JS files
crt.sh                         Passive           Certificate transparency subdomains
theHarvester                   Passive           Emails, subdomains, IPs
Subfinder                      Passive           Subdomain enumeration
Amass                          Passive + Active  Comprehensive subdomain graph
WhatWeb                        Active            Technology fingerprinting
Nikto                          Active            Vulnerability scanning
Gobuster / FFUF                Active            Directory and endpoint brute-force
httpx                          Active            HTTP probing of live hosts

Thorough reconnaissance is what separates a professional penetration test from a noisy automated scan. The time invested in understanding the target’s attack surface pays dividends in the quality and precision of your findings — and demonstrates genuine expertise to your client.
