Technical SGEO Setup
Set up a site's technical foundation for visibility in both traditional search engines and AI platforms — covering crawlability, indexation, Core Web Vitals, structured data, mobile-first design, AI crawler access, and measurement infrastructure.
Tags: SEO, GEO, SGEO, technical-SEO, crawlability, Core-Web-Vitals, AI-visibility, structured-data, robots.txt
$ npx skills add The-AI-Directory-Company/(…) --skill technical-sgeo
SKILL.md

# Technical SGEO Setup

## Tool discovery

Before gathering project details, confirm which tools are available.
Ask the user directly — do not assume access to any external service.

**Free tools (no API key required):**
- [ ] WebFetch (fetch any public URL — robots.txt, sitemaps, pages)
- [ ] WebSearch (search engine queries for competitive analysis)
- [ ] Google PageSpeed Insights API (CWV data, no key needed for basic usage)
- [ ] Google Rich Results Test (structured data validation)
- [ ] Playwright MCP or Chrome DevTools MCP (browser automation)

**Paid tools (API key or MCP required):**
- [ ] Google Search Console API (requires OAuth)
- [ ] DataForSEO MCP (SERP data, keyword metrics, backlinks)
- [ ] Ahrefs API (backlink profiles, keyword research)
- [ ] Semrush API (competitive analysis, keyword data)

**The agent must:**
1. Present this checklist to the user
2. Record which tools are available
3. Pass the inventory to scripts as context
4. Fall back gracefully — every check has a free-tier path using WebFetch/WebSearch

Run `scripts/inventory-tools.py` to auto-detect available tools and generate a `tools.json` inventory for other scripts.

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What is the site URL?** (Production domain, including whether www or non-www is canonical)
2. **What platform/framework is the site built on?** (Next.js, WordPress, Shopify, custom SPA — determines rendering model, common pitfalls, and available tooling)
3. **What is the hosting/CDN provider?** (Vercel, Cloudflare, AWS CloudFront, Netlify — CDN configuration directly affects both search and AI crawler access)
4. **What is the current robots.txt status?** (Existing file contents, or confirmation that none exists)
5. **Do you have Google Search Console access?** (Required for crawl stats, index coverage, and Core Web Vitals field data)
6. **Does AI visibility matter for this site?** (If the site sells products, services, or publishes information that users ask AI assistants about, the answer is almost certainly yes)
7. **Are there known technical issues?** (Recent migration, traffic drop, indexation problems, CWV failures, rendering issues)
8. **What CMS or deployment workflow do you use?** (Determines how changes to robots.txt, sitemaps, meta tags, and structured data get deployed)

## Technical SGEO implementation template

### 1. Measurement Infrastructure

> **Scripts:** Run `scripts/inventory-tools.py` to detect available tools.
> **References:** See `references/measurement-setup.md` for detailed GSC/GA4/Bing setup walkthrough and AI referrer tracking configuration.

Set up tracking before making changes. Without measurement, you cannot verify that implementations work or detect regressions.

```
| Check                       | Status | Action                                                    | Priority |
|-----------------------------|--------|-----------------------------------------------------------|----------|
| Google Search Console (GSC) | [ ]    | Verify ownership, submit XML sitemap, review index report | High     |
| Google Analytics 4 (GA4)    | [ ]    | Install tracking, configure key events, link to GSC       | High     |
| Bing Webmaster Tools        | [ ]    | Verify site — Bing data feeds into Microsoft Copilot,     | Medium   |
|                             |        | ChatGPT (via Bing API), and other AI systems              |          |
| Server log access           | [ ]    | Confirm ability to query logs for bot user-agents:        | Medium   |
|                             |        | Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot      |          |
| CrUX / PageSpeed Insights   | [ ]    | Verify field data availability for Core Web Vitals        | Medium   |
```

**Why Bing Webmaster Tools matters for AI visibility:** Bing's index is the retrieval layer for multiple AI systems including Microsoft Copilot and ChatGPT's browsing feature. A site that is well-indexed in Bing has a structural advantage in AI citation. Set it up — it takes 10 minutes.

**Server log monitoring for AI crawlers:** Traditional analytics (GA4) does not capture bot traffic. Server logs are the only way to see how often AI crawlers visit, which pages they request, and whether they receive 200 responses. Set up a log query or dashboard filtered to known AI bot user-agents.
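
As a sketch of such a log query, the helper below counts AI crawler requests in combined-format access logs. The bot list and the log format are assumptions; extend both to match your stack.

```python
import re
from collections import Counter

# User-agent substrings for known AI crawlers (an assumption; extend as
# new bots appear).
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"]

# Combined-log-format line: ... "GET /path HTTP/1.1" 200 ... "user-agent"
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

def ai_crawler_hits(lines):
    """Count (bot, status) pairs for AI crawler requests in access-log lines."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        for bot in AI_BOTS:
            if bot in m.group("ua"):
                hits[(bot, m.group("status"))] += 1
    return hits
```

A run over a day of logs immediately shows whether AI crawlers are reaching the origin at all, and whether they are getting 200s or edge-level 403s.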

### 2. Crawlability for Search Engines

> **Scripts:** Run `scripts/check-robots-txt.py` to audit robots.txt rules. Run `scripts/validate-sitemap.py` to validate your XML sitemap. Run `scripts/check-redirect-chains.py` to find redirect chains.
> **References:** See `references/crawlability.md` for deep context on crawl mechanics, robots.txt syntax, and CDN bot-management gotchas.

Search engines must discover, access, and render every page you want indexed. Crawlability failures are silent — pages simply do not appear in results.

```
| Check                | Status | Action                                                       | Priority |
|----------------------|--------|--------------------------------------------------------------|----------|
| robots.txt           | [ ]    | Allow Googlebot and Bingbot access to all indexable content. | High     |
|                      |        | Block: /admin, /api, /internal, faceted navigation paths     |          |
| XML sitemap          | [ ]    | Create/validate sitemap. Include only 200-status canonical   | High     |
|                      |        | URLs. Submit to GSC and Bing Webmaster Tools                 |          |
| Crawl budget waste   | [ ]    | Eliminate faceted URLs, parameter variations, and duplicate  | High     |
|                      |        | paths from crawlable pages. Use robots.txt or noindex.       |          |
| Redirect chains      | [ ]    | Audit all redirects. Maximum 2 hops. Update internal links   | Medium   |
|                      |        | to point to final destinations directly                      |          |
| Server errors (5xx)  | [ ]    | Check GSC Coverage report for server errors. Aim for zero    | High     |
|                      |        | 5xx on any crawled URL                                       |          |
| Soft 404s            | [ ]    | Identify pages returning 200 status but displaying error     | Medium   |
|                      |        | content. Configure proper 404 responses                      |          |
| JavaScript rendering | [ ]    | Verify rendered HTML matches intended content. Use GSC URL   | High     |
|                      |        | Inspection "View Tested Page" to see what Google renders     |          |
```

**XML sitemap requirements:**
- Only include URLs that return 200 and have a self-referencing canonical tag
- Keep sitemap under 50,000 URLs or 50MB uncompressed per file (use a sitemap index for larger sites)
- Set `<lastmod>` dates to actual content modification dates, not the current date
- Validate with a sitemap validator before submission
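
The parsing side of that validation can be sketched as follows (status checks would require fetching each URL, which is left out here):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def exceeds_limits(urls, max_urls=50_000):
    """The sitemap protocol caps each file at 50,000 URLs (and 50MB)."""
    return len(urls) > max_urls
```

Each extracted URL can then be fetched to confirm a 200 status and a matching canonical before the sitemap is submitted.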

**JavaScript rendering verification:** If the site uses client-side rendering (React SPA, Angular, Vue without SSR), Google's crawler may not see the content. Test by comparing the raw HTML source with the rendered DOM in GSC URL Inspection. If critical content is missing from the raw source and only appears after JavaScript execution, implement server-side rendering (SSR) or static site generation (SSG) for indexable pages.
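
Without GSC access, a rough approximation is to fetch the raw HTML and look for phrases that should be present before any JavaScript runs (a sketch; `fetch_raw_html` assumes a publicly accessible URL):

```python
from urllib.request import Request, urlopen

def fetch_raw_html(url, user_agent="Mozilla/5.0"):
    """Fetch the un-rendered HTML exactly as a crawler's first request sees it."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")

def content_in_raw_html(html, phrases):
    """Return the phrases missing from server-rendered HTML. Anything listed
    here is likely injected client-side and invisible to crawlers that do
    not execute JavaScript."""
    return [p for p in phrases if p not in html]
```

If key headings or product copy come back in the missing list, the page needs SSR/SSG before it can rank or be cited reliably.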

### 3. Crawlability for AI Engines

> **Scripts:** Run `scripts/check-ai-crawler-access.py` to test whether AI crawlers can reach your pages.
> **References:** See `references/ai-crawler-access.md` for the complete AI bot user-agent table and CDN configuration guides per provider.

AI crawlers follow similar mechanics to search crawlers — they request pages via HTTP and read the response. But they have different user-agents, different CDN treatment, and different content consumption patterns. This section covers what to verify beyond standard search engine crawlability.

**robots.txt for AI crawlers:**

Check your robots.txt for rules affecting these user-agents:

```
| Bot             | Operator   | What it feeds               | Recommended |
|-----------------|------------|-----------------------------|-------------|
| GPTBot          | OpenAI     | ChatGPT training + browse   | Allow       |
| ChatGPT-User    | OpenAI     | ChatGPT live browsing       | Allow       |
| OAI-SearchBot   | OpenAI     | ChatGPT search results      | Allow       |
| ClaudeBot       | Anthropic  | Claude training + retrieval | Allow       |
| PerplexityBot   | Perplexity | Perplexity search answers   | Allow       |
| Google-Extended | Google     | Gemini training             | Allow       |
| Bytespider      | ByteDance  | TikTok AI features          | Evaluate    |
```

If your goal is AI visibility, do not block these bots. Many robots.txt files inherited blocks from a period when site owners were uncertain about AI crawling. Review and remove blocks that conflict with your visibility goals.
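
A minimal robots.txt reflecting this stance might look like the following (paths are placeholders; note that a named `User-agent` group replaces the `*` group for that bot, so any bot-specific group you add must repeat the shared Disallow rules):

```
# The * group applies to every bot not named in its own group,
# including AI crawlers — so simply not blocking them is enough.
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /internal/

Sitemap: https://example.com/sitemap.xml
```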

**CDN and WAF configuration:**

This is a common source of accidental AI bot blocking:

- **Cloudflare:** Bot Fight Mode and Super Bot Fight Mode may block AI crawlers by default. Check Security > Bots settings. Verified bots (Googlebot) are typically allowed, but AI crawlers may not be on the verified list. Create explicit Allow rules for AI bot user-agents if using aggressive bot management.
- **AWS CloudFront + WAF:** AWS WAF bot control rules may categorize AI crawlers as "unauthorized." Review your WAF rule groups.
- **Other CDNs/WAFs:** Akamai, Fastly, Sucuri, and similar services each have bot management features. Verify AI crawlers are not caught in blanket bot-blocking rules.

**Action:** After configuring, verify by checking server logs for successful 200 responses to AI crawler requests. If you see no AI crawler traffic at all, the CDN/WAF is likely blocking before requests reach your origin server.
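
A quick smoke test is to request the same URL with a browser user-agent and a bot user-agent and compare statuses. This is a sketch with a caveat: it only detects UA-string rules, since CDNs that verify bots by IP will treat a spoofed request differently from the real crawler, so server logs remain the ground truth.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def status_for(url, user_agent):
    """HTTP status the edge (CDN/WAF) serves to a given user-agent string."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=15) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # 403/503 etc. still tell us something

def likely_waf_block(bot_status, browser_status):
    """A bot UA getting 403/429/503 while a browser UA gets 200 suggests
    an edge rule, not an origin problem."""
    return browser_status == 200 and bot_status in (403, 429, 503)
```

Usage: compare `status_for(url, "Mozilla/5.0")` against `status_for(url, "GPTBot/1.2")` and feed both into `likely_waf_block`.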

**Content accessibility for AI consumption:**

AI crawlers generally cannot:
- Execute JavaScript (they read raw HTML responses)
- Authenticate or log in
- Bypass paywalls or cookie consent walls that hide content
- Process content inside iframes from different origins

For pages you want AI systems to cite, ensure the substantive content is present in the initial HTML response, not loaded via client-side JavaScript, and not gated behind interactions.

**llms.txt consideration:**

The llms.txt proposal (a plain-text file at `/llms.txt` summarizing site content for LLMs) has gained discussion but limited measurable impact. Research findings:

- SE Ranking analysis of 300K domains: no correlation between llms.txt presence and AI visibility
- OtterlyAI 90-day study: no measurable impact on AI citation rates
- ALLMO analysis of 94K+ URLs: no statistically significant benefit detected

Adding an llms.txt file is low effort and does no harm. But it should not take priority over the fundamentals in this guide — crawlability, rendering, structured data, and content quality drive AI visibility far more than a summary file.

### 4. Indexation Control

> **Scripts:** Run `scripts/check-indexation.py` to estimate indexed vs submitted page counts.

Control which pages appear in search results. Every indexed page competes for crawl budget and can dilute topical authority if it is low-quality or duplicated.

```
| Check                 | Status | Action                                                     | Priority |
|-----------------------|--------|------------------------------------------------------------|----------|
| Canonical tags        | [ ]    | Every indexable page has a self-referencing canonical.     | High     |
|                       |        | Cross-domain canonicals point to the authoritative version |          |
| noindex for low-value | [ ]    | Apply noindex to: tag/archive pages, internal search       | Medium   |
| pages                 |        | results, paginated listing pages beyond page 1, thank-you  |          |
|                       |        | pages, utility pages with no search value                  |          |
| Duplicate content     | [ ]    | Identify URL variations (trailing slash, parameters, www   | High     |
|                       |        | vs non-www, HTTP vs HTTPS) that serve identical content.   |          |
|                       |        | Resolve with canonical tags and 301 redirects              |          |
| Index coverage (GSC)  | [ ]    | Compare submitted pages (sitemap) vs indexed pages in GSC. | High     |
|                       |        | Investigate gaps — "Discovered - currently not indexed"    |          |
|                       |        | and "Crawled - currently not indexed" require action       |          |
```

**Canonical tag implementation rules:**
1. Every indexable page gets a self-referencing canonical: `<link rel="canonical" href="https://example.com/page/" />`
2. Use absolute URLs, not relative paths
3. Include the canonical in the `<head>`, not the `<body>`
4. Canonical URLs must return 200 status (not redirect)
5. Canonical must match the protocol (HTTPS) and domain (www vs non-www) you want indexed
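
These rules can be partially automated. The sketch below checks a page's canonical tag against rules 1, 2, and 5 (regex-based, and it assumes the `rel` attribute comes before `href`; rule 4 would additionally require fetching the canonical URL):

```python
import re
from urllib.parse import urlparse

CANON_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']', re.I)

def canonical_issues(html, page_url):
    """Return a list of canonical-tag problems for a page (empty = clean)."""
    m = CANON_RE.search(html)
    if not m:
        return ["missing canonical tag"]
    issues = []
    href = m.group(1)
    parsed = urlparse(href)
    if not parsed.scheme:
        issues.append("canonical is a relative URL")
    elif parsed.scheme != "https":
        issues.append("canonical is not HTTPS")
    if parsed.scheme and href.rstrip("/") != page_url.rstrip("/"):
        issues.append("canonical is not self-referencing")
    if CANON_RE.search(html, m.end()):
        issues.append("multiple canonical tags")
    return issues
```

Running this across a crawl surfaces relative canonicals and www/non-www mismatches quickly.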

**Indexation gap analysis:** In GSC, navigate to Pages > Indexing. The "Why pages aren't indexed" section lists specific reasons. The most actionable categories are:
- "Discovered - currently not indexed" — Google found the URL but chose not to index it. Usually a quality or crawl budget signal. Improve the content or consolidate with a stronger page.
- "Crawled - currently not indexed" — Google fetched the page but decided not to index it. Content may be thin, duplicative, or low-value.
- "Blocked by robots.txt" — Unintentional blocks. Fix immediately if the page should be indexed.

### 5. Core Web Vitals

> **Scripts:** Run `scripts/check-cwv.py` to pull PageSpeed Insights data (field + lab, LCP element identification).
> **References:** See `references/core-web-vitals.md` for fix patterns by framework (Next.js, WordPress, Shopify) and debugging workflows.

Core Web Vitals are a confirmed Google ranking factor. They also shape user experience, and the resulting engagement metrics influence search ranking and, plausibly, AI citation (pages with stronger engagement and authority signals tend to be surfaced more often).

```
| Metric | What It Measures          | Good    | Needs Work | Poor    |
|--------|---------------------------|---------|------------|---------|
| LCP    | Largest Contentful Paint  | < 2.5s  | 2.5-4.0s   | > 4.0s  |
| INP    | Interaction to Next Paint | < 200ms | 200-500ms  | > 500ms |
| CLS    | Cumulative Layout Shift   | < 0.1   | 0.1-0.25   | > 0.25  |
```

**Field data vs lab data:** Field data (Chrome User Experience Report / CrUX, accessible via PageSpeed Insights or GSC) reflects real users on real devices and networks. Lab data (Lighthouse, WebPageTest) reflects a simulated environment. Google uses field data for ranking decisions. If field and lab data disagree, field data is the source of truth. Sites with low traffic may not have field data — note this limitation and use lab data as a proxy.
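
Field data can be pulled programmatically. The sketch below queries the PageSpeed Insights v5 API (endpoint and response shape per Google's public documentation; no API key is needed for light usage) and extracts the CrUX field metrics when they exist:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PSI = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def fetch_psi(url, strategy="mobile", api_key=None):
    """Run PageSpeed Insights for a URL and return the parsed response."""
    params = {"url": url, "strategy": strategy}
    if api_key:
        params["key"] = api_key
    with urlopen(f"{PSI}?{urlencode(params)}", timeout=60) as resp:
        return json.load(resp)

def field_cwv(psi_response):
    """Pull CrUX field metrics if present. Returns {} for low-traffic sites
    with no field data, in which case fall back to lab (lighthouseResult)."""
    metrics = psi_response.get("loadingExperience", {}).get("metrics", {})
    return {name: {"p75": m.get("percentile"), "category": m.get("category")}
            for name, m in metrics.items()}
```

An empty result from `field_cwv` is itself a finding: the site lacks CrUX data and lab numbers must be treated as a proxy.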

**LCP optimization actions:**
1. Identify the LCP element (usually hero image, heading, or video poster). Use PageSpeed Insights — it identifies the element.
2. If image: serve in WebP/AVIF, properly sized, with `fetchpriority="high"` and no lazy loading on the LCP image
3. If text: ensure fonts load quickly — use `font-display: swap`, preload critical fonts
4. Reduce server response time (TTFB) — target under 800ms. TTFB directly delays LCP.
5. Remove render-blocking CSS and JS from the critical path

**INP optimization actions:**
1. Identify slow interactions using Chrome DevTools Performance panel or Web Vitals extension
2. Break up long tasks (>50ms) on the main thread — use `requestIdleCallback`, Web Workers, or `scheduler.yield()`
3. Reduce JavaScript bundle size — every KB of JS must be parsed and compiled
4. Defer non-critical third-party scripts (analytics, chat widgets, A/B testing)

**CLS optimization actions:**
1. Set explicit `width` and `height` attributes on images and videos
2. Reserve space for ads and dynamically injected content with CSS `min-height`
3. Avoid inserting content above existing visible content after page load
4. Use CSS `contain` on elements that resize independently

### 6. Mobile-First and HTTPS

> **Scripts:** Run `scripts/check-mobile.py` for mobile-friendliness checks. Run `scripts/check-https-security.py` to verify HTTPS and HSTS.

Google uses mobile-first indexing — the mobile version of your site is the version Google crawls and indexes. As of 2024, 62.73% of global web traffic comes from mobile devices. AI systems also primarily consume the same content Google indexes.

**Mobile verification checklist:**

```
| Check                      | Status | Action                                                     | Priority |
|----------------------------|--------|------------------------------------------------------------|----------|
| Responsive design          | [ ]    | Site renders correctly across viewport widths 320px-1440px | High     |
| Viewport meta tag          | [ ]    | <meta name="viewport" content="width=device-width,         | High     |
|                            |        | initial-scale=1"> present in <head>                        |          |
| Tap targets                | [ ]    | Interactive elements are at least 48x48px with 8px spacing | Medium   |
| Text sizing                | [ ]    | Base font size >= 16px. No text requires zooming to read   | Medium   |
| Content parity             | [ ]    | Mobile version has the same content as desktop — no hidden | High     |
|                            |        | sections, collapsed accordions with critical content, or   |          |
|                            |        | mobile-only reduced content                                |          |
| No intrusive interstitials | [ ]    | No full-screen popups that block content on mobile. Google | Medium   |
|                            |        | demotes pages with intrusive interstitials                 |          |
```

**HTTPS implementation:**

HTTPS is a non-negotiable baseline. Google has used HTTPS as a ranking signal since 2014. AI crawlers also prefer HTTPS endpoints.

- Verify all pages are served over HTTPS
- HTTP requests 301 redirect to HTTPS equivalents
- No mixed content warnings (HTTP resources loaded on HTTPS pages)
- HSTS header is set: `Strict-Transport-Security: max-age=31536000; includeSubDomains`
- SSL certificate is valid and auto-renews
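
The redirect and HSTS checks can be sketched with the standard library alone (a minimal version; a full audit would also follow redirects and scan pages for mixed content):

```python
import http.client

def parse_hsts(header):
    """Return max-age seconds from an HSTS header, or None if absent."""
    if not header:
        return None
    for part in header.split(";"):
        part = part.strip()
        if part.lower().startswith("max-age="):
            return int(part.split("=", 1)[1])
    return None

def https_baseline(host):
    """Check the HTTP->HTTPS redirect and the HSTS header for a host."""
    results = {}
    plain = http.client.HTTPConnection(host, timeout=15)
    plain.request("GET", "/")
    resp = plain.getresponse()
    # A permanent redirect (301/308) to an https:// URL is expected.
    location = resp.getheader("Location") or ""
    results["http_redirects_to_https"] = (
        resp.status in (301, 308) and location.startswith("https://"))
    plain.close()
    secure = http.client.HTTPSConnection(host, timeout=15)
    secure.request("GET", "/")
    resp = secure.getresponse()
    results["hsts_max_age"] = parse_hsts(
        resp.getheader("Strict-Transport-Security"))
    secure.close()
    return results
```

An `hsts_max_age` of at least 31536000 (one year) matches the header recommended above.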

### 7. Structured Data Foundation

> **Scripts:** Run `scripts/check-structured-data.py` to extract and validate JSON-LD from any page.
> **References:** See `references/structured-data.md` for complete JSON-LD templates per page type and validation workflow.

Structured data (Schema.org JSON-LD) helps search engines understand page content precisely and enables rich results. For AI systems, structured data provides machine-readable facts that are easier to extract and cite accurately than unstructured text.

**Implementation by page type:**

```
| Page Type      | Schema Type   | Key Properties                                     | Rich Result |
|----------------|---------------|----------------------------------------------------|-------------|
| Homepage       | Organization  | name, url, logo, sameAs (social profiles),         | Knowledge   |
|                |               | contactPoint                                       | Panel       |
| Product pages  | Product       | name, description, image, offers (price, currency, | Product     |
|                |               | availability), aggregateRating, review             | snippet     |
| Blog/articles  | Article       | headline, datePublished, dateModified, author,     | Article     |
|                |               | image, publisher                                   | snippet     |
| FAQ pages      | FAQPage       | mainEntity array of Question + acceptedAnswer      | FAQ rich    |
|                |               |                                                    | result      |
| Service pages  | Service       | name, description, provider, areaServed,           | —           |
|                |               | serviceType                                        |             |
| Local business | LocalBusiness | name, address, telephone, openingHoursSpecification| Local pack  |
```

**JSON-LD implementation template (Organization — homepage):**

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company Name",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://twitter.com/yourcompany",
    "https://linkedin.com/company/yourcompany"
  ],
  "contactPoint": {
    "@type": "ContactPoint",
    "telephone": "+1-555-555-5555",
    "contactType": "customer service"
  }
}
</script>
```

**FAQ Schema for AI citation potential:** FAQ pages with properly implemented FAQPage schema serve dual purposes. Search engines may display FAQ rich results (though Google has reduced eligibility). AI systems frequently cite well-structured Q&A content because the question-answer format maps directly to how users query AI assistants. Implement FAQPage schema on any page with genuine Q&A content.
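
A minimal FAQPage block in the same JSON-LD style (question and answer text are placeholders; one `Question` per visible Q&A pair):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Do you ship internationally?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes — we ship to most countries. Delivery times and fees vary by destination."
      }
    }
  ]
}
</script>
```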

**Validation:**
1. Test every page type with Google's Rich Results Test (https://search.google.com/test/rich-results)
2. Verify in GSC under Enhancements — check for errors and warnings
3. Structured data must match visible page content. Marking up content that is not visible to users risks a manual action from Google.
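
The extraction step of such a check can be sketched as follows (regex-based; assumes well-formed `<script>` tags):

```python
import json
import re

SCRIPT_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I)

def extract_jsonld(html):
    """Parse every JSON-LD block on a page. Invalid JSON is reported rather
    than silently skipped — a syntax error disables the whole block."""
    blocks, errors = [], []
    for raw in SCRIPT_RE.findall(html):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError as err:
            errors.append(str(err))
    return blocks, errors

def schema_types(blocks):
    """List the @type of each top-level JSON-LD object found."""
    return [b.get("@type") for b in blocks if isinstance(b, dict)]
```

Pairing this with the Rich Results Test catches both syntax errors (found here) and eligibility problems (found there).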

### 8. Verification Checklist

After completing the implementation sections above, run through this unified SEO + GEO technical readiness checklist. Every item should pass before considering the technical foundation complete.

**Crawlability and Access:**
- [ ] robots.txt allows Googlebot, Bingbot, and target AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
- [ ] robots.txt blocks only non-indexable paths (/admin, /api, /internal, faceted navigation)
- [ ] XML sitemap is valid, submitted to GSC and Bing, contains only 200-status canonical URLs
- [ ] CDN/WAF is not blocking AI crawlers — verified via server logs showing 200 responses
- [ ] No redirect chains exceed 2 hops
- [ ] Zero 5xx server errors on crawled URLs
- [ ] JavaScript-rendered content is verified accessible to Googlebot via URL Inspection

**Indexation:**
- [ ] Every indexable page has a self-referencing canonical tag with absolute URL
- [ ] Low-value pages have noindex applied
- [ ] Duplicate content resolved via canonicals and 301 redirects
- [ ] GSC index coverage reviewed — "not indexed" reasons investigated and addressed
- [ ] URL parameter handling configured to prevent duplicate indexation

**Performance:**
- [ ] LCP under 2.5s (field data, or lab data if field unavailable)
- [ ] INP under 200ms
- [ ] CLS under 0.1
- [ ] TTFB under 800ms

**Mobile and Security:**
- [ ] Responsive design verified across 320px-1440px viewports
- [ ] Viewport meta tag present
- [ ] All pages served over HTTPS with valid certificate
- [ ] HTTP to HTTPS redirects in place
- [ ] HSTS header configured

**Structured Data:**
- [ ] Organization schema on homepage
- [ ] Relevant schema type implemented per page type (Product, Article, FAQPage, etc.)
- [ ] All structured data passes Rich Results Test without errors
- [ ] Structured data matches visible page content

**AI-Specific Access:**
- [ ] Server-side rendered content available in initial HTML for AI crawlers
- [ ] No critical content gated behind JavaScript-only rendering, authentication, or cookie walls
- [ ] Server logs confirm AI crawler visits are receiving 200 responses
- [ ] Bing Webmaster Tools verified (feeds Microsoft Copilot, ChatGPT browsing)

## Quality checklist

Before delivering this implementation, verify:

- [ ] All eight sections are completed with specific findings, not generic advice
- [ ] Measurement infrastructure is set up first — changes can be verified
- [ ] robots.txt addresses both search engine and AI crawler access
- [ ] Core Web Vitals use field data where available, with lab data noted as supplementary
- [ ] Structured data is validated with Rich Results Test, not just visually inspected
- [ ] CDN/WAF configuration has been explicitly checked for AI crawler blocking
- [ ] Mobile verification covers content parity, not just responsive layout
- [ ] The Section 8 unified checklist passes — all items checked off

## Common mistakes to avoid

- **Blocking AI crawlers accidentally.** CDN bot management features (Cloudflare Bot Fight Mode, AWS WAF bot control) often block AI crawlers by default. This is the most common cause of zero AI visibility for sites that should have it. Check CDN settings and verify with server logs.
- **Using lab data only for Core Web Vitals.** Lighthouse on a developer's MacBook Pro with fiber internet does not represent real users. Field data from CrUX is what Google uses for ranking. If your Lighthouse score is 95 but field LCP is 4.2s, you have a problem.
- **Fixing technical SEO without a content strategy.** A perfectly crawlable, fast, mobile-friendly site with thin content will not rank or get cited. Technical SGEO removes barriers — content quality and topical authority drive actual visibility. Pair this skill with content-sgeo and on-page-sgeo.
- **Ignoring JavaScript rendering.** 60% of Google searches result in zero clicks — users get answers directly from search results and AI systems. If your content is invisible without JavaScript execution, it is invisible to most of the discovery ecosystem. Verify rendered HTML.
- **Treating llms.txt as a priority over fundamentals.** Adding an llms.txt file before fixing crawlability, rendering, and structured data is optimizing the wrong layer. Current research shows no measurable impact from llms.txt. Implement the fundamentals first.
- **Setting up robots.txt once and never reviewing it.** New AI crawlers appear regularly. CDN providers update their bot management rules. Review robots.txt and CDN bot settings quarterly.
- **Implementing structured data that does not match visible content.** Marking up a product with a 4.5-star rating in schema when the page shows 3.2 stars is a manual action risk. Schema must reflect exactly what users see on the page.
- **Submitting sitemaps with non-canonical or non-200 URLs.** Every URL in the sitemap should return 200 and have a self-referencing canonical. Including redirects, 404s, or non-canonical URLs wastes crawl budget and sends conflicting signals.

## Available scripts

Run these scripts to automate technical checks. Each script outputs JSON. Use `scripts/inventory-tools.py` first to detect available tools — all scripts fall back to free methods (WebFetch/WebSearch) when paid tools are unavailable.

| Script | What it checks | Run it when |
|--------|----------------|-------------|
| `inventory-tools.py` | Available tools/APIs/MCPs | First — before any other script |
| `check-robots-txt.py` | robots.txt rules for search + AI bots | Starting any technical audit |
| `validate-sitemap.py` | XML sitemap structure and URL status | Starting any technical audit |
| `check-cwv.py` | Core Web Vitals via PageSpeed Insights | Evaluating page performance |
| `check-structured-data.py` | JSON-LD schema validation | Checking structured data implementation |
| `check-https-security.py` | HTTPS, redirects, HSTS, mixed content | Verifying security baseline |
| `check-ai-crawler-access.py` | AI bot accessibility (CDN/WAF blocking) | Diagnosing zero AI visibility |
| `check-mobile.py` | Mobile viewport, tap targets, content parity | Checking mobile-first readiness |
| `check-indexation.py` | Indexed pages vs sitemap count | Diagnosing indexation gaps |
| `check-redirect-chains.py` | Redirect chain length and status codes | Finding redirect issues |
| 400 |