
Technical SGEO Setup

Set up a site's technical foundation for visibility in both traditional search engines and AI platforms — covering crawlability, indexation, Core Web Vitals, structured data, mobile-first design, AI crawler access, and measurement infrastructure.

Tags: SEO, GEO, SGEO, technical-SEO, crawlability, Core-Web-Vitals, AI-visibility, structured-data, robots.txt

Works well with agents

SEO Specialist Agent, Frontend Engineer Agent, Performance Engineer Agent, DevOps Engineer Agent

Works well with skills

Technical SEO Audit, Performance Audit, On-Page SGEO Optimization, Content SGEO Strategy, Off-Page SGEO Authority
$ npx skills add The-AI-Directory-Company/(…) --skill technical-sgeo
technical-sgeo/
    • ai-crawler-access.md (8.3 KB)
    • core-web-vitals.md (7.7 KB)
    • crawlability.md (7.6 KB)
    • measurement-setup.md (6.5 KB)
    • structured-data.md (11.1 KB)
    • check-ai-crawler-access.py (7.2 KB)
    • check-cwv.py (9.4 KB)
    • check-https-security.py (9.3 KB)
    • check-indexation.py (7.7 KB)
    • check-mobile.py (9.3 KB)
    • check-redirect-chains.py (7.3 KB)
    • check-robots-txt.py (7.7 KB)
    • check-structured-data.py (9.9 KB)
    • inventory-tools.py (6.4 KB)
    • validate-sitemap.py (10.0 KB)
  • SKILL.md (28.3 KB)
SKILL.md
# Technical SGEO Setup

## Tool discovery

Before gathering project details, confirm which tools are available.
Ask the user directly — do not assume access to any external service.

**Free tools (no API key required):**
- [ ] WebFetch (fetch any public URL — robots.txt, sitemaps, pages)
- [ ] WebSearch (search engine queries for competitive analysis)
- [ ] Google PageSpeed Insights API (CWV data, no key needed for basic usage)
- [ ] Google Rich Results Test (structured data validation)
- [ ] Playwright MCP or Chrome DevTools MCP (browser automation)

**Paid tools (API key or MCP required):**
- [ ] Google Search Console API (requires OAuth)
- [ ] DataForSEO MCP (SERP data, keyword metrics, backlinks)
- [ ] Ahrefs API (backlink profiles, keyword research)
- [ ] Semrush API (competitive analysis, keyword data)

**The agent must:**
1. Present this checklist to the user
2. Record which tools are available
3. Pass the inventory to scripts as context
4. Fall back gracefully — every check has a free-tier path using WebFetch/WebSearch

Run `scripts/inventory-tools.py` to auto-detect available tools and generate a `tools.json` inventory for other scripts.
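
The exact shape of the inventory file is defined by `inventory-tools.py`; as a rough illustration (every key below is hypothetical, not the script's guaranteed output), a `tools.json` might look like:

```json
{
  "webfetch": true,
  "websearch": true,
  "pagespeed_insights": true,
  "gsc_api": false,
  "dataforseo_mcp": false,
  "ahrefs_api": false,
  "semrush_api": false
}
```

Downstream scripts can then branch on these flags to choose a paid-API path or the free WebFetch/WebSearch fallback.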

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What is the site URL?** (Production domain, including whether www or non-www is canonical)
2. **What platform/framework is the site built on?** (Next.js, WordPress, Shopify, custom SPA — determines rendering model, common pitfalls, and available tooling)
3. **What is the hosting/CDN provider?** (Vercel, Cloudflare, AWS CloudFront, Netlify — CDN configuration directly affects both search and AI crawler access)
4. **What is the current robots.txt status?** (Existing file contents, or confirmation that none exists)
5. **Do you have Google Search Console access?** (Required for crawl stats, index coverage, and Core Web Vitals field data)
6. **Does AI visibility matter for this site?** (If the site sells products, services, or publishes information that users ask AI assistants about, the answer is almost certainly yes)
7. **Are there known technical issues?** (Recent migration, traffic drop, indexation problems, CWV failures, rendering issues)
8. **What CMS or deployment workflow do you use?** (Determines how changes to robots.txt, sitemaps, meta tags, and structured data get deployed)

## Technical SGEO implementation template

### 1. Measurement Infrastructure

> **Scripts:** Run `scripts/inventory-tools.py` to detect available tools.
> **References:** See `references/measurement-setup.md` for detailed GSC/GA4/Bing setup walkthrough and AI referrer tracking configuration.

Set up tracking before making changes. Without measurement, you cannot verify that implementations work or detect regressions.

```
| Check                       | Status | Action                                                    | Priority |
|-----------------------------|--------|-----------------------------------------------------------|----------|
| Google Search Console (GSC) | [ ]    | Verify ownership, submit XML sitemap, review index report | High     |
| Google Analytics 4 (GA4)    | [ ]    | Install tracking, configure key events, link to GSC       | High     |
| Bing Webmaster Tools        | [ ]    | Verify site — Bing data feeds into Microsoft Copilot,     | Medium   |
|                             |        | ChatGPT (via Bing API), and other AI systems              |          |
| Server log access           | [ ]    | Confirm ability to query logs for bot user-agents:        | Medium   |
|                             |        | Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot      |          |
| CrUX / PageSpeed Insights   | [ ]    | Verify field data availability for Core Web Vitals        | Medium   |
```

**Why Bing Webmaster Tools matters for AI visibility:** Bing's index is the retrieval layer for multiple AI systems including Microsoft Copilot and ChatGPT's browsing feature. A site that is well-indexed in Bing has a structural advantage in AI citation. Set it up — it takes 10 minutes.

**Server log monitoring for AI crawlers:** Traditional analytics (GA4) does not capture bot traffic. Server logs are the only way to see how often AI crawlers visit, which pages they request, and whether they receive 200 responses. Set up a log query or dashboard filtered to known AI bot user-agents.
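
As a sketch of that log query, the following standalone Python (assuming combined-log-format access logs; the user-agent list mirrors the bot table in section 3) counts AI crawler hits per status code:

```python
import re
from collections import Counter

# User-agent substrings for known AI crawlers (extend as new bots appear).
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "Bytespider"]

# Combined log format: ... "GET /x HTTP/1.1" 200 512 "referer" "user-agent"
# Assumes a numeric bytes field; adjust if your server logs "-" for zero bytes.
LOG_LINE = re.compile(r'" (\d{3}) \d+ "[^"]*" "([^"]*)"')

def ai_bot_hits(log_lines):
    """Count (bot, status) pairs for AI crawler requests in access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        status, user_agent = match.groups()
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[(bot, status)] += 1
    return hits
```

Run it over a day's access log; a bot whose 403/503 counts dominate its 200s is the signature of CDN or WAF blocking described in section 3.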

### 2. Crawlability for Search Engines

> **Scripts:** Run `scripts/check-robots-txt.py` to audit robots.txt rules. Run `scripts/validate-sitemap.py` to validate your XML sitemap. Run `scripts/check-redirect-chains.py` to find redirect chains.
> **References:** See `references/crawlability.md` for deep context on crawl mechanics, robots.txt syntax, and CDN bot-management gotchas.

Search engines must discover, access, and render every page you want indexed. Crawlability failures are silent — pages simply do not appear in results.

```
| Check                | Status | Action                                                       | Priority |
|----------------------|--------|--------------------------------------------------------------|----------|
| robots.txt           | [ ]    | Allow Googlebot and Bingbot access to all indexable content. | High     |
|                      |        | Block: /admin, /api, /internal, faceted navigation paths     |          |
| XML sitemap          | [ ]    | Create/validate sitemap. Include only 200-status canonical   | High     |
|                      |        | URLs. Submit to GSC and Bing Webmaster Tools                 |          |
| Crawl budget waste   | [ ]    | Eliminate faceted URLs, parameter variations, and duplicate  | High     |
|                      |        | paths from crawlable pages. Use robots.txt or noindex.       |          |
| Redirect chains      | [ ]    | Audit all redirects. Maximum 2 hops. Update internal links   | Medium   |
|                      |        | to point to final destinations directly                      |          |
| Server errors (5xx)  | [ ]    | Check GSC Coverage report for server errors. Aim for zero    | High     |
|                      |        | 5xx on any crawled URL                                       |          |
| Soft 404s            | [ ]    | Identify pages returning 200 status but displaying error     | Medium   |
|                      |        | content. Configure proper 404 responses                      |          |
| JavaScript rendering | [ ]    | Verify rendered HTML matches intended content. Use GSC URL   | High     |
|                      |        | Inspection "View Tested Page" to see what Google renders     |          |
```

**XML sitemap requirements:**
- Only include URLs that return 200 and have a self-referencing canonical tag
- Keep sitemap under 50,000 URLs or 50MB uncompressed per file (use sitemap index for larger sites)
- Set `<lastmod>` dates to actual content modification dates, not the current date
- Validate with a sitemap validator before submission
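
Some of these requirements can be linted offline. A minimal sketch with the standard library (it checks URL count and `<loc>` presence/protocol only; live 200-status and canonical verification still require fetching each URL):

```python
import xml.etree.ElementTree as ET

# Sitemaps use a fixed XML namespace; ElementTree needs it spelled out.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def lint_sitemap(xml_text, max_urls=50_000):
    """Return a list of problems found in a sitemap XML string (offline checks)."""
    problems = []
    root = ET.fromstring(xml_text)
    urls = root.findall(f"{SITEMAP_NS}url")
    if len(urls) > max_urls:
        problems.append(f"{len(urls)} URLs exceeds the {max_urls} per-file limit")
    for url in urls:
        loc = url.find(f"{SITEMAP_NS}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append("url entry missing <loc>")
        elif not loc.text.strip().startswith("https://"):
            problems.append(f"non-HTTPS URL: {loc.text.strip()}")
    return problems
```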

**JavaScript rendering verification:** If the site uses client-side rendering (React SPA, Angular, Vue without SSR), Google's crawler may not see the content. Test by comparing the raw HTML source with the rendered DOM in GSC URL Inspection. If critical content is missing from the raw source and only appears after JavaScript execution, implement server-side rendering (SSR) or static site generation (SSG) for indexable pages.
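
A quick proxy for this comparison: fetch the raw HTML with a plain HTTP client (no JavaScript execution) and check whether key phrases from the rendered page are present. A minimal sketch, with the fetch split from the pure check so the latter can run offline:

```python
import urllib.request

def phrases_in_html(html, phrases):
    """Report which key phrases appear in an HTML string."""
    return {phrase: (phrase in html) for phrase in phrases}

def check_raw_render(url, phrases):
    """Fetch the raw HTML (no JavaScript execution, roughly what crawlers see
    pre-render) and report which key phrases are present."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return phrases_in_html(html, phrases)
```

Phrases that render in the browser but come back `False` here are invisible to non-rendering crawlers, which is the signal to move that content into SSR/SSG output.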

### 3. Crawlability for AI Engines

> **Scripts:** Run `scripts/check-ai-crawler-access.py` to test whether AI crawlers can reach your pages.
> **References:** See `references/ai-crawler-access.md` for the complete AI bot user-agent table and CDN configuration guides per provider.

AI crawlers follow similar mechanics to search crawlers — they request pages via HTTP and read the response. But they have different user-agents, different CDN treatment, and different content consumption patterns. This section covers what to verify beyond standard search engine crawlability.

**robots.txt for AI crawlers:**

Check your robots.txt for rules affecting these user-agents:

```
| Bot             | Operator   | What it feeds               | Recommended |
|-----------------|------------|-----------------------------|-------------|
| GPTBot          | OpenAI     | ChatGPT training + browse   | Allow       |
| ChatGPT-User    | OpenAI     | ChatGPT live browsing       | Allow       |
| OAI-SearchBot   | OpenAI     | ChatGPT search results      | Allow       |
| ClaudeBot       | Anthropic  | Claude training + retrieval | Allow       |
| PerplexityBot   | Perplexity | Perplexity search answers   | Allow       |
| Google-Extended | Google     | Gemini training             | Allow       |
| Bytespider      | ByteDance  | TikTok AI features          | Evaluate    |
```

If your goal is AI visibility, do not block these bots. Many robots.txt files inherited blocks from a period when site owners were uncertain about AI crawling. Review and remove blocks that conflict with your visibility goals.
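
These rules can be checked mechanically with the standard library's robots.txt parser. A small sketch (bot names taken from the table above):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot",
           "ClaudeBot", "PerplexityBot", "Google-Extended"]

def ai_bot_access(robots_txt, path="/"):
    """Given robots.txt contents, report which AI crawlers may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}
```

A bot blocked by its own `User-agent` group while `*` allows everything (a common inherited state) shows up immediately as the one `False` in the result.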

**CDN and WAF configuration:**

This is a common source of accidental AI bot blocking:

- **Cloudflare:** Bot Fight Mode and Super Bot Fight Mode may block AI crawlers by default. Check Security > Bots settings. Verified bots (Googlebot) are typically allowed, but AI crawlers may not be on the verified list. Create explicit Allow rules for AI bot user-agents if using aggressive bot management.
- **AWS CloudFront + WAF:** AWS WAF bot control rules may categorize AI crawlers as "unauthorized." Review your WAF rule groups.
- **Other CDNs/WAFs:** Akamai, Fastly, Sucuri, and similar services each have bot management features. Verify AI crawlers are not caught in blanket bot-blocking rules.

**Action:** After configuring, verify by checking server logs for successful 200 responses to AI crawler requests. If you see no AI crawler traffic at all, the CDN/WAF is likely blocking before requests reach your origin server.

**Content accessibility for AI consumption:**

AI crawlers generally cannot:
- Execute JavaScript (they read raw HTML responses)
- Authenticate or log in
- Bypass paywalls or cookie consent walls that hide content
- Process content inside iframes from different origins

For pages you want AI systems to cite, ensure the substantive content is present in the initial HTML response, not loaded via client-side JavaScript, and not gated behind interactions.

**llms.txt consideration:**

The llms.txt proposal (a plain-text file at `/llms.txt` summarizing site content for LLMs) has gained discussion but limited measurable impact. Research findings:

- SE Ranking analysis of 300K domains: no correlation between llms.txt presence and AI visibility
- OtterlyAI 90-day study: no measurable impact on AI citation rates
- ALLMO analysis of 94K+ URLs: no statistically significant benefit detected

Adding an llms.txt file is low effort and does no harm. But it should not take priority over the fundamentals in this guide — crawlability, rendering, structured data, and content quality drive AI visibility far more than a summary file.

### 4. Indexation Control

> **Scripts:** Run `scripts/check-indexation.py` to estimate indexed vs submitted page counts.

Control which pages appear in search results. Every indexed page competes for crawl budget and can dilute topical authority if it is low-quality or duplicated.

```
| Check                 | Status | Action                                                     | Priority |
|-----------------------|--------|------------------------------------------------------------|----------|
| Canonical tags        | [ ]    | Every indexable page has a self-referencing canonical.     | High     |
|                       |        | Cross-domain canonicals point to the authoritative version |          |
| noindex for low-value | [ ]    | Apply noindex to: tag/archive pages, internal search       | Medium   |
| pages                 |        | results, paginated listing pages beyond page 1, thank-you  |          |
|                       |        | pages, utility pages with no search value                  |          |
| Duplicate content     | [ ]    | Identify URL variations (trailing slash, parameters, www   | High     |
|                       |        | vs non-www, HTTP vs HTTPS) that serve identical content.   |          |
|                       |        | Resolve with canonical tags and 301 redirects              |          |
| Index coverage (GSC)  | [ ]    | Compare submitted pages (sitemap) vs indexed pages in GSC. | High     |
|                       |        | Investigate gaps — "Discovered - currently not indexed"    |          |
|                       |        | and "Crawled - currently not indexed" require action       |          |
```

**Canonical tag implementation rules:**
1. Every indexable page gets a self-referencing canonical: `<link rel="canonical" href="https://example.com/page/" />`
2. Use absolute URLs, not relative paths
3. Include the canonical in the `<head>`, not the `<body>`
4. Canonical URLs must return 200 status (not redirect)
5. Canonical must match the protocol (HTTPS) and domain (www vs non-www) you want indexed
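
Rules 1 and 2 can be spot-checked per page with a small parser. This sketch classifies a page's canonical tag (it does not verify rule 3's head placement or rule 4's 200 status, which need extra steps):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect href values of <link rel="canonical"> tags."""
    def __init__(self):
        super().__init__()
        self.canonicals = []
    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonicals.append(attr.get("href") or "")

def check_canonical(html, page_url):
    """Return (status, detail) for the page's canonical tag."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return ("missing", None)
    if len(finder.canonicals) > 1:
        return ("multiple", finder.canonicals)
    href = finder.canonicals[0]
    if not href.startswith(("http://", "https://")):
        return ("relative", href)  # violates rule 2: absolute URLs only
    if href.rstrip("/") == page_url.rstrip("/"):
        return ("self-referencing", href)
    return ("points-elsewhere", href)
```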

**Indexation gap analysis:** In GSC, navigate to Pages > Indexing. The "Why pages aren't indexed" section lists specific reasons. The most actionable categories are:
- "Discovered - currently not indexed" — Google found the URL but chose not to index it. Usually a quality or crawl budget signal. Improve the content or consolidate with a stronger page.
- "Crawled - currently not indexed" — Google fetched the page but decided not to index it. Content may be thin, duplicative, or low-value.
- "Blocked by robots.txt" — Unintentional blocks. Fix immediately if the page should be indexed.

### 5. Core Web Vitals

> **Scripts:** Run `scripts/check-cwv.py` to pull PageSpeed Insights data (field + lab, LCP element identification).
> **References:** See `references/core-web-vitals.md` for fix patterns by framework (Next.js, WordPress, Shopify) and debugging workflows.

Core Web Vitals are a confirmed Google ranking factor. They also affect user experience, which affects engagement metrics that influence both search ranking and AI citation (AI systems learn from pages with higher engagement and authority signals).

```
| Metric | What It Measures          | Good    | Needs Work | Poor    |
|--------|---------------------------|---------|------------|---------|
| LCP    | Largest Contentful Paint  | < 2.5s  | 2.5-4.0s   | > 4.0s  |
| INP    | Interaction to Next Paint | < 200ms | 200-500ms  | > 500ms |
| CLS    | Cumulative Layout Shift   | < 0.1   | 0.1-0.25   | > 0.25  |
```
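
Encoded as code, the thresholds above give a tiny classifier you can apply to CrUX field values (boundary values fall into "Needs Work", matching the table's ranges):

```python
# Thresholds from the table above: (good-below, poor-above) per metric.
THRESHOLDS = {
    "LCP": (2.5, 4.0),    # seconds
    "INP": (200, 500),    # milliseconds
    "CLS": (0.1, 0.25),   # unitless layout-shift score
}

def classify(metric, value):
    """Map a field-data value to Good / Needs Work / Poor."""
    good, poor = THRESHOLDS[metric]
    if value < good:
        return "Good"
    if value <= poor:
        return "Needs Work"
    return "Poor"
```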

**Field data vs lab data:** Field data (Chrome User Experience Report / CrUX, accessible via PageSpeed Insights or GSC) reflects real users on real devices and networks. Lab data (Lighthouse, WebPageTest) reflects a simulated environment. Google uses field data for ranking decisions. If field and lab data disagree, field data is the source of truth. Sites with low traffic may not have field data — note this limitation and use lab data as a proxy.

**LCP optimization actions:**
1. Identify the LCP element (usually the hero image, heading, or video poster); PageSpeed Insights reports which element it is
2. If image: serve in WebP/AVIF, properly sized, with `fetchpriority="high"` and no lazy loading on the LCP image
3. If text: ensure fonts load quickly — use `font-display: swap`, preload critical fonts
4. Reduce server response time (TTFB) — target under 800ms. TTFB directly delays LCP.
5. Remove render-blocking CSS and JS from the critical path

**INP optimization actions:**
1. Identify slow interactions using Chrome DevTools Performance panel or Web Vitals extension
2. Break up long tasks (>50ms) on the main thread — use `requestIdleCallback`, Web Workers, or `scheduler.yield()`
3. Reduce JavaScript bundle size — every KB of JS must be parsed and compiled
4. Defer non-critical third-party scripts (analytics, chat widgets, A/B testing)

**CLS optimization actions:**
1. Set explicit `width` and `height` attributes on images and videos
2. Reserve space for ads and dynamically injected content with CSS `min-height`
3. Avoid inserting content above existing visible content after page load
4. Use CSS `contain` on elements that resize independently

### 6. Mobile-First and HTTPS

> **Scripts:** Run `scripts/check-mobile.py` for mobile-friendliness checks. Run `scripts/check-https-security.py` to verify HTTPS and HSTS.

Google uses mobile-first indexing — the mobile version of your site is the version Google crawls and indexes. As of 2024, 62.73% of global web traffic comes from mobile devices. AI systems also primarily consume the same content Google indexes.

**Mobile verification checklist:**

```
| Check                      | Status | Action                                                     | Priority |
|----------------------------|--------|------------------------------------------------------------|----------|
| Responsive design          | [ ]    | Site renders correctly across viewport widths 320px-1440px | High     |
| Viewport meta tag          | [ ]    | <meta name="viewport" content="width=device-width,         | High     |
|                            |        | initial-scale=1"> present in <head>                        |          |
| Tap targets                | [ ]    | Interactive elements are at least 48x48px with 8px spacing | Medium   |
| Text sizing                | [ ]    | Base font size >= 16px. No text requires zooming to read   | Medium   |
| Content parity             | [ ]    | Mobile version has the same content as desktop — no hidden | High     |
|                            |        | sections, collapsed accordions with critical content, or   |          |
|                            |        | mobile-only reduced content                                |          |
| No intrusive interstitials | [ ]    | No full-screen popups that block content on mobile. Google | Medium   |
|                            |        | demotes pages with intrusive interstitials                 |          |
```

**HTTPS implementation:**

HTTPS is a non-negotiable baseline. Google has used HTTPS as a ranking signal since 2014. AI crawlers also prefer HTTPS endpoints.

- Verify all pages are served over HTTPS
- HTTP requests 301 redirect to HTTPS equivalents
- No mixed content warnings (HTTP resources loaded on HTTPS pages)
- HSTS header is set: `Strict-Transport-Security: max-age=31536000; includeSubDomains`
- SSL certificate is valid and auto-renews
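
The HSTS item can be verified from response headers alone. A minimal sketch that accepts a header dict (e.g. from `urllib` or `requests`) and enforces the one-year `max-age` shown above:

```python
def hsts_ok(headers, min_age=31536000):
    """Check a response-header dict for a sufficiently long HSTS policy."""
    value = headers.get("Strict-Transport-Security", "")
    for directive in (d.strip() for d in value.split(";")):
        if directive.lower().startswith("max-age="):
            try:
                return int(directive.split("=", 1)[1]) >= min_age
            except ValueError:
                return False  # malformed max-age value
    return False  # header absent or max-age directive missing
```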

### 7. Structured Data Foundation

> **Scripts:** Run `scripts/check-structured-data.py` to extract and validate JSON-LD from any page.
> **References:** See `references/structured-data.md` for complete JSON-LD templates per page type and validation workflow.

Structured data (Schema.org JSON-LD) helps search engines understand page content precisely and enables rich results. For AI systems, structured data provides machine-readable facts that are easier to extract and cite accurately than unstructured text.

**Implementation by page type:**

```
| Page Type      | Schema Type   | Key Properties                                      | Rich Result |
|----------------|---------------|-----------------------------------------------------|-------------|
| Homepage       | Organization  | name, url, logo, sameAs (social profiles),          | Knowledge   |
|                |               | contactPoint                                        | Panel       |
| Product pages  | Product       | name, description, image, offers (price, currency,  | Product     |
|                |               | availability), aggregateRating, review              | snippet     |
| Blog/articles  | Article       | headline, datePublished, dateModified, author,      | Article     |
|                |               | image, publisher                                    | snippet     |
| FAQ pages      | FAQPage       | mainEntity array of Question + acceptedAnswer       | FAQ rich    |
|                |               |                                                     | result      |
| Service pages  | Service       | name, description, provider, areaServed,            | —           |
|                |               | serviceType                                         |             |
| Local business | LocalBusiness | name, address, telephone, openingHoursSpecification | Local pack  |
```

**JSON-LD implementation template (Organization — homepage):**

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company Name",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://twitter.com/yourcompany",
    "https://linkedin.com/company/yourcompany"
  ],
  "contactPoint": {
    "@type": "ContactPoint",
    "telephone": "+1-555-555-5555",
    "contactType": "customer service"
  }
}
</script>
```

**FAQ Schema for AI citation potential:** FAQ pages with properly implemented FAQPage schema serve dual purposes. Search engines may display FAQ rich results (though Google has reduced eligibility). AI systems frequently cite well-structured Q&A content because the question-answer format maps directly to how users query AI assistants. Implement FAQPage schema on any page with genuine Q&A content.
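
As an illustration (the question and answer text below is placeholder content; use the page's actual visible Q&A), a minimal FAQPage block to embed in a `<script type="application/ld+json">` tag:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Example question from the page?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The answer exactly as it appears in the page's visible content."
      }
    }
  ]
}
```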

**Validation:**
1. Test every page type with Google's Rich Results Test (https://search.google.com/test/rich-results)
2. Verify in GSC under Enhancements — check for errors and warnings
3. Structured data must match visible page content. Marking up content that is not visible to users risks a manual action from Google.
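
Manual testing can be complemented with an automated pre-deploy lint. This sketch extracts JSON-LD blocks from a page and checks a deliberately small, illustrative required-property list; extend `REQUIRED` for each schema type you use:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Pull the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []
    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False
    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.blocks.append(data)

# Illustrative minimum per type; not Google's full eligibility requirements.
REQUIRED = {"Organization": ["name", "url", "logo"]}

def lint_jsonld(html):
    """Parse each JSON-LD block and flag missing required properties."""
    extractor = JSONLDExtractor()
    extractor.feed(html)
    problems = []
    for raw in extractor.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON: {exc}")
            continue
        for prop in REQUIRED.get(data.get("@type", ""), []):
            if prop not in data:
                problems.append(f"{data['@type']} missing '{prop}'")
    return problems
```

This catches broken JSON and obviously incomplete markup before deploy; the Rich Results Test remains the authority on rich-result eligibility.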

### 8. Verification Checklist

After completing the implementation sections above, run through this unified SEO + GEO technical readiness checklist. Every item should pass before considering the technical foundation complete.

**Crawlability and Access:**
- [ ] robots.txt allows Googlebot, Bingbot, and target AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
- [ ] robots.txt blocks only non-indexable paths (/admin, /api, /internal, faceted navigation)
- [ ] XML sitemap is valid, submitted to GSC and Bing, contains only 200-status canonical URLs
- [ ] CDN/WAF is not blocking AI crawlers — verified via server logs showing 200 responses
- [ ] No redirect chains exceed 2 hops
- [ ] Zero 5xx server errors on crawled URLs
- [ ] JavaScript-rendered content is verified accessible to Googlebot via URL Inspection

**Indexation:**
- [ ] Every indexable page has a self-referencing canonical tag with absolute URL
- [ ] Low-value pages have noindex applied
- [ ] Duplicate content resolved via canonicals and 301 redirects
- [ ] GSC index coverage reviewed — "not indexed" reasons investigated and addressed
- [ ] URL parameter handling configured to prevent duplicate indexation

**Performance:**
- [ ] LCP under 2.5s (field data, or lab data if field unavailable)
- [ ] INP under 200ms
- [ ] CLS under 0.1
- [ ] TTFB under 800ms

**Mobile and Security:**
- [ ] Responsive design verified across 320px-1440px viewports
- [ ] Viewport meta tag present
- [ ] All pages served over HTTPS with valid certificate
- [ ] HTTP to HTTPS redirects in place
- [ ] HSTS header configured

**Structured Data:**
- [ ] Organization schema on homepage
- [ ] Relevant schema type implemented per page type (Product, Article, FAQPage, etc.)
- [ ] All structured data passes Rich Results Test without errors
- [ ] Structured data matches visible page content

**AI-Specific Access:**
- [ ] Server-side rendered content available in initial HTML for AI crawlers
- [ ] No critical content gated behind JavaScript-only rendering, authentication, or cookie walls
- [ ] Server logs confirm AI crawler visits are receiving 200 responses
- [ ] Bing Webmaster Tools verified (feeds Microsoft Copilot, ChatGPT browsing)

## Quality checklist

Before delivering this implementation, verify:

- [ ] All eight sections are completed with specific findings, not generic advice
- [ ] Measurement infrastructure is set up first — changes can be verified
- [ ] robots.txt addresses both search engine and AI crawler access
- [ ] Core Web Vitals use field data where available, with lab data noted as supplementary
- [ ] Structured data is validated with Rich Results Test, not just visually inspected
- [ ] CDN/WAF configuration has been explicitly checked for AI crawler blocking
- [ ] Mobile verification covers content parity, not just responsive layout
- [ ] The Section 8 unified checklist passes — all items checked off

## Common mistakes to avoid

- **Blocking AI crawlers accidentally.** CDN bot management features (Cloudflare Bot Fight Mode, AWS WAF bot control) often block AI crawlers by default. This is the most common cause of zero AI visibility for sites that should have it. Check CDN settings and verify with server logs.
- **Using lab data only for Core Web Vitals.** Lighthouse on a developer's MacBook Pro with fiber internet does not represent real users. Field data from CrUX is what Google uses for ranking. If your Lighthouse score is 95 but field LCP is 4.2s, you have a problem.
- **Fixing technical SEO without a content strategy.** A perfectly crawlable, fast, mobile-friendly site with thin content will not rank or get cited. Technical SGEO removes barriers — content quality and topical authority drive actual visibility. Pair this skill with content-sgeo and on-page-sgeo.
- **Ignoring JavaScript rendering.** 60% of Google searches result in zero clicks — users get answers directly from search results and AI systems. If your content is invisible without JavaScript execution, it is invisible to most of the discovery ecosystem. Verify rendered HTML.
- **Treating llms.txt as a priority over fundamentals.** Adding an llms.txt file before fixing crawlability, rendering, and structured data is optimizing the wrong layer. Current research shows no measurable impact from llms.txt. Implement the fundamentals first.
- **Setting up robots.txt once and never reviewing it.** New AI crawlers appear regularly. CDN providers update their bot management rules. Review robots.txt and CDN bot settings quarterly.
- **Implementing structured data that does not match visible content.** Marking up a product with a 4.5-star rating in schema when the page shows 3.2 stars is a manual action risk. Schema must reflect exactly what users see on the page.
- **Submitting sitemaps with non-canonical or non-200 URLs.** Every URL in the sitemap should return 200 and have a self-referencing canonical. Including redirects, 404s, or non-canonical URLs wastes crawl budget and sends conflicting signals.

## Available scripts

Run these scripts to automate technical checks. Each script outputs JSON. Use `scripts/inventory-tools.py` first to detect available tools — all scripts fall back to free methods (WebFetch/WebSearch) when paid tools are unavailable.

| Script | What it checks | Run it when |
|--------|----------------|-------------|
| `inventory-tools.py` | Available tools/APIs/MCPs | First — before any other script |
| `check-robots-txt.py` | robots.txt rules for search + AI bots | Starting any technical audit |
| `validate-sitemap.py` | XML sitemap structure and URL status | Starting any technical audit |
| `check-cwv.py` | Core Web Vitals via PageSpeed Insights | Evaluating page performance |
| `check-structured-data.py` | JSON-LD schema validation | Checking structured data implementation |
| `check-https-security.py` | HTTPS, redirects, HSTS, mixed content | Verifying security baseline |
| `check-ai-crawler-access.py` | AI bot accessibility (CDN/WAF blocking) | Diagnosing zero AI visibility |
| `check-mobile.py` | Mobile viewport, tap targets, content parity | Checking mobile-first readiness |
| `check-indexation.py` | Indexed pages vs sitemap count | Diagnosing indexation gaps |
| `check-redirect-chains.py` | Redirect chain length and status codes | Finding redirect issues |
