Debugging
When something goes wrong, Yosoi gives you a few tools to figure out what happened and where. This guide covers the --debug flag, common failure modes, and how to recover.
The --debug Flag
Pass --debug (or -d) to the CLI, or set debug=True on the Pipeline constructor, to save a snapshot of the HTML that was sent to the LLM:
    uv run yosoi --url https://qscrape.dev --contract Product --debug

    pipeline = ys.Pipeline(ys.auto_config(), contract=Product, debug=True)

Snapshots are saved to .yosoi/debug_html/ as plain HTML files, named by domain and timestamp. Open them in a browser to see exactly what the LLM was working with.
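When several snapshots have accumulated, it helps to find the most recent one quickly. The following is a minimal sketch, not part of Yosoi: `newest_snapshot` is a hypothetical helper that picks the latest file by modification time, so it makes no assumption about the exact filename format.

```python
from pathlib import Path
from typing import Optional


def newest_snapshot(debug_dir: str = ".yosoi/debug_html") -> Optional[Path]:
    """Return the most recently modified file in debug_dir, or None if there is none."""
    root = Path(debug_dir)
    if not root.is_dir():
        return None
    files = [p for p in root.iterdir() if p.is_file()]
    # Sort by modification time rather than by name, since the exact
    # filename format is not assumed here.
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    snapshot = newest_snapshot()
    if snapshot is None:
        print("No snapshots found; was the run started with --debug?")
    else:
        print(f"{snapshot} ({snapshot.stat().st_size} bytes)")
```

A very small byte count is an early hint that the fetcher received little or no content.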
Common Failure Modes
1. API Key Invalid or Missing
Symptoms: An authentication error from the LLM provider immediately after discovery starts.
Fix: Check your .env file. The key name must match the provider format: GROQ_KEY, GEMINI_KEY, OPENAI_KEY, etc. Run with a simple test URL to confirm the key works before debugging anything else.
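A quick way to rule out a missing key is to check which of the expected names are actually set. This sketch is an illustration only: `configured_keys` is a hypothetical helper, it reads the process environment rather than parsing .env, and it confirms presence, not validity.

```python
import os

# Key names listed in this guide; extend with whichever provider you use.
PROVIDER_KEYS = ("GROQ_KEY", "GEMINI_KEY", "OPENAI_KEY")


def configured_keys(env=os.environ, names=PROVIDER_KEYS):
    """Return the names from `names` that are set to a non-blank value in `env`."""
    return [name for name in names if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_keys()
    print("Keys set:", ", ".join(found) if found else "none")
```

If the key you expect is absent from the output, the variable is not exported in the shell that launches Yosoi.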
2. Target URL Is Inaccessible
Symptoms: Empty or near-empty debug HTML. The LLM returns selectors that don’t match anything.
Fix: Open the URL in a browser. If it requires JavaScript to render, Yosoi’s static fetcher won’t see the content. You need to pass pre-rendered HTML via a browser automation tool (Playwright△, Selenium○) or wait for the upcoming DOM-enabled scraping feature (see Roadmap).
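To judge whether a snapshot is "near-empty" without opening it, you can count the text a reader would see, ignoring scripts and styles. The sketch below uses only the standard library; `visible_text_length` is a hypothetical helper, and what counts as "too little" text is a judgment call for your site.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of characters of visible text in an HTML string."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)
```

A page that is large in bytes but has almost no visible text is probably rendered client-side, which matches the symptoms described above.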
3. Wrong Root Selector
Symptoms: Multi-item extraction yields one giant item (root too broad) or many items with mostly None fields (root too narrow).
Fix: Pin the root on your contract after inspecting the page source:
    class Product(ys.Contract):
        root = ys.css('article.product')  # pin the correct container
        name: str = ys.Title()
        price: float = ys.Price()

See E-Commerce Catalogue: Automatic vs. Pinned Root for a detailed walkthrough.
4. Stale Selector Cache
Symptoms: Extraction used to work but now returns None for some or all fields. The target site has been redesigned.
Fix: Force re-discovery to clear the cached selectors for that domain:
    uv run yosoi --url https://qscrape.dev --contract Product --force

    pipeline = ys.Pipeline(ys.auto_config(), contract=Product, force=True)

You can also manually delete the cache file from .yosoi/selectors/ and run again.
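The manual deletion can be scripted if you clear caches often. This is a sketch under one assumption: that the cache holds one file per domain named `<domain>.json` (for example qscrape.dev.json). `clear_cached_selectors` is a hypothetical helper, not a Yosoi function.

```python
from pathlib import Path


def clear_cached_selectors(domain: str, cache_dir: str = ".yosoi/selectors") -> bool:
    """Delete <cache_dir>/<domain>.json. Return True if a file was removed."""
    # Assumption: one cache file per domain, named "<domain>.json".
    cache_file = Path(cache_dir) / f"{domain}.json"
    if cache_file.is_file():
        cache_file.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = clear_cached_selectors("qscrape.dev")
    print("Cache cleared" if removed else "No cache file found for that domain")
```

The next run for that domain should then trigger discovery again, the same as passing --force.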
5. Context Window Overflow
Symptoms: The LLM returns truncated, garbled, or missing selectors. Large pages with deeply nested HTML are most susceptible.
Fix: There is no built-in mitigation yet (see Smart Batching on the roadmap). Current workarounds:
- Use a model with a larger context window (e.g. Gemini◑ models with 1M+ tokens)
- Trim the HTML yourself before passing it in
- Target a more specific page (e.g. a category page instead of the homepage)
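For the "trim the HTML yourself" workaround, one rough approach is to drop elements that rarely help selector discovery. The sketch below is an assumption about what is safe to remove, not Yosoi's behavior; `trim_html` is a hypothetical helper, and regular expressions are a blunt tool for HTML, so inspect the result before relying on it.

```python
import re

# Elements whose contents rarely help selector discovery (an assumption).
_NOISE_BLOCKS = re.compile(
    r"<(script|style|noscript|svg)\b[^>]*>.*?</\1\s*>",
    flags=re.IGNORECASE | re.DOTALL,
)
_COMMENTS = re.compile(r"<!--.*?-->", flags=re.DOTALL)
_BLANK_RUNS = re.compile(r"\n\s*\n+")


def trim_html(html: str) -> str:
    """Remove comments and noisy elements, then collapse runs of blank lines."""
    html = _COMMENTS.sub("", html)
    html = _NOISE_BLOCKS.sub("", html)
    return _BLANK_RUNS.sub("\n", html).strip()
```

Element structure, class names, and ids are left untouched, since those are what selectors depend on.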
6. Coercion / Validation Errors
Symptoms: Pydantic◇ ValidationError with field-level details. The selector matched, but the extracted text couldn’t be coerced to the field’s type.
Fix: Run with --debug to see the raw extracted values. Common causes:
- A float field receives text like "$12.99" but the coercion type doesn't strip currency. Use ys.Price() instead of a bare float field.
- A field expecting a single value receives a list (or vice versa). Check whether the field should be list[str] or str.
- A custom type's coerce function doesn't handle edge cases. Add a descriptive ValueError to surface the issue.
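As an illustration of the last point, here is what a coercion function with a descriptive error might look like. `coerce_price` is a hypothetical standalone function; how a custom type is registered with Yosoi is not shown here, and the comma handling assumes commas are thousands separators.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def coerce_price(raw: str) -> float:
    """Parse text such as "$12.99" into a float, or raise a descriptive ValueError."""
    # Assumption: commas are thousands separators, not decimal marks.
    cleaned = raw.replace(",", "").strip()
    match = _NUMBER.search(cleaned)
    if match is None:
        # Include the offending value so the failure is easy to trace.
        raise ValueError(f"cannot coerce {raw!r} to a price: no number found")
    return float(match.group())
```

Including the raw value in the message means the validation error itself shows what the selector extracted.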
Inspecting the Selector Cache
Cached selectors live in .yosoi/selectors/ as JSON files, one per domain. You can read them directly to see what the LLM discovered:
    cat .yosoi/selectors/qscrape.dev.json | python -m json.tool

Each entry maps a field name to a selector string. If a selector looks wrong, you have two options:
- Edit the cache file directly and re-run extraction (no LLM call needed)
- Pin the selector on the contract with yosoi_selector and force re-discovery
    from pydantic import Field

    class Product(ys.Contract):
        name: str = Field(
            description='Product name',
            json_schema_extra={'yosoi_selector': {'primary': 'h2.product-title'}},
        )

Checklist
When something breaks, work through this in order:
- Run with --debug and inspect .yosoi/debug_html/
- Open the target URL in a browser and check whether the content is visible without JavaScript
- Check .yosoi/selectors/ and confirm the cached selectors look reasonable
- Try --force to re-discover from scratch
- Pin the root if multi-item extraction is off
- Switch to a larger-context model if the page is very large
References
△ Playwright. Microsoft. Browser automation library for end-to-end testing and web scraping. https://playwright.dev/python/
○ Selenium. OpenQA. Browser automation framework for web testing. https://www.selenium.dev/
◑ Gemini. Google. Large language model family with extended context windows. https://ai.google.dev/
◇ Pydantic. Pydantic Services Inc. Data validation library for Python. https://docs.pydantic.dev/