Debugging

When something goes wrong, Yosoi gives you a few tools to figure out what happened and where. This guide covers the --debug flag, common failure modes, and how to recover.

The --debug Flag

Pass --debug (or -d) to the CLI, or set debug=True on the Pipeline constructor, to save a snapshot of the HTML that was sent to the LLM:

uv run yosoi --url https://qscrape.dev --contract Product --debug

pipeline = ys.Pipeline(ys.auto_config(), contract=Product, debug=True)

Snapshots are saved to .yosoi/debug_html/ as plain HTML files, named by domain and timestamp. Open them in a browser to see exactly what the LLM was working with.
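
When many snapshots have accumulated, a small helper can pick out the newest one. The sketch below is not part of Yosoi: the function name latest_snapshot is hypothetical, and it assumes snapshots carry an .html extension, which the description above suggests but does not state.

```python
from pathlib import Path
from typing import Optional


def latest_snapshot(debug_dir: str = ".yosoi/debug_html") -> Optional[Path]:
    """Return the most recently modified .html file in debug_dir, or None."""
    directory = Path(debug_dir)
    if not directory.is_dir():
        return None
    # Assumption: snapshots end in .html; adjust the glob pattern if they do not.
    snapshots = list(directory.glob("*.html"))
    if not snapshots:
        return None
    # Modification time is used instead of parsing the timestamp in the name,
    # because the exact naming scheme is not documented here.
    return max(snapshots, key=lambda path: path.stat().st_mtime)
```

Calling latest_snapshot() from the project root returns a Path you can open in a browser, or None if no snapshot has been written yet.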

Common Failure Modes

1. API Key Invalid or Missing

Symptoms: An authentication error from the LLM provider immediately after discovery starts.

Fix: Check your .env file. The key name must match the provider format: GROQ_KEY, GEMINI_KEY, OPENAI_KEY, etc. Run with a simple test URL to confirm the key works before debugging anything else.
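
A quick way to rule out a missing key is to check which of the expected variables are actually set. This is a minimal sketch under stated assumptions: configured_keys is a hypothetical helper, the key names are only the three examples listed above, and it inspects the process environment rather than reading .env itself, so values that live only in .env must be exported or loaded first.

```python
import os

# Key names taken from the examples above; extend the tuple for other providers.
PROVIDER_KEYS = ("GROQ_KEY", "GEMINI_KEY", "OPENAI_KEY")


def configured_keys(env=None):
    """Return the provider key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    # A key that is present but blank is treated the same as a missing key.
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]
```

An empty list from configured_keys() means no provider key is visible to the process, which would explain an immediate authentication error.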

2. Target URL Is Inaccessible

Symptoms: Empty or near-empty debug HTML. The LLM returns selectors that don’t match anything.

Fix: Open the URL in a browser. If the page requires JavaScript to render, Yosoi’s static fetcher won’t see the content. Render the page with a browser automation tool (Playwright, Selenium) and pass the resulting HTML to Yosoi, or wait for the upcoming DOM-enabled scraping feature (see Roadmap).
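
To judge whether a debug snapshot is "near-empty", it can help to measure how much visible text it contains. The following is a rough heuristic using only the standard library, not a Yosoi feature: the names visible_text_length and looks_js_rendered are hypothetical, and the 200-character threshold is an arbitrary assumption to tune per site.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Count the characters of visible text in an HTML document."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Guess that a page depends on JavaScript if it has very little visible text."""
    # Assumption: 200 characters is an arbitrary cut-off, not a Yosoi setting.
    return visible_text_length(html) < min_chars
```

A snapshot that is mostly script tags around an empty container will score close to zero, which points to a page that needs rendering before extraction.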

3. Wrong Root Selector

Symptoms: Multi-item extraction yields one giant item (root too broad) or many items with mostly None fields (root too narrow).

Fix: Pin the root on your contract after inspecting the page source:

class Product(ys.Contract):
    root = ys.css('article.product')  # pin the correct container
    name: str = ys.Title()
    price: float = ys.Price()

See E-Commerce Catalogue: Automatic vs. Pinned Root for a detailed walkthrough.
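
Before pinning a root, it is worth counting how many elements a candidate selector would match in the saved HTML. The sketch below handles only the simple tag.class form (such as article.product) with the standard library; count_root_matches is a hypothetical helper, not a Yosoi function, and a full CSS selector engine would be needed for anything more complex.

```python
from html.parser import HTMLParser


class _TagClassCounter(HTMLParser):
    """Count start tags with a given name that carry a given class."""

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be absent or valueless, hence the fallback.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_root_matches(html: str, tag: str, css_class: str) -> int:
    """Return how many <tag class="... css_class ..."> elements appear in html."""
    counter = _TagClassCounter(tag, css_class)
    counter.feed(html)
    counter.close()
    return counter.count
```

A count of 1 on a listing page suggests the root is too broad, while a count far above the number of visible items suggests it is too narrow.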

4. Stale Selector Cache

Symptoms: Extraction used to work but now returns None for some or all fields, typically because the target site has been redesigned.

Fix: Force re-discovery to clear the cached selectors for that domain:

uv run yosoi --url https://qscrape.dev --contract Product --force

pipeline = ys.Pipeline(ys.auto_config(), contract=Product, force=True)

You can also manually delete the cache file from .yosoi/selectors/ and run again.
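
Manual deletion can also be scripted. This sketch assumes cache files are named <domain>.json, as the qscrape.dev.json example elsewhere in this guide suggests; clear_cached_selectors is a hypothetical helper, and how Yosoi treats prefixes such as www. is not covered here.

```python
from pathlib import Path
from urllib.parse import urlparse


def clear_cached_selectors(url: str, cache_dir: str = ".yosoi/selectors") -> bool:
    """Delete the cache file for the URL's domain. Return True if one was removed."""
    domain = urlparse(url).hostname
    if not domain:
        raise ValueError(f"Could not extract a domain from {url!r}")
    # Assumption: one JSON file per domain, named <domain>.json.
    cache_file = Path(cache_dir) / f"{domain}.json"
    if cache_file.exists():
        cache_file.unlink()
        return True
    return False
```

A return value of False means there was nothing cached for that domain, so a stale cache is unlikely to be the cause.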

5. Context Window Overflow

Symptoms: The LLM returns truncated, garbled, or missing selectors. Large pages with deeply nested HTML are most susceptible.

Fix: There is no built-in mitigation yet (see Smart Batching on the roadmap). Current workarounds:

  • Use a model with a larger context window (e.g. Gemini models with context windows of 1M+ tokens)
  • Trim the HTML yourself before passing it in
  • Target a more specific page (e.g. a category page instead of the homepage)
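
For the "trim the HTML yourself" workaround, one possible approach is to drop content that rarely helps selector discovery. The sketch below is regex-based and deliberately crude; trim_html and rough_token_estimate are hypothetical names, the 4-characters-per-token ratio is only a rule of thumb that varies by model, and how the trimmed HTML is then handed to Yosoi is not shown here.

```python
import re

# Elements whose contents rarely matter for selector discovery (an assumption).
_DROP_BLOCKS = re.compile(
    r"<(script|style|noscript|svg)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_WHITESPACE = re.compile(r"\s+")


def trim_html(html: str) -> str:
    """Remove comments and script/style/noscript/svg blocks, then collapse whitespace."""
    html = _COMMENTS.sub("", html)
    html = _DROP_BLOCKS.sub("", html)
    # Collapsing whitespace also alters <pre> content, which is acceptable here
    # because only the document structure matters for selectors.
    return _WHITESPACE.sub(" ", html).strip()


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count; real tokenizers will give different numbers."""
    return int(len(text) / chars_per_token)
```

Comparing rough_token_estimate before and after trimming gives a sense of whether the page is likely to fit in the model's context window.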

6. Coercion / Validation Errors

Symptoms: Pydantic ValidationError with field-level details. The selector matched, but the extracted text couldn’t be coerced to the field’s type.

Fix: Run with --debug to see the raw extracted values. Common causes:

  • A float field receives text like "$12.99" but the coercion type doesn’t strip currency. Use ys.Price() instead of a bare float field.
  • A field expecting a single value receives a list (or vice versa). Check whether the field should be list[str] or str.
  • A custom type’s coerce function doesn’t handle edge cases. Raise a descriptive ValueError from it so the failing input shows up in the error.
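
As an illustration of the last point, a coerce function can report the offending input in its error message. This is a plain-Python sketch, not Yosoi's API: coerce_price is a hypothetical function, how custom types are registered is not covered here, and it assumes commas are thousands separators.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def coerce_price(raw: str) -> float:
    """Parse a price such as '$12.99' or '1,299.00 USD' into a float."""
    # Assumption: commas are thousands separators (US style). A European
    # '12,99' would be misread, so adapt this for other locales.
    cleaned = raw.replace(",", "").strip()
    match = _NUMBER.search(cleaned)
    if match is None:
        # Include the raw value so the validation error says what was extracted.
        raise ValueError(f"Cannot parse a price from {raw!r}: no numeric part found")
    return float(match.group())
```

With a message like this, the field-level details in the ValidationError show the text the selector actually matched, such as 'Out of stock'.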

Inspecting the Selector Cache

Cached selectors live in .yosoi/selectors/ as JSON files, one per domain. You can read them directly to see what the LLM discovered:

python -m json.tool .yosoi/selectors/qscrape.dev.json
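
The same file can be read from Python when you want to compare or log selectors. This sketch assumes the file is a JSON object keyed by field name and named <domain>.json; load_cached_selectors and format_selectors are hypothetical helpers, and values are printed as-is because the exact shape of each entry may vary.

```python
import json
from pathlib import Path


def load_cached_selectors(domain: str, cache_dir: str = ".yosoi/selectors") -> dict:
    """Load the cached selectors for a domain from <cache_dir>/<domain>.json."""
    cache_file = Path(cache_dir) / f"{domain}.json"
    with cache_file.open(encoding="utf-8") as fh:
        return json.load(fh)


def format_selectors(selectors: dict) -> list:
    """Return one 'field: value' line per entry, sorted by field name."""
    # Assumption: the top level is an object keyed by field name.
    return [f"{field}: {value!r}" for field, value in sorted(selectors.items())]
```

Printing the lines from format_selectors(load_cached_selectors("qscrape.dev")) gives a compact view that is easy to diff between runs.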

Each entry maps a field name to a selector string. If a selector looks wrong, you have two options:

  1. Edit the cache file directly and re-run extraction (no LLM call needed)
  2. Pin the selector on the contract with yosoi_selector and force re-discovery

from pydantic import Field

class Product(ys.Contract):
    name: str = Field(
        description='Product name',
        json_schema_extra={'yosoi_selector': {'primary': 'h2.product-title'}},
    )

Checklist

When something breaks, work through this in order:

  1. Run with --debug and inspect .yosoi/debug_html/
  2. Open the target URL in a browser — is the content visible without JavaScript?
  3. Check .yosoi/selectors/ — do the cached selectors look reasonable?
  4. Try --force to re-discover from scratch
  5. Pin the root if multi-item extraction is off
  6. Switch to a larger-context model if the page is very large

References

Playwright. Microsoft. Browser automation library for end-to-end testing and web scraping. https://playwright.dev/python/

Selenium. OpenQA. Browser automation framework for web testing. https://www.selenium.dev/

Gemini. Google. Large language model family with extended context windows. https://ai.google.dev/

Pydantic. Pydantic Services Inc. Data validation library for Python. https://docs.pydantic.dev/