Introduction
Yosoi: You Only Scrape Once (iteratively). The core idea: pay for LLM-based scraping once per domain, then scrape that domain forever using the cached selectors with no further AI costs.
Yosoi takes a pipelined approach, where every selector is treated as a cache. When a site redesigns and cached selectors go stale, Yosoi detects the breakage, re-discovers only the fields that changed, and merges the fresh selectors back into the selector cache. You pay for AI discovery again only on that partial diff, not from scratch. Over time, the cost of keeping selectors current trends toward zero as layouts stabilize.
Most web scraping tools make you choose between cost and reliability. LLM-per-request APIs are accurate but expensive at scale. Regex heuristics are cheap but brittle. Yosoi takes a different approach: use an LLM once per domain to discover selectors, cache the result, and scrape indefinitely at near-zero cost.
Technical Notes
Pydantic △ is the foundation for contract definitions and data validation. You declare what you want to extract as a typed Pydantic model. Yosoi handles the rest.
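A contract is just a typed Pydantic model. A minimal sketch, assuming a hypothetical `Product` contract (the field names and the raw-data shape here are illustrative, not Yosoi's actual API):

```python
from pydantic import BaseModel

# Hypothetical data contract: declare the fields you want extracted.
class Product(BaseModel):
    title: str
    price: float
    in_stock: bool

# Pydantic validates and coerces raw scraped strings into typed fields.
raw = {"title": "Widget", "price": "19.99", "in_stock": "true"}
product = Product(**raw)
print(product.price)  # 19.99
```

The contract doubles as documentation of the extraction target: anything the model declares, the pipeline must produce, and anything that fails validation is surfaced immediately.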
Async native. The scraping pipeline is built on asyncio ○ from the ground up. scrape() is an async generator; process_urls() is fully non-blocking.
Concurrent background workers. Pipeline.process_urls() dispatches URLs as independent tasks via a TaskIQ◑ in-memory broker, distributing work across a configurable worker pool. Discovery and extraction can run in parallel with no external queue required.
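The worker-pool shape can be illustrated with plain asyncio (TaskIQ's in-memory broker provides the same pattern without an external queue); everything below is an illustration of the pattern, not Yosoi's internals:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    # Each worker is an independent task pulling URLs until cancelled.
    while True:
        url = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for fetch + extract
            results.append((name, url))
        finally:
            queue.task_done()

async def process_urls(urls: list[str], workers: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    tasks = [asyncio.create_task(worker(f"w{i}", queue, results)) for i in range(workers)]
    await queue.join()  # block until every queued URL has been processed
    for t in tasks:
        t.cancel()      # workers loop forever; shut them down once drained
    await asyncio.gather(*tasks, return_exceptions=True)
    return results

results = asyncio.run(process_urls([f"https://example.com/{i}" for i in range(5)]))
print(len(results))
```

The pool size bounds concurrency the same way a configurable worker count does in the real pipeline: five URLs, three workers, no external queue process.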
How It Works
- You provide a URL and a data contract (a typed Pydantic model describing what to extract)
- Yosoi fetches the page and sends it to an LLM
- The LLM identifies stable selectors for each field in the contract
- Selectors are validated and stored in .yosoi/selectors/
- Future scrapes use the cached selectors directly, with no LLM call needed
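The discover-once, cache-forever flow can be sketched as follows. The cache file layout and function names here are assumptions for illustration; only the `.yosoi/selectors/` location comes from the docs above, and the LLM call is stubbed:

```python
import json
from pathlib import Path

CACHE_DIR = Path(".yosoi/selectors")  # cache location described above

def discover_selectors(html: str, fields: list[str]) -> dict[str, str]:
    # Stand-in for the LLM call: the only paid step in the flow.
    return {field: f"css-selector-for-{field}" for field in fields}

def selectors_for(domain: str, html: str, fields: list[str]) -> dict[str, str]:
    path = CACHE_DIR / f"{domain}.json"
    if path.exists():
        return json.loads(path.read_text())       # cache hit: no LLM call
    selectors = discover_selectors(html, fields)  # cache miss: pay once
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(selectors))
    return selectors

first = selectors_for("example.com", "<html>...</html>", ["title", "price"])
second = selectors_for("example.com", "<html>...</html>", ["title", "price"])
print(first == second)  # True: second call served from cache
```

Every scrape after the first resolves to a file read, which is where the near-zero marginal cost comes from.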
FAQs
What is Yosoi?
You Only Scrape Once (iteratively). Instead of calling an LLM on every scrape or relying on flaky regex heuristics, Yosoi discovers selectors automatically the first time and caches them. You pay once and extract at near-zero cost from then on. When selectors go stale, Yosoi re-discovers only the broken fields, not the whole contract, so costs stay minimal even as sites change.
How does Yosoi work?
You point it at a URL. Yosoi fetches the HTML, sends it to an LLM to identify stable CSS selectors, validates them, and caches the result. Every subsequent scrape of that domain uses the cached selectors directly with no LLM involved.
Why was Yosoi built?
For internal use at Cascading Labs. Every other API or LLM-per-request approach would have cost hundreds of thousands of dollars at scale. Yosoi was the only economically viable path, and we figured others had the same problem.
Does Yosoi work with JavaScript-rendered sites?
Yosoi fetches static HTML. For sites that require JavaScript to render content, use a headless browser (Playwright, Puppeteer) to render the page first, then pass the HTML to Yosoi. A native DOM fetcher is planned for our beta in mid-2026, but is not yet stable.
What happens if a site redesigns its layout?
Some cached selectors will likely break. Yosoi detects stale selectors at scrape time and runs partial rediscovery — only the broken fields are sent back to the LLM. The fresh selectors are merged into the existing cache. You can also force a full rediscovery by deleting the domain’s file from .yosoi/selectors/ or passing --force.
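A minimal sketch of the stale-detection and partial-rediscovery merge. The selector check and the rediscovery step are stubs for illustration; Yosoi's real validation logic differs:

```python
cached = {"title": "h1.old-title", "price": "span.price", "stock": "div.stock"}

def selector_matches(selector: str, html: str) -> bool:
    # Stub: treat a selector as valid if its class name still appears
    # in the page. Real validation actually runs the selector.
    return selector.split(".")[-1] in html

def rediscover(fields: list[str]) -> dict[str, str]:
    # Stand-in for the LLM call, scoped to the broken fields only.
    return {field: f"new-selector-for-{field}" for field in fields}

html_after_redesign = (
    '<h1 class="hero-title"></h1><span class="price"></span><div class="stock"></div>'
)

stale = [f for f, sel in cached.items() if not selector_matches(sel, html_after_redesign)]
fresh = rediscover(stale)       # only the broken fields go back to the LLM
merged = {**cached, **fresh}    # fresh selectors overwrite stale ones
print(stale)
```

In this example only `title` broke, so only one field is re-discovered while `price` and `stock` keep their cached selectors untouched.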
Which LLM provider gives the best results?
Results are generally consistent across providers. Models with larger context windows and higher parameter counts tend to perform better. We like models hosted on Groq or OpenRouter for simplicity, but more than 25 providers are supported for convenience.
References
△ Pydantic. Pydantic Services Inc. Data validation library for Python. https://docs.pydantic.dev/
○ asyncio. Python Software Foundation. Asynchronous I/O framework in the Python standard library. https://docs.python.org/3/library/asyncio.html
◑ TaskIQ. TaskIQ Contributors. Distributed task queue for Python with asyncio support, used internally as the broker for concurrent URL processing. https://taskiq-python.github.io/