
Introduction

Yosoi means You Only Scrape Once (iteratively). The core idea: pay for LLM-based selector discovery once per domain, then use cached selectors on later scrape runs until something breaks.

Yosoi treats every selector as cacheable state. When a site redesigns and cached selectors go stale, Yosoi detects the breakage, re-discovers only the fields that changed, and merges the fresh selectors back into the cache. You pay again only for the partial diff, not for the whole contract.
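The partial-rediscovery idea can be sketched in a few lines. This is a minimal illustration, not Yosoi's implementation: the names `stale_fields`, `refresh_cache`, and `fake_discover` are hypothetical, and the rule "an empty extraction means the selector is stale" is an assumption made for the example.

```python
from typing import Callable, Dict, List, Optional

Selectors = Dict[str, str]            # field name -> CSS selector
Extracted = Dict[str, Optional[str]]  # field name -> value, None if nothing matched


def stale_fields(extracted: Extracted) -> List[str]:
    """Fields whose cached selector matched nothing on the current page."""
    return [name for name, value in extracted.items() if not value]


def refresh_cache(
    cached: Selectors,
    extracted: Extracted,
    discover: Callable[[List[str]], Selectors],
) -> Selectors:
    """Rediscover only the stale fields and merge them over the cached ones."""
    broken = stale_fields(extracted)
    if not broken:
        return cached  # cache hit: no discovery call at all
    fresh = discover(broken)  # the only paid step, scoped to the broken fields
    return {**cached, **fresh}


# Hypothetical stand-in for the LLM call; it records what it was asked for.
requested: List[List[str]] = []


def fake_discover(fields: List[str]) -> Selectors:
    requested.append(fields)
    return {name: f".new-{name}" for name in fields}


cached = {"title": "h1.product", "price": "span.price", "rating": "div.stars"}
# After a redesign, only the price selector stops matching.
extracted = {"title": "Blue Mug", "price": None, "rating": "4.5"}

updated = refresh_cache(cached, extracted, fake_discover)
```

Here the stand-in discovery function is asked about `price` only, and the `title` and `rating` selectors are carried over from the cache unchanged.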

Most web scraping tools make you choose between cost and reliability. LLM-per-request APIs are accurate but expensive at scale. Regex heuristics are cheap but brittle. Yosoi takes a different approach: use an LLM to discover selectors, cache the result, and avoid repeat LLM calls on cache hits.

Scraping APIs: expensive at scale, easy to use
Custom scraping: cheap at scale, effortful and flaky
Yosoi: cheap at scale, easy to use

Technical Notes

Pydantic is the foundation for contract definitions and data validation. You declare what you want to extract as a typed Pydantic model. Yosoi handles the rest.
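As a rough illustration of what a typed contract looks like, the sketch below uses plain Pydantic only. The `Product` model and its fields are hypothetical, and Yosoi may expect its own base class or field helpers, so treat this as a picture of the validation behaviour and not of Yosoi's contract API.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    """Hypothetical contract: the fields we want from a product page."""

    title: str
    price: float
    rating: Optional[float] = None  # optional: may be absent on some pages


# Raw strings, as they might come out of a page, are coerced to the declared types.
product = Product(title="Blue Mug", price="19.99")

# Values that cannot be coerced are rejected instead of passing through silently.
try:
    Product(title="Blue Mug", price="not a number")
    rejected = False
except ValidationError:
    rejected = True
```

Declaring the types up front means a broken selector that returns junk text surfaces as a validation error, not as bad data downstream.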

Async native. The scraping pipeline is built on asyncio from the ground up. scrape() is an async generator; process_urls() is fully non-blocking.
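For readers new to async generators, the consumption pattern looks roughly like this. `scrape_stub` is a hypothetical stand-in written for this example; it does not reflect the real signature or return type of Yosoi's scrape().

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def scrape_stub(urls: List[str]) -> AsyncIterator[Dict[str, str]]:
    """Hypothetical stand-in: yields one record per URL as it becomes ready."""
    for url in urls:
        await asyncio.sleep(0)  # placeholder for a non-blocking fetch and extract
        yield {"url": url, "title": f"Title for {url}"}


async def main() -> List[Dict[str, str]]:
    records = []
    # `async for` takes results one at a time without blocking the event loop.
    async for record in scrape_stub(["https://example.com/a", "https://example.com/b"]):
        records.append(record)
    return records


records = asyncio.run(main())
```

Because results are yielded as they arrive, a caller can start writing records out before the whole batch has finished.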

Concurrent background workers. Pipeline.process_urls() dispatches URLs as independent tasks via a TaskIQ in-memory broker, distributing work across a configurable worker pool. Discovery and extraction can run in parallel with no external queue required.
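The worker-pool pattern can be approximated with the standard library alone. The sketch below swaps in a plain `asyncio.Queue` for the TaskIQ in-memory broker, so it shows the general shape of the idea and not Yosoi's internals; `handle`, `worker`, and `process_all` are hypothetical names.

```python
import asyncio
from typing import Dict, List


async def handle(url: str) -> Dict[str, str]:
    """Hypothetical per-URL task: placeholder for discovery or extraction."""
    await asyncio.sleep(0.01)
    return {"url": url, "status": "ok"}


async def worker(queue: "asyncio.Queue[str]", results: List[Dict[str, str]]) -> None:
    """Pull URLs off the queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            results.append(await handle(url))
        finally:
            queue.task_done()


async def process_all(urls: List[str], workers: int = 3) -> List[Dict[str, str]]:
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    results: List[Dict[str, str]] = []
    for url in urls:
        queue.put_nowait(url)
    tasks = [asyncio.create_task(worker(queue, results)) for _ in range(workers)]
    await queue.join()  # wait until every URL has been handled
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(process_all(urls))
```

Results arrive in completion order, not submission order, so callers that need ordering should key on the URL.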

How It Works

  1. You provide a URL and a data contract (a typed Pydantic model describing what to extract)
  2. Yosoi fetches the page and sends it to an LLM
  3. The LLM identifies stable selectors for each field in the contract
  4. Selectors are validated and stored in .yosoi/selectors/
  5. Future scrapes use the cached selectors directly, with no LLM call needed
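The first-run versus later-run split in the steps above can be sketched as a small cache lookup. Everything here is illustrative: `selectors_for` and `fake_discover` are hypothetical, and the one-JSON-file-per-domain layout is an assumption, not a description of what Yosoi writes under .yosoi/selectors/.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List
from urllib.parse import urlparse

Selectors = Dict[str, str]


def selectors_for(
    url: str, cache_dir: Path, discover: Callable[[str], Selectors]
) -> Selectors:
    """Return cached selectors for the URL's domain, discovering them on a miss."""
    domain = urlparse(url).netloc
    cache_file = cache_dir / f"{domain}.json"  # assumed layout: one file per domain
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # later runs: no discovery call
    selectors = discover(url)  # first run: the paid discovery step
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(selectors))
    return selectors


discovery_calls: List[str] = []  # URLs that triggered discovery


def fake_discover(url: str) -> Selectors:
    """Hypothetical stand-in for fetch, LLM discovery, and validation."""
    discovery_calls.append(url)
    return {"title": "h1", "price": "span.price"}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "selectors"
    first = selectors_for("https://shop.example.com/item/1", cache_dir, fake_discover)
    second = selectors_for("https://shop.example.com/item/2", cache_dir, fake_discover)
```

Both URLs share a domain, so only the first call reaches the stand-in discovery function; the second is served from the file written by the first.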

Browser-backed fetches can also run DOMLoader, replay A3Node stability recipes, surface files through ys.File, and pass accessibility-tree hints into discovery before escalating to MCP.

First run: Define Contract -> Fetch HTML -> AI Discovery -> Verify & Cache -> Extract Data -> Structured Output
Every run after: Define Contract -> Fetch HTML -> Load Cached Selectors -> Extract Data -> Structured Output (no LLM needed, cached selectors)

FAQs

What is Yosoi?

Yosoi stands for You Only Scrape Once (iteratively). Instead of calling an LLM on every scrape or relying on brittle regex heuristics, Yosoi discovers selectors automatically the first time and caches them. Cache hits avoid repeat LLM calls. When selectors go stale, Yosoi re-discovers only the broken fields, not the whole contract, so costs stay bounded as sites change.

How does Yosoi work?

You give it a URL and a data contract. Yosoi fetches the HTML, sends it to an LLM to identify stable CSS selectors for each field in the contract, validates them, and caches the result. Every subsequent scrape of that domain uses the cached selectors directly, with no LLM involved, until a selector goes stale.

Why was Yosoi built?

For internal use at Cascading Labs. Every other API or LLM-per-request approach would have cost hundreds of thousands of dollars at scale. Yosoi was the only economically viable path, and we figured others had the same problem.

Does Yosoi work with JavaScript-rendered sites?

Yes. Use the waterfall, headless, or headful fetcher. Browser-backed fetching runs through VoidCrawl, captures rendered HTML, and can feed accessibility-tree hints into selector discovery.

What happens if a site redesigns its layout?

Some cached selectors will likely break. Yosoi detects stale selectors at scrape time and runs partial rediscovery. Only the broken fields are sent back to the LLM. The fresh selectors are merged into the existing cache. You can also force a full rediscovery by deleting the domain’s file from .yosoi/selectors/ or passing --force.

Which LLM provider gives the best results?

Results are generally consistent across providers. Larger context windows and stronger models usually help on difficult pages. Yosoi supports many providers, so run discovery with --model to compare on your target sites.

References

Pydantic. Pydantic Services Inc. Data validation library for Python. https://docs.pydantic.dev/

asyncio. Python Software Foundation. Asynchronous I/O framework in the Python standard library. https://docs.python.org/3/library/asyncio.html

TaskIQ. TaskIQ Contributors. Distributed task queue for Python with asyncio support, used internally as the broker for concurrent URL processing. https://taskiq-python.github.io/