If something isn’t shipped yet, you’ll find it in one of two places: a GitHub issue or this roadmap.
To follow progress on a specific feature, the GitHub issue is the most granular source. For a high-level picture of where the project is headed, start here.
Proxy Support
First-class proxy configuration for routing requests through dedicated residential or datacenter providers. See the Proxying page for ethical guidance and planned scope.
Scaling & Distributed Pipelines
QScrape L2 Sites
Distributed Selector Cache
Versioned Docs
Per-release documentation snapshots so older API versions remain accessible.
Conda Forge Distribution
Publish Yosoi to Conda Forge to make installation as easy as possible for users in the conda ecosystem.
DOM-Enabled Scraping (A3Nodes)
A DAG and node-based pipeline architecture, actively in development, that will let Yosoi drive a real browser (Zendriver△) to fetch and render JavaScript-heavy pages. L2 sites (SPAs, framework-rendered apps) will no longer require you to run a browser automation tool yourself before passing HTML in.
This introduces a new class of selector that can capture DOM assertions, actions, and arrangements (A3) without needing a bulky LLM to drive.
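One way to picture a DAG of nodes is a set of small functions run in dependency order. The sketch below is illustrative only: the node names (`fetch`, `render`, `assert_dom`, `extract`), the shared-state dict, and `run_pipeline` are assumptions rather than Yosoi's API, and the render step fakes browser output instead of calling Zendriver.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each reads the shared state and returns the keys it adds.
def fetch(state):
    return {"html": "<div id='app'></div>"}

def render(state):
    # Stand-in for a real browser render; injects the content JS would produce.
    return {"dom": state["html"].replace("</div>", "<h1>Title</h1></div>")}

def assert_dom(state):
    # A DOM assertion: the page counts as ready once the heading exists.
    return {"ready": "<h1>" in state["dom"]}

def extract(state):
    if not state["ready"]:
        raise RuntimeError("DOM assertion failed; nothing to extract")
    dom = state["dom"]
    start = dom.index("<h1>") + len("<h1>")
    return {"title": dom[start:dom.index("</h1>")]}

NODES = {"fetch": fetch, "render": render, "assert_dom": assert_dom, "extract": extract}

# Each node maps to the set of nodes it depends on.
EDGES = {"render": {"fetch"}, "assert_dom": {"render"}, "extract": {"assert_dom"}}

def run_pipeline(nodes, edges):
    """Run every node once, in an order that respects the dependency edges."""
    state = {}
    order = list(TopologicalSorter(edges).static_order())
    for name in order:
        state.update(nodes[name](state))
    return state, order

state, order = run_pipeline(NODES, EDGES)
print(order, state["title"])
```

Keeping the graph as plain data means a node can be added or swapped without touching the runner.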
Canonicalization Engine
A mode that merges selector caches across domains that serve the same content. If two sources (e.g. thewallstreetjournal.com and wsj.com) publish equivalent data, enabling canonicalization on a contract would collapse them into a single canonical selector set, reducing redundant discovery and keeping extraction consistent. Implementation would likely involve tracing or logging logic to detect overlapping field mappings, then coercing the duplicate into the canonical domain’s cache. May warrant a separate repo/package.
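The merge step could look roughly like the sketch below. The alias table, the cache shape (domain → field → selector), and both function names are assumptions made for illustration. The detection of overlapping field mappings is not modeled; the alias table is simply given.

```python
from urllib.parse import urlsplit

# Hypothetical alias table: duplicate domain -> canonical domain.
CANONICAL = {"thewallstreetjournal.com": "wsj.com"}

def canonical_domain(url: str) -> str:
    """Return the canonical domain for a URL, dropping a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return CANONICAL.get(host, host)

def merge_caches(cache: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Collapse alias-domain selectors into the canonical domain's entry.

    Canonical domains are processed first, so their selectors win on
    conflict; aliases only contribute fields the canonical entry lacks.
    """
    merged: dict[str, dict[str, str]] = {}
    ordered = sorted(cache.items(), key=lambda item: item[0] in CANONICAL)
    for domain, selectors in ordered:
        target = merged.setdefault(CANONICAL.get(domain, domain), {})
        for field, selector in selectors.items():
            target.setdefault(field, selector)
    return merged

cache = {
    "thewallstreetjournal.com": {"title": "h1.hed", "author": ".byline"},
    "wsj.com": {"title": "h1.headline"},
}
print(merge_caches(cache))
```

Letting the canonical entry win on conflict is one possible policy; a real implementation might prefer the most recently verified selector instead.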
Super Crawler
A first-class multi-page crawling mode. Currently Yosoi operates on single pages, with no way to chain multiple linked pages together. Multi-page crawling is a common use case that makes sense to include natively. This would likely require expanding the AI generation step to define stateful “steps” that let the LLM replay the preceding step’s selectors and the JavaScript actions needed to complete a crawl. With DOM-enabled rendering progressing rapidly via the A3Nodes work above, this is becoming increasingly viable.
Smart Batching of Tokens
Currently the pipeline sends N requests (one per contract field), each containing the full HTML, and hopes the LLM’s context window is large enough. This needs to be more defensive. A smart batching or context management system would use a lightweight lxml pre-processor to trim the HTML to a safe-enough, close-enough subset per contract — reducing token cost and avoiding context window overflows.
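A trimming pass of this kind might look like the sketch below. It uses the standard library's `html.parser` rather than lxml so that it runs without dependencies. The list of dropped tags, the kept attributes, and the character budget (a crude proxy for tokens) are all assumptions.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style", "svg", "noscript"}  # assumed to be noise for selector discovery
KEEP_ATTRS = {"id", "class", "href"}                # assumed to be enough to write selectors
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class Trimmer(HTMLParser):
    """Re-emit HTML without noisy subtrees, comments, or bulky attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            kept = "".join(
                f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name in KEEP_ATTRS and value
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only, with runs of whitespace collapsed.
        if self.skip_depth == 0 and data.strip():
            self.out.append(" ".join(data.split()))

def trim_html(html: str, max_chars: int = 20_000) -> str:
    """Strip noise, then hard-cut to a character budget (may cut mid-tag)."""
    parser = Trimmer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)[:max_chars]

page = (
    "<html><head><style>p{}</style><script>var x = 1;</script></head>"
    '<body><!-- ad --><div class="price" data-x="1" style="color:red">  $ 10  </div></body></html>'
)
print(trim_html(page))
```

The hard cut at the end is deliberately naive. A per-contract version would more plausibly keep the subtrees most likely to contain the requested fields.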
Dynamic Schema Builder
A JSON codegen tool that lets users define a schema in JSON and generates a Pydantic contract from it. This lowers the barrier for newcomers who find OOP-style contract definitions (ys.Contract) intimidating but still need the specificity that Pydantic models provide for structured LLM output. Useful for REST and MCP design workflows as well. The Prompt → Contract → Selectors → Content pipeline would gain a new entry point.
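As a rough illustration of the codegen idea, the sketch below turns a small JSON description into Python source for a Pydantic model. The JSON layout (`name`, `fields`, `type`, `required`), the type map, and `generate_contract` are hypothetical. It emits a plain `BaseModel` subclass as a stand-in, because the real `ys.Contract` field API is not described here.

```python
import json

# Assumed mapping from JSON type names to Python annotations.
TYPE_MAP = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}

def generate_contract(schema_json: str) -> str:
    """Return Python source for a Pydantic model described by a JSON schema string."""
    schema = json.loads(schema_json)
    name = schema["name"]
    if not name.isidentifier():
        raise ValueError(f"invalid class name: {name!r}")

    lines = ["from pydantic import BaseModel", "", "", f"class {name}(BaseModel):"]
    fields = schema.get("fields", {})
    if not fields:
        lines.append("    pass")
    for field, spec in fields.items():
        if not field.isidentifier():
            raise ValueError(f"invalid field name: {field!r}")
        py_type = TYPE_MAP[spec["type"]]  # KeyError on an unsupported type
        if spec.get("required", True):
            lines.append(f"    {field}: {py_type}")
        else:
            lines.append(f"    {field}: {py_type} | None = None")
    return "\n".join(lines) + "\n"

source = generate_contract(
    '{"name": "Product", "fields": {'
    '"title": {"type": "string"}, '
    '"price": {"type": "number", "required": false}}}'
)
print(source)
```

Emitting source text rather than building the class at runtime keeps the generated contract reviewable and easy to commit alongside hand-written ones.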
Custom Yosoi SLMs
A custom small language model (SLM) of at most 3 billion parameters, for fully offline, private scraping.
△ Zendriver. cdpdriver. Python-based browser automation using the Chrome DevTools Protocol. https://github.com/cdpdriver/zendriver
○ React. Meta. JavaScript library for building user interfaces. https://react.dev/
◑ Vue. Evan You. Progressive JavaScript framework for building UIs. https://vuejs.org/
◇ Svelte. Rich Harris. Compiler-based frontend framework. https://svelte.dev/
★ Solid. Ryan Carniato. Fine-grained reactive UI library with no virtual DOM. https://www.solidjs.com/
⬡ Redis. Redis Ltd. In-memory data structure store used as a database, cache, and message broker. https://redis.io/docs/
◎ Turso. Turso. Edge-hosted distributed SQLite database. https://turso.tech/