
Quick Start

Install

uv add yosoi

Export a provider key, or add it to a .env file in your project root (see the full list of provider options):

GROQ_KEY=your_groq_api_key
# or GEMINI_KEY, OPENAI_KEY, CEREBRAS_KEY, OPENROUTER_KEY, etc.

Option 1: CLI

Run discovery and extraction from the terminal without writing any Python.

uv run yosoi -m groq:llama-3.3-70b-versatile --url https://qscrape.dev/l1/eshop/catalog/?cat=Forge%20%26%20Smithing --contract Product

Yosoi fetches the page, asks the LLM to identify selectors, validates them, and saves the result to .yosoi/. Run the same command again and it reads from cache.

CLI flags

Flag                    Description
--url, -u               Target URL to discover selectors for
--model, -m             LLM model in provider:model format (e.g. groq:llama-3.3-70b-versatile)
--contract, -C          Contract schema (built-in name or path:Class)
--file, -f              File containing URLs (one per line, or JSON)
--output, -o            Output format: json, csv, md, jsonl, ndjson, xlsx, parquet
--workers, -w           Number of concurrent workers for batch processing
--force, -F             Force re-discovery even if selectors exist
--debug, -d             Save extracted HTML to .yosoi/debug_html/
--selector-level, -x    Max selector strategy: css, xpath, regex, jsonld, all (default: css)
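The --file flag accepts either plain text with one URL per line or a JSON array. A minimal sketch of a parser that handles both shapes (this helper is illustrative; it is not Yosoi's implementation):

```python
import json

def load_urls(text: str) -> list[str]:
    """Parse a --file payload: a JSON array if it looks like one, else one URL per line."""
    stripped = text.strip()
    if stripped.startswith("["):
        return [str(url) for url in json.loads(stripped)]
    # Plain text: one URL per line, blank lines skipped
    return [line.strip() for line in stripped.splitlines() if line.strip()]

print(load_urls('["https://a.example", "https://b.example"]'))
print(load_urls("https://a.example\nhttps://b.example\n"))
```

Both calls return the same two-element list, so a batch file can be written in whichever form is more convenient.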

Option 2: Python script

Use the Python API when you need to process results programmatically.

import asyncio

import yosoi as ys
from pydantic import Field  # Contract fields follow the pydantic Field convention


# This Contract is actually built-in; you can import it from yosoi.models.defaults!
class Product(ys.Contract):
    """Contract for e-commerce product pages."""

    name: str = ys.Title(description='Product name or title')
    price: float | None = ys.Price(description='Product price (including currency symbol)')
    rating: float | None = ys.Rating(as_float=True, description='Star rating or review score (numeric, e.g. 4.2)')
    reviews_count: int | None = Field(description='Number of reviews or ratings')
    description: str = ys.BodyText(description='Product description or summary')
    availability: str = Field(description='Stock status (e.g. "In Stock", "Out of Stock")')


async def main():
    pipeline = ys.Pipeline(ys.auto_config(), contract=Product)
    async for item in pipeline.scrape('https://qscrape.dev/l1/eshop/catalog/?cat=Forge%20%26%20Smithing'):
        print(item.get('name'), item.get('price'))


asyncio.run(main())

Note

As long as your key is set in .env, this example can be run as-is.

Discovery only happens once per domain. Subsequent calls read from the local cache with no LLM calls and no cost.
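Because each item behaves like a dict (note the item.get calls above), results can be post-processed with plain Python. A small sketch using made-up sample records in place of real scrape output:

```python
# Sample records shaped like the dicts the Product contract yields;
# the values here are invented for illustration.
items = [
    {"name": "Anvil", "price": 129.99, "rating": 4.6},
    {"name": "Tongs", "price": 24.50, "rating": None},
    {"name": "Forge Blower", "price": 310.00, "rating": 4.2},
]

# Keep rated products under 200 and sort them cheapest-first
affordable = sorted(
    (i for i in items if i["price"] < 200 and i["rating"] is not None),
    key=lambda i: i["price"],
)
print([i["name"] for i in affordable])  # ['Anvil']
```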

What Just Happened?

Hopefully your terminal explained most of it, but here is the rundown:

  1. Yosoi fetched the static HTML from the QScrape target URL

  2. The HTML and your Product contract were sent to the LLM

  3. The LLM identified CSS selectors for each field in the contract (name, price, rating, etc.)

  4. Selectors were validated against the page and cached in .yosoi/selectors/

  5. Data was extracted using the cached selectors and printed to your terminal

Run the same command again and steps 2 and 3 are skipped entirely. Yosoi reads from the local cache with no LLM call and no cost.
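The caching in steps 4 and 5 amounts to a lookup keyed by domain: discovery runs only on a cache miss. A hedged sketch of that flow — the .yosoi/selectors/ path matches the docs above, but the per-domain JSON file layout and both helpers are assumptions, not Yosoi's actual cache format:

```python
import json
from pathlib import Path
from urllib.parse import urlparse

CACHE_DIR = Path(".yosoi/selectors")  # matches the cache location in the docs

def get_selectors(url, discover, cache_dir=CACHE_DIR):
    """Return selectors for the URL's domain, running discovery only on a cache miss."""
    path = Path(cache_dir) / f"{urlparse(url).netloc}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no LLM call, no cost
    selectors = discover(url)  # the expensive LLM round-trip (steps 2-3)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(selectors))
    return selectors
```

A second call for any URL on the same domain finds the JSON file and never invokes discover, which is why repeated runs are free.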

FAQs

What if selector discovery fails?

Check that your API key is valid and that the target URL is publicly accessible. Run with --debug to capture the extracted HTML and inspect what was sent to the LLM.

How do I force re-discovery for a domain I have already cached?

Pass --force (or -F) to the CLI, or set force=True on the Pipeline constructor. You can also delete the corresponding file from .yosoi/selectors/ and run again.

Can I use pip instead of uv?

Yes. pip install yosoi works fine. The docs use uv because it is faster and handles virtual environments automatically, but there is no hard dependency on it.

How do I switch providers?

Pass --model groq:llama-3.3-70b-versatile (or any provider:model string) to the CLI, or set YOSOI_MODEL in your .env file.
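A provider:model string splits on the first colon, so model names containing dashes or dots pass through untouched. A trivial sketch of that parsing (illustrative; not Yosoi's code):

```python
def parse_model(spec: str) -> tuple[str, str]:
    """Split a 'provider:model' string on the first colon."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider:model, got {spec!r}")
    return provider, model

print(parse_model("groq:llama-3.3-70b-versatile"))  # ('groq', 'llama-3.3-70b-versatile')
```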

References

uv. Astral. Python package and project manager. https://docs.astral.sh/uv/

QScrape. Cascading Labs. Purpose-built fictional scraping targets for benchmarking and testing. https://qscrape.dev