
Classes

Generated from yosoi v0.0.1a11. Only symbols in __all__ are listed.

Contract

Base class for user-defined scraping contracts.

define

define(name: str) -> ContractBuilder

Start a fluent ContractBuilder for the given contract name.

discovery_field_names

discovery_field_names() -> set[str]

Return the set of flattened field names used for discovery and cache keys.

Non-Contract fields keep their original name; nested Contract fields are expanded to {parent}_{child} keys. This matches the key format used by snapshots, field_descriptions(), and get_selector_overrides().

field_descriptions

field_descriptions() -> dict[str, str]

Return a mapping of field name to description, excluding selector overrides.

Nested Contract-typed fields are expanded to flat {parent}_{child} keys. When the child contract has a pinned root, the description includes a scoping hint. When the child has root = ys.discover(), a co-location hint is added.

field_hints

field_hints() -> dict[str, str | None]

Return yosoi_hint per (flat) field name, expanding nested contracts to {parent}_{child}.

generate_manifest

generate_manifest() -> str

Return a markdown table documenting all contract fields and their config.

get_root

get_root() -> SelectorEntry | None

Return the root selector if explicitly set on the contract class.

Returns: SelectorEntry | None — SelectorEntry for the repeating container element, or None.

get_selector_overrides

get_selector_overrides() -> dict[str, dict[str, str]]

Return selector overrides defined on fields via yosoi_selector.

Returns: dict[str, dict[str, str]] — Mapping of field name to selector dict (e.g. {"primary": "h1.title"}). Nested contract overrides use flat {parent}_{child} keys.

is_grouped

is_grouped() -> bool

Return True if the contract explicitly configures multi-item mode.

list_fields

list_fields() -> dict[str, type]

Return {field_name: inner_type} for fields annotated as list[T].
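A stand-alone sketch of this kind of annotation introspection using the standard `typing` helpers (yosoi's actual implementation may differ; the `Catalog` class is illustrative):

```python
from typing import get_args, get_origin, get_type_hints

# Hypothetical contract-like class with one scalar and two list[T] fields.
class Catalog:
    title: str
    tags: list[str]
    prices: list[float]

def list_fields(cls: type) -> dict[str, type]:
    # Keep only annotations whose origin is list, mapping name -> inner type T.
    return {
        name: get_args(ann)[0]
        for name, ann in get_type_hints(cls).items()
        if get_origin(ann) is list
    }

list_fields(Catalog)  # -> {"tags": str, "prices": float}
```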

nested_contracts

nested_contracts() -> dict[str, type[Contract]]

Return a mapping of field name → child Contract class for Contract-typed fields.

to_selector_model

to_selector_model() -> type[BaseModel]

Generate a Pydantic model mapping each contract field to FieldSelectors.

This ensures that the LLM agent knows exactly which fields to find selectors for, preserving any descriptions or hints provided in the contract. Fields with a yosoi_selector override are excluded — their selectors are provided directly and do not require AI discovery. Nested Contract-typed fields are expanded to flat {parent}_{child} entries.
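The exclusion logic can be sketched without Pydantic: given the flat field names and the override mapping, only the non-overridden fields need AI discovery (a hypothetical helper, not yosoi's code):

```python
# Hypothetical sketch: fields with a yosoi_selector override are excluded
# from the discovery model; nested fields arrive already flattened to
# {parent}_{child} keys.
def fields_needing_discovery(
    flat_fields: set[str], overrides: dict[str, dict[str, str]]
) -> set[str]:
    return flat_fields - overrides.keys()

fields_needing_discovery(
    {"title", "company_name", "company_url"},
    {"title": {"primary": "h1.title"}},
)
# -> {"company_name", "company_url"}
```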

DebugConfig

Configuration for debug output.

JobPosting

Contract for job listing pages.

LLMConfig

Base configuration for any LLM provider.

NewsArticle

Default contract matching the original 5-field behavior.

Pipeline

Main pipeline for discovering and saving CSS selectors with retry logic.

Fetches HTML, cleans it, runs LLM-based selector discovery, then verifies and stores the selectors.

normalize_url

normalize_url(url: str) -> str

Add a protocol to the URL, preferring https. Args:

  • url str — The URL to fetch

Returns: str — The complete URL
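A minimal stand-alone sketch of the described behavior (not yosoi's code): leave URLs with a scheme untouched and default the rest to https.

```python
# Sketch: add a scheme when the URL has none, preferring https.
def normalize_url(url: str) -> str:
    if url.startswith(("http://", "https://")):
        return url
    return f"https://{url}"

normalize_url("example.com/jobs")    # -> "https://example.com/jobs"
normalize_url("http://example.com")  # unchanged
```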

process_url

process_url(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> None

Process a single URL: discover, verify, and save selectors.

Thin wrapper around scrape() that drains the generator. Raises on failure — callers are responsible for error handling. Args:

  • url str — URL to process
  • force bool | None — Force re-discovery even if selectors exist. Defaults to False.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • output_format str | list[str] | None — Format(s) for extracted content. Defaults to None (uses pipeline default).
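The "drains the generator" pattern can be sketched with a stub in place of Pipeline.scrape (the stub and counts are illustrative, not yosoi's behavior):

```python
import asyncio

# Stub async generator standing in for Pipeline.scrape.
async def scrape_stub(url: str):
    for item in ({"title": "a"}, {"title": "b"}):
        yield item

async def process_url(url: str) -> int:
    count = 0
    # Iterate purely for side effects (saving happens inside the generator);
    # any exception propagates to the caller.
    async for _ in scrape_stub(url):
        count += 1
    return count

asyncio.run(process_url("https://example.com"))  # -> 2
```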

process_urls

process_urls(urls: list[str], force: bool | None = None, skip_verification: bool = False, fetcher_type: str = 'simple', max_fetch_retries: int = 2, max_discovery_retries: int = 3, output_format: str | list[str] | None = None, workers: int = 1, on_complete: Callable[[str, bool, float], Awaitable[None]] | None = None, on_start: Callable[[str], Awaitable[None]] | None = None) -> dict[str, list[str]]

Process multiple URLs and collect results.

When workers > 1 and there are multiple URLs, processing runs concurrently via the taskiq broker. Otherwise URLs are processed sequentially. Args:

  • urls list[str] — List of URLs to process.
  • force bool | None — Force re-discovery even if selectors exist. Defaults to False.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • output_format str | list[str] | None — Format(s) for extracted content. Defaults to None (uses pipeline default).
  • workers int — Number of concurrent workers. Defaults to 1 (sequential).
  • on_complete Callable[[str, bool, float], Awaitable[None]] | None — Optional async callback (url, success, elapsed) called after each URL finishes. Used by the CLI for live progress display.
  • on_start Callable[[str], Awaitable[None]] | None — Optional async callback (url) called just before each URL begins processing.

Returns: dict[str, list[str]] — Dictionary with keys:

  • 'successful': List of successfully processed URLs
  • 'failed': List of URLs that failed processing
  • 'skipped': List of URLs skipped (concurrent only)
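The shape of the returned dictionary can be illustrated with a hypothetical partitioning helper (not yosoi's code; 'skipped' is only populated in concurrent runs):

```python
# Sketch of filing each processed URL under its outcome key.
def partition_results(outcomes: dict[str, bool]) -> dict[str, list[str]]:
    results: dict[str, list[str]] = {"successful": [], "failed": [], "skipped": []}
    for url, ok in outcomes.items():
        results["successful" if ok else "failed"].append(url)
    return results

partition_results({"https://a.test": True, "https://b.test": False})
# -> {"successful": ["https://a.test"], "failed": ["https://b.test"], "skipped": []}
```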

scrape

scrape(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> AsyncIterator[ContentMap]

Async generator yielding individual content items from a URL.

Canonical entry point — handles both cached-selector replay and fresh AI discovery. For multi-item pages (catalogs, listings), yields one ContentMap per matched container element. For single-item pages, yields exactly one ContentMap. Args:

  • url str — URL to process
  • force bool | None — Force re-discovery even if selectors exist. Defaults to the pipeline-level force flag.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • output_format str | list[str] | None — Format(s) for saving extracted content.

Yields: AsyncIterator[ContentMap] — ContentMap dicts — one per extracted item.
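Consuming a per-item async generator like scrape() follows the standard `async for` pattern; the stub below stands in for the real pipeline (titles and URL are illustrative):

```python
import asyncio

# Stub: a multi-item page yields one mapping per matched container element.
async def scrape_stub(url: str):
    for item in ({"title": "Post 1"}, {"title": "Post 2"}):
        yield item

async def collect_titles(url: str) -> list[str]:
    # Items stream in one at a time; collect a field from each.
    return [item["title"] async for item in scrape_stub(url)]

asyncio.run(collect_titles("https://example.com/blog"))
# -> ["Post 1", "Post 2"]
```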

show_llm_stats

show_llm_stats() -> None

Show LLM usage statistics.

show_summary

show_summary() -> None

Show summary of all saved selectors.

TelemetryConfig

Configuration for observability / telemetry.

Video

Contract for video pages (YouTube-style).

YosoiConfig

Top-level Yosoi configuration bundling LLM, debug, and telemetry settings.

Example::

    config = YosoiConfig(
        llm=ys.groq('llama-3.3-70b-versatile', api_key),
        debug=DebugConfig(save_html=False),
    )
    pipeline = Pipeline(config, contract=MyContract)

validate_api_key_env

validate_api_key_env() -> YosoiConfig

Resolve API key for the selected provider, falling back to others if needed.

Resolution order:

  1. Use api_key if already set on the LLMConfig.
  2. Try the environment variable for the selected provider.
  3. Walk PROVIDER_FALLBACK_ORDER and use the first provider with a key.
  4. Raise if nothing works.