Classes
Generated from yosoi
v0.0.1a11. Only symbols in __all__ are listed.
Contract
Base class for user-defined scraping contracts.
define
define(name: str) -> ContractBuilder
Start a fluent ContractBuilder for the given contract name.
discovery_field_names
discovery_field_names() -> set[str]
Return the set of flattened field names used for discovery and cache keys.
Non-Contract fields keep their original name; nested Contract fields are
expanded to {parent}_{child} keys. This matches the key format used
by snapshots, field_descriptions(), and get_selector_overrides().
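The {parent}_{child} flattening rule can be illustrated with a small standalone sketch. This is illustrative only, not yosoi's actual implementation; it models nested contracts as plain dicts:

```python
# Standalone sketch of the {parent}_{child} flattening rule described
# above. Illustrative only -- not yosoi's actual implementation.

def flatten_fields(fields: dict[str, object]) -> set[str]:
    """Flatten one level of nesting: dict values stand in for nested
    contracts, whose keys expand to parent_child; plain fields keep
    their original name."""
    flat: set[str] = set()
    for name, value in fields.items():
        if isinstance(value, dict):  # nested "contract"
            for child in value:
                flat.add(f"{name}_{child}")
        else:
            flat.add(name)
    return flat

# A contract with one plain field and a nested "author" contract:
fields = {"title": str, "author": {"name": str, "url": str}}
print(sorted(flatten_fields(fields)))
# ['author_name', 'author_url', 'title']
```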
field_descriptions
field_descriptions() -> dict[str, str]
Return a mapping of field name to description, excluding selector overrides.
Nested Contract-typed fields are expanded to flat {parent}_{child} keys.
When the child contract has a pinned root, the description includes a scoping hint.
When the child has root = ys.discover(), a co-location hint is added.
field_hints
field_hints() -> dict[str, str | None]
Return yosoi_hint per (flat) field name, expanding nested contracts to {parent}_{child}.
generate_manifest
generate_manifest() -> str
Return a markdown table documenting all contract fields and their config.
get_root
get_root() -> SelectorEntry | None
Return the root selector if explicitly set on the contract class.
Returns: SelectorEntry | None — SelectorEntry for the repeating container element, or None.
get_selector_overrides
get_selector_overrides() -> dict[str, dict[str, str]]
Return selector overrides defined on fields via yosoi_selector.
Returns: dict[str, dict[str, str]] — Mapping of field name to selector dict (e.g. {"primary": "h1.title"}). Nested contract overrides use flat {parent}_{child} keys.
is_grouped
is_grouped() -> bool
Return True if the contract explicitly configures multi-item mode.
list_fields
list_fields() -> dict[str, type]
Return {field_name: inner_type} for fields annotated as list[T].
nested_contracts
nested_contracts() -> dict[str, type[Contract]]
Return a mapping of field name → child Contract class for Contract-typed fields.
to_selector_model
to_selector_model() -> type[BaseModel]
Generate a Pydantic model mapping each contract field to FieldSelectors.
This ensures that the LLM agent knows exactly which fields to find selectors for,
preserving any descriptions or hints provided in the contract.
Fields with a yosoi_selector override are excluded — their selectors are
provided directly and do not require AI discovery.
Nested Contract-typed fields are expanded to flat {parent}_{child} entries.
DebugConfig
Configuration for debug output.
JobPosting
Contract for job listing pages.
LLMConfig
Base configuration for any LLM provider.
NewsArticle
Default contract matching the original 5-field behavior.
Pipeline
Main pipeline for discovering and saving CSS selectors with retry logic.
Fetches HTML, cleans it, runs LLM-based selector discovery, then verifies and stores the selectors.
normalize_url
normalize_url(url: str) -> str
Add protocol to a URL, preferring https.
Args:
- url (str) — The URL that is being fetched.
Returns: str — The complete URL
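A plausible sketch of the documented behavior (not the library's source): pass through URLs that already carry a protocol, otherwise prepend https.

```python
# Plausible sketch of normalize_url's documented behavior: add a
# protocol when one is missing, preferring https. Not yosoi's source.

def normalize_url(url: str) -> str:
    url = url.strip()
    if url.startswith(("http://", "https://")):
        return url  # already has an explicit protocol
    return f"https://{url}"  # prefer https when none was given

print(normalize_url("example.com/jobs"))
# https://example.com/jobs
print(normalize_url("http://example.com"))
# http://example.com
```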
process_url
process_url(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> None
Process a single URL: discover, verify, and save selectors.
Thin wrapper around scrape that drains the generator.
Raises on failure — callers are responsible for error handling.
Args:
- url (str) — URL to process.
- force (bool | None) — Force re-discovery even if selectors exist. Defaults to False.
- max_fetch_retries (int) — Maximum fetch retry attempts. Defaults to 2.
- max_discovery_retries (int) — Maximum AI discovery retry attempts. Defaults to 3.
- skip_verification (bool) — Skip verification step. Defaults to False.
- fetcher_type (str) — Type of fetcher ('simple'). Defaults to 'simple'.
- output_format (str | list[str] | None) — Format(s) for extracted content. Defaults to None (uses pipeline default).
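The "drains the generator" relationship can be shown with plain asyncio. The bodies below are toy stand-ins, not yosoi's implementations:

```python
# Toy illustration of the process_url/scrape relationship described
# above: scrape is an async generator yielding one dict per item, and
# process_url is a thin wrapper that iterates it to exhaustion.
# These bodies are stand-ins, not yosoi's actual implementations.
import asyncio
from typing import AsyncIterator

async def scrape(url: str) -> AsyncIterator[dict]:
    # Pretend the page yields two extracted items ("ContentMaps").
    for i in range(2):
        yield {"url": url, "item": i}

async def process_url(url: str) -> None:
    # Drain the generator, discarding yields; exceptions propagate
    # to the caller, matching the "raises on failure" contract.
    async for _ in scrape(url):
        pass

asyncio.run(process_url("https://example.com"))
```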
process_urls
process_urls(urls: list[str], force: bool | None = None, skip_verification: bool = False, fetcher_type: str = 'simple', max_fetch_retries: int = 2, max_discovery_retries: int = 3, output_format: str | list[str] | None = None, workers: int = 1, on_complete: Callable[[str, bool, float], Awaitable[None]] | None = None, on_start: Callable[[str], Awaitable[None]] | None = None) -> dict[str, list[str]]
Process multiple URLs and collect results.
When workers > 1 and there are multiple URLs, processing runs
concurrently via the taskiq broker. Otherwise URLs are processed
sequentially.
Args:
- urls (list[str]) — List of URLs to process.
- force (bool | None) — Force re-discovery even if selectors exist. Defaults to False.
- skip_verification (bool) — Skip verification step. Defaults to False.
- fetcher_type (str) — Type of fetcher ('simple'). Defaults to 'simple'.
- max_fetch_retries (int) — Maximum fetch retry attempts. Defaults to 2.
- max_discovery_retries (int) — Maximum AI discovery retry attempts. Defaults to 3.
- output_format (str | list[str] | None) — Format(s) for extracted content. Defaults to None (uses pipeline default).
- workers (int) — Number of concurrent workers. Defaults to 1 (sequential).
- on_complete (Callable[[str, bool, float], Awaitable[None]] | None) — Optional async callback(url, success, elapsed) called after each URL finishes. Used by the CLI for live progress display.
- on_start (Callable[[str], Awaitable[None]] | None) — Optional async callback(url) called just before each URL begins processing.
Returns: dict[str, list[str]] — Dictionary with keys:
- 'successful': List of successfully processed URLs
- 'failed': List of URLs that failed processing
- 'skipped': List of URLs skipped (concurrent only)
scrape
scrape(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> AsyncIterator[ContentMap]
Async generator yielding individual content items from a URL.
Canonical entry point — handles both cached-selector replay and fresh
AI discovery. For multi-item pages (catalogs, listings), yields one
ContentMap per matched container element. For single-item pages,
yields exactly one ContentMap.
Args:
- url (str) — URL to process.
- force (bool | None) — Force re-discovery even if selectors exist. Defaults to the pipeline-level force flag.
- max_fetch_retries (int) — Maximum fetch retry attempts. Defaults to 2.
- max_discovery_retries (int) — Maximum AI discovery retry attempts. Defaults to 3.
- skip_verification (bool) — Skip verification step. Defaults to False.
- fetcher_type (str) — Type of fetcher ('simple'). Defaults to 'simple'.
- output_format (str | list[str] | None) — Format(s) for saving extracted content.
Yields: AsyncIterator[ContentMap] — one ContentMap dict per extracted item.
show_llm_stats
show_llm_stats() -> None
Show LLM usage statistics.
show_summary
show_summary() -> None
Show summary of all saved selectors.
TelemetryConfig
Configuration for observability / telemetry.
Video
Contract for video pages (YouTube-style).
YosoiConfig
Top-level Yosoi configuration bundling LLM, debug, and telemetry settings.
Example::

    config = YosoiConfig(
        llm=ys.groq('llama-3.3-70b-versatile', api_key),
        debug=DebugConfig(save_html=False),
    )
    pipeline = Pipeline(config, contract=MyContract)

validate_api_key_env
validate_api_key_env() -> YosoiConfig
Resolve API key for the selected provider, falling back to others if needed.
Resolution order:
- Use api_key if already set on the LLMConfig.
- Try the environment variable for the selected provider.
- Walk PROVIDER_FALLBACK_ORDER and use the first provider with a key.
- Raise if nothing works.