# Multi-Item Extraction
Yosoi handles pages where a selector matches many elements — product cards, article lists, search results — by detecting a repeating container and yielding one item per match.
## Auto-discovery
The default: pass a Contract and let the AI find the container.
```python
import yosoi as ys

class Product(ys.Contract):
    name: str = ys.Title()
    price: float = ys.Price()
    rating: str = ys.Rating()

pipeline = ys.Pipeline(ys.auto_config(), contract=Product)

async for item in pipeline.scrape('https://books.toscrape.com'):
    print(item.get('name'), item.get('price'))
```

`scrape()` is an async generator that yields one ContentMap per item found on the page.
## Pinning the Container
If the AI guesses wrong, or you already know the wrapper element, set root on the contract:
```python
class Product(ys.Contract):
    root = ys.css('article.product_pod')

    name: str = ys.Title()
    price: float = ys.Price(hint='Includes £ symbol')
```

`root` takes precedence over whatever the AI discovers. Use this when you want deterministic behaviour across runs.
## Single-Item Pages
scrape() works the same way on detail pages; it just yields one item. No special handling needed.
```python
class BookDetail(ys.Contract):
    title: str = ys.Title()
    price: float = ys.Price(hint='Includes £ symbol')
    availability: str = ys.Field(description='Stock availability status')

pipeline = ys.Pipeline(ys.auto_config(), contract=BookDetail)

async for item in pipeline.scrape('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'):
    print(item)  # yields exactly one item
```

## Saving Output
Pass `output_format='json'` to persist results. Multi-item pages are saved as `{"items": [...]}`.

```python
pipeline = ys.Pipeline(config, contract=Product, output_format='json')
```

## FAQs
### How does the AI determine the container selector?
It analyzes the page HTML for repeating structural patterns. If the same element type appears multiple times with consistent child structure, it is treated as the container. Providing a `hint` on your fields improves accuracy.
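The intuition can be approximated with a few lines of plain Python. This is an illustrative sketch of the repeating-pattern idea, not Yosoi's actual detection code: count `(tag, class)` signatures and treat the most repeated classed element as the likely item container.

```python
from collections import Counter
from html.parser import HTMLParser

class ContainerGuesser(HTMLParser):
    """Count (tag, class) signatures across the page; the most repeated
    classed element is a plausible item container. Illustrative only."""
    def __init__(self):
        super().__init__()
        self.signatures = Counter()

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get('class')
        if classes:  # unclassed wrappers rarely make useful selectors
            self.signatures[(tag, classes)] += 1

    def guess(self):
        (tag, classes), _count = self.signatures.most_common(1)[0]
        return f"{tag}.{classes.split()[0]}"

html = '<ul class="grid">' + '<li class="card"><h3>x</h3></li>' * 5 + '</ul>'
guesser = ContainerGuesser()
guesser.feed(html)
print(guesser.guess())  # li.card
```

The real detection also weighs child-structure consistency, which is why a field `hint` can tip the balance between candidate containers.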
### What if some items are missing fields?
Missing fields return `None` by default. Annotate the field as `T | None` in your contract to make this explicit and avoid validation errors.
### Can I scrape both a listing page and its detail pages in one pass?
Not directly in a single scrape() call. The typical pattern is to extract URLs from the listing page, then call scrape() again for each detail URL.
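The two-pass pattern looks like this. The `scrape` stub and the `example.com` URLs below are stand-ins so the sketch runs on its own; in real code each call would be `pipeline.scrape(url)` against the live site:

```python
import asyncio

# Stand-in for pipeline.scrape(); a real run would fetch each URL.
async def scrape(url):
    pages = {
        'https://example.com/listing': [
            {'url': 'https://example.com/item/1'},
            {'url': 'https://example.com/item/2'},
        ],
        'https://example.com/item/1': [{'title': 'First'}],
        'https://example.com/item/2': [{'title': 'Second'}],
    }
    for item in pages[url]:
        yield item

async def main():
    # Pass 1: pull detail-page URLs off the listing page.
    detail_urls = [item['url'] async for item in scrape('https://example.com/listing')]
    # Pass 2: scrape each detail page individually.
    details = []
    for url in detail_urls:
        async for item in scrape(url):
            details.append(item)
    return details

print(asyncio.run(main()))  # [{'title': 'First'}, {'title': 'Second'}]
```

Because `scrape()` is an async generator, the detail-page calls can also be fanned out concurrently if throughput matters.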
### What if the AI picks the wrong container and I get one giant item instead of many?
Pin the container with `root = ys.css('your.selector')` as shown above. Use `--debug` to inspect the extracted HTML and identify the correct selector.