
Concurrent Scraping

Pipeline.process_urls() accepts a list of URLs and a workers argument. When workers > 1, a Rich Live progress table appears automatically. No extra setup required.

Basic Usage

import asyncio

import yosoi as ys
from yosoi import Pipeline
from yosoi.utils.files import init_yosoi, is_initialized

URLS = [
    'https://news.ycombinator.com',
    'https://lobste.rs',
    'https://thenewstack.io',
]

async def main() -> None:
    if not is_initialized():
        init_yosoi()
    config = ys.auto_config()  # picks up YOSOI_MODEL / provider keys from .env
    pipeline = Pipeline(config, contract=ys.NewsArticle)
    results = await pipeline.process_urls(URLS, workers=3)
    print(f'Done: {len(results["successful"])} succeeded, {len(results["failed"])} failed')

asyncio.run(main())

Result Shape

process_urls() returns a dict with three keys:

Key         Value
successful  List of URLs that completed without error
failed      List of URLs that raised an exception
skipped     List of URLs that were skipped (concurrent mode only)
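
A minimal post-processing pass over this result shape might look like the following. The dict literal is illustrative sample data, not real output:

```python
# Illustrative result dict matching the documented three-key shape.
results = {
    "successful": ["https://news.ycombinator.com", "https://lobste.rs"],
    "failed": ["https://thenewstack.io"],
    "skipped": [],
}

# Queue failures for a manual retry pass and summarise counts per key.
retry_queue = list(results["failed"])
summary = {key: len(urls) for key, urls in results.items()}
```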

Notes

  • When workers > 1, Yosoi dispatches each URL as an independent task via a TaskIQ in-memory broker. Workers run as concurrent asyncio tasks inside the same event loop — no separate processes or external queue required.
  • Selector discovery (the one-time LLM call) is also parallelised across workers.
  • Cached domains skip discovery entirely; only extraction runs.
  • workers=1 (the default) runs sequentially with no progress table.
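
The worker model described above can be approximated with plain asyncio — a semaphore bounding concurrency and per-URL exception capture. This is a simplified sketch of the general pattern, not Yosoi's actual internals (Yosoi uses a TaskIQ in-memory broker rather than a bare semaphore):

```python
import asyncio

async def process_all(urls, fetch, workers=3):
    """Run fetch(url) for every URL with at most `workers` in flight."""
    semaphore = asyncio.Semaphore(workers)
    successful, failed = [], []

    async def worker(url):
        async with semaphore:
            try:
                await fetch(url)
                successful.append(url)
            except Exception:
                # Failures are recorded per URL; the batch keeps going.
                failed.append(url)

    # All workers are tasks on the same event loop -- no separate processes.
    await asyncio.gather(*(worker(u) for u in urls))
    return {"successful": successful, "failed": failed, "skipped": []}

async def demo_fetch(url):
    # Stand-in for a real scrape; fails for one URL to show error handling.
    if "bad" in url:
        raise RuntimeError("simulated failure")

result = asyncio.run(
    process_all(["https://a.example", "https://bad.example"], demo_fetch)
)
```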

FAQs

How many workers should I use?

Start with 3 to 5. The right number depends on your LLM provider's rate limits and the target site's tolerance; higher counts can trigger rate limiting from either.

What happens if one URL fails?

It is added to the failed list and the rest continue. Exceptions are caught per-URL and do not abort the batch.

Does concurrency affect selector discovery?

Yes, in a good way. If multiple URLs share a domain that has not been discovered yet, one worker runs discovery while the others wait. Once cached, all workers use the result.
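
The "one worker discovers, the others wait" behaviour can be sketched with a per-domain asyncio.Lock guarding a cache. All names here are hypothetical illustrations, not Yosoi's real internals:

```python
import asyncio
from urllib.parse import urlsplit

discovery_cache: dict[str, str] = {}
domain_locks: dict[str, asyncio.Lock] = {}
discovery_calls = 0

async def get_selectors(url):
    """Return cached selectors for the URL's domain, discovering once per domain."""
    global discovery_calls
    domain = urlsplit(url).netloc
    lock = domain_locks.setdefault(domain, asyncio.Lock())
    async with lock:
        if domain not in discovery_cache:
            discovery_calls += 1      # stands in for the one-time LLM call
            await asyncio.sleep(0)    # simulate the expensive discovery step
            discovery_cache[domain] = f"selectors-for-{domain}"
    return discovery_cache[domain]

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    return await asyncio.gather(*(get_selectors(u) for u in urls))

selectors = asyncio.run(main())
```

Even though three workers request the same domain concurrently, the lock serialises discovery, so the expensive step runs exactly once.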

Can I retry failed URLs automatically?

Not built-in. After process_urls() returns, pass results["failed"] back into another process_urls() call to retry.
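
A simple retry wrapper along those lines might look like this. Here process_urls is any awaitable with the documented signature, and the flaky stub (which fails each URL once, then succeeds) is purely illustrative:

```python
import asyncio

async def process_with_retries(process_urls, urls, attempts=2, workers=3):
    """Re-feed the failed list into process_urls up to `attempts` times total."""
    results = await process_urls(urls, workers=workers)
    for _ in range(attempts - 1):
        if not results["failed"]:
            break
        retry = await process_urls(results["failed"], workers=workers)
        results["successful"] += retry["successful"]
        results["failed"] = retry["failed"]
    return results

# Stub standing in for Pipeline.process_urls(): fails first attempt per URL.
seen = set()

async def flaky(urls, workers=3):
    ok = [u for u in urls if u in seen]
    bad = [u for u in urls if u not in seen]
    seen.update(urls)
    return {"successful": ok, "failed": bad, "skipped": []}

final = asyncio.run(
    process_with_retries(flaky, ["https://a.example", "https://b.example"])
)
```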

References

Rich. Will McGugan. Python library for rich text and progress displays in the terminal. https://rich.readthedocs.io/

asyncio. Python Software Foundation. Asynchronous I/O framework in the Python standard library. https://docs.python.org/3/library/asyncio.html

TaskIQ. TaskIQ Contributors. Distributed task queue for Python with asyncio support, used internally as the broker for concurrent URL processing. https://taskiq-python.github.io/