
Concurrent Scraping

Pipeline.process_urls() accepts a list of URLs and a workers argument. When workers > 1, a Rich Live progress table appears automatically. No extra setup required.

Basic Usage

import asyncio

import yosoi as ys
from yosoi import Pipeline
from yosoi.utils.files import init_yosoi, is_initialized

URLS = [
    'https://news.ycombinator.com',
    'https://lobste.rs',
    'https://thenewstack.io',
]

async def main() -> None:
    if not is_initialized():
        init_yosoi()
    config = ys.auto_config()  # picks up YOSOI_MODEL / provider keys from .env
    pipeline = Pipeline(config, contract=ys.NewsArticle)
    results = await pipeline.process_urls(URLS, workers=3)
    print(f'Done: {len(results["successful"])} succeeded, {len(results["failed"])} failed')

asyncio.run(main())

Result Shape

process_urls() returns a dict with three keys:

Key         Value
successful  List of URLs that completed without error
failed      List of URLs that raised an exception
skipped     List of URLs that were skipped (concurrent mode only)
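
A minimal post-processing pass over this result shape might look like the following. The dict literal is illustrative sample data, not real output:

```python
# Illustrative result dict matching the documented three-key shape.
results = {
    "successful": ["https://news.ycombinator.com", "https://lobste.rs"],
    "failed": ["https://thenewstack.io"],
    "skipped": [],
}

# Queue failures for a manual retry pass and summarise counts per key.
retry_queue = list(results["failed"])
summary = {key: len(urls) for key, urls in results.items()}
```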

Notes

  • When workers > 1, Yosoi dispatches each URL as an independent task via a TaskIQ in-memory broker. Workers run as concurrent asyncio tasks inside the same event loop — no separate processes or external queue required.
  • Selector discovery (the one-time LLM call) is also parallelised across workers.
  • Cached domains skip discovery entirely; only extraction runs.
  • workers=1 (the default) runs sequentially with no progress table.
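
The worker model described above can be approximated with plain asyncio — a semaphore bounding concurrency and per-URL exception capture. This is a simplified sketch of the general pattern, not Yosoi's actual internals (Yosoi uses a TaskIQ in-memory broker rather than a bare semaphore):

```python
import asyncio

async def process_all(urls, fetch, workers=3):
    """Run fetch(url) for every URL with at most `workers` in flight."""
    semaphore = asyncio.Semaphore(workers)
    successful, failed = [], []

    async def worker(url):
        async with semaphore:
            try:
                await fetch(url)
                successful.append(url)
            except Exception:
                # Failures are recorded per URL; the batch keeps going.
                failed.append(url)

    # All workers are tasks on the same event loop -- no separate processes.
    await asyncio.gather(*(worker(u) for u in urls))
    return {"successful": successful, "failed": failed, "skipped": []}

async def demo_fetch(url):
    # Stand-in for a real scrape; fails for one URL to show error handling.
    if "bad" in url:
        raise RuntimeError("simulated failure")

result = asyncio.run(
    process_all(["https://a.example", "https://bad.example"], demo_fetch)
)
```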

FAQs

How many workers should I use?

Start with 3 to 5. The right number depends on your LLM provider's rate limits and the target site's tolerance; higher counts can trigger rate limiting from either.

What happens if one URL fails?

It is added to the failed list and the rest continue. Exceptions are caught per-URL and do not abort the batch.

Does concurrency affect selector discovery?

Yes, in a good way. If multiple URLs share a domain that has not been discovered yet, one worker runs discovery while the others wait. Once cached, all workers use the result.
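
The "one worker discovers, the others wait" behaviour can be sketched with a per-domain asyncio.Lock guarding a cache. All names here are hypothetical illustrations, not Yosoi's real internals:

```python
import asyncio
from urllib.parse import urlsplit

discovery_cache: dict[str, str] = {}
domain_locks: dict[str, asyncio.Lock] = {}
discovery_calls = 0

async def get_selectors(url):
    """Return cached selectors for the URL's domain, discovering once per domain."""
    global discovery_calls
    domain = urlsplit(url).netloc
    lock = domain_locks.setdefault(domain, asyncio.Lock())
    async with lock:
        if domain not in discovery_cache:
            discovery_calls += 1      # stands in for the one-time LLM call
            await asyncio.sleep(0)    # simulate the expensive discovery step
            discovery_cache[domain] = f"selectors-for-{domain}"
    return discovery_cache[domain]

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    return await asyncio.gather(*(get_selectors(u) for u in urls))

selectors = asyncio.run(main())
```

Even though three workers request the same domain concurrently, the lock serialises discovery, so the expensive step runs exactly once.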

Can I retry failed URLs automatically?

Not built-in. After process_urls() returns, pass results["failed"] back into another process_urls() call to retry.
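
A simple retry wrapper along those lines might look like this. Here process_urls is any awaitable with the documented signature, and the flaky stub (which fails each URL once, then succeeds) is purely illustrative:

```python
import asyncio

async def process_with_retries(process_urls, urls, attempts=2, workers=3):
    """Re-feed the failed list into process_urls up to `attempts` times total."""
    results = await process_urls(urls, workers=workers)
    for _ in range(attempts - 1):
        if not results["failed"]:
            break
        retry = await process_urls(results["failed"], workers=workers)
        results["successful"] += retry["successful"]
        results["failed"] = retry["failed"]
    return results

# Stub standing in for Pipeline.process_urls(): fails first attempt per URL.
seen = set()

async def flaky(urls, workers=3):
    ok = [u for u in urls if u in seen]
    bad = [u for u in urls if u not in seen]
    seen.update(urls)
    return {"successful": ok, "failed": bad, "skipped": []}

final = asyncio.run(
    process_with_retries(flaky, ["https://a.example", "https://b.example"])
)
```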

References

Rich. Will McGugan. Python library for rich text and progress displays in the terminal. https://rich.readthedocs.io/

asyncio. Python Software Foundation. Asynchronous I/O framework in the Python standard library. https://docs.python.org/3/library/asyncio.html

TaskIQ. TaskIQ Contributors. Distributed task queue for Python with asyncio support, used internally as the broker for concurrent URL processing. https://taskiq-python.github.io/