Crawler
- class ocdsindex.crawler.Crawler(directory, base_url, extract, *, allow=<function true>)
Crawls a directory for documents to index.
- __init__(directory, base_url, extract, *, allow=<function true>)
- Parameters:
directory (str) – the directory to crawl
base_url (str) – the remote URL at which the files will be available
extract – a function that accepts a file’s remote URL and its root HTML element, and returns the documents to index as a list of dicts
allow – a function that accepts a directory path and a file basename, and returns a boolean indicating whether to crawl the file
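
The two callback parameters described above might be written along the following lines. This is a minimal sketch, not the library's own code: the `url` and `title` keys of the returned dicts, the `_static` exclusion rule, and the example URL are all hypothetical. It also assumes that the root element supports the ElementTree-style `iter()` and `itertext()` methods; the standard library's `xml.etree.ElementTree` stands in for the real HTML parser so that the callbacks can be exercised without building a site.

```python
import os
import xml.etree.ElementTree as ET


def extract(url, root):
    """Return the documents to index as a list of dicts, one per <h1> element.

    The dict keys ("url", "title") are hypothetical, chosen for illustration.
    """
    documents = []
    for heading in root.iter("h1"):
        # Join the heading's text, including text inside nested inline elements.
        title = "".join(heading.itertext()).strip()
        documents.append({"url": url, "title": title})
    return documents


def allow(directory, basename):
    """Return True for .html files that are not under a `_static` directory.

    The `_static` rule is an assumed example of a directory to skip.
    """
    parts = os.path.normpath(directory).split(os.sep)
    return basename.endswith(".html") and "_static" not in parts


# Exercise the callbacks on their own, with a well-formed stand-in document.
root = ET.fromstring("<html><body><h1>Release <em>schema</em></h1></body></html>")
documents = extract("https://example.org/docs/index.html", root)

# With the library installed, the callbacks would be passed to the constructor
# using the signature documented above (paths and URL are placeholders):
#
#   from ocdsindex.crawler import Crawler
#   crawler = Crawler("build/html", "https://example.org/docs/", extract, allow=allow)
```

Because `allow` is keyword-only, it has to be passed by name; omitting it leaves the default in place.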