autogen.retrieve_utils.create_vector_db_from_dir

create_vector_db_from_dir

create_vector_db_from_dir(
    dir_path: str | list[str],
    max_tokens: int = 4000,
    client: API = None,
    db_path: str = 'tmp/chromadb.db',
    collection_name: str = 'all-my-documents',
    get_or_create: bool = False,
    chunk_mode: str = 'multi_lines',
    must_break_at_empty_line: bool = True,
    embedding_model: str = 'all-MiniLM-L6-v2',
    embedding_function: Callable = None,
    custom_text_split_function: Callable = None,
    custom_text_types: list[str] = ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx'],
    recursive: bool = True,
    extra_docs: bool = False
) -> API

Create a vector database from all the files in a given directory. The directory can also be a single file or a URL to a single file. Any chromadb-compatible API can be used to create the vector database. This function is not required if you have already prepared your own vector database.

Parameters:

Name	Description
`dir_path`	Path to a directory, a file, a URL, or a list of these. Type: str \| list[str]
`max_tokens`	Maximum number of tokens per chunk. Type: int Default: 4000
`client`	The chromadb client. Type: API Default: None
`db_path`	Path to the chromadb database. The default was `/tmp/chromadb.db` for versions `<=0.2.24`. Type: str Default: 'tmp/chromadb.db'
`collection_name`	Name of the collection. Type: str Default: 'all-my-documents'
`get_or_create`	Whether to get the collection if it already exists instead of failing. If True, an existing collection is returned. If False and the collection already exists, a ValueError is raised. Type: bool Default: False
`chunk_mode`	The chunk mode. Type: str Default: 'multi_lines'
`must_break_at_empty_line`	Whether chunks must break at an empty line. Type: bool Default: True
`embedding_model`	The embedding model to use. Ignored if `embedding_function` is not None. Type: str Default: 'all-MiniLM-L6-v2'
`embedding_function`	The embedding function to use. If None, SentenceTransformer with the given `embedding_model` is used. To use OpenAI, Cohere, HuggingFace, or other embedding functions, pass one here, following the examples at `https://docs.trychroma.com/embeddings`. Type: Callable Default: None
`custom_text_split_function`	A custom function that splits a string into a list of strings. If None, the default function in `autogen.retrieve_utils.split_text_to_chunks` is used. Type: Callable Default: None
`custom_text_types`	A list of file types to be processed. Defaults to TEXT_FORMATS. Type: list[str] Default: ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx']
`recursive`	Whether to search for documents recursively in `dir_path`. Type: bool Default: True
`extra_docs`	Whether to add more documents to the collection. Type: bool Default: False
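
As an illustration of `custom_text_split_function`, the sketch below shows one function that takes a string and returns a list of strings, which is the shape the parameter description asks for. The name `split_by_paragraph` and the paragraph-based strategy are assumptions made for this example, not part of the library.

```python
import re


def split_by_paragraph(text: str) -> list[str]:
    """Hypothetical splitter: one chunk per paragraph, empty chunks dropped."""
    # Paragraphs are separated by a blank line, which may contain whitespace.
    paragraphs = re.split(r"\n\s*\n", text)
    # Trim surrounding whitespace and skip chunks that end up empty.
    return [p.strip() for p in paragraphs if p.strip()]
```

A function like this would be passed as `custom_text_split_function=split_by_paragraph`. Whether paragraph-sized chunks suit a given corpus depends on the documents, so treat the strategy as a starting point.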

Returns:

Type	Description
API	The chromadb client.
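
A minimal usage sketch, assuming `autogen` and `chromadb` are installed. The helper names `build_docs_db` and `count_chunks`, the argument values, and the choice of file types are illustrative assumptions. Only the import path and keyword arguments come from the signature above, and `get_collection` and `count` are chromadb client and collection methods.

```python
def build_docs_db(docs_dir: str, db_path: str = "tmp/chromadb.db"):
    """Hypothetical helper: index docs_dir and return the chromadb client."""
    # Imported lazily so this module loads even when autogen is not installed.
    from autogen.retrieve_utils import create_vector_db_from_dir

    return create_vector_db_from_dir(
        dir_path=docs_dir,
        max_tokens=2000,                  # smaller chunks than the default 4000
        db_path=db_path,
        collection_name="all-my-documents",
        get_or_create=True,               # reuse the collection if it already exists
        custom_text_types=["md", "txt"],  # assumption: only index these file types
    )


def count_chunks(client, collection_name: str = "all-my-documents") -> int:
    """Hypothetical helper: number of stored chunks in the named collection."""
    # The returned client is a chromadb client, so its collection API applies.
    return client.get_collection(collection_name).count()
```

Calling `count_chunks(build_docs_db("docs"))` would then report how many chunks were stored, where `"docs"` stands in for a real directory.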
