create_vector_db_from_dir

create_vector_db_from_dir(
    dir_path: str | list[str],
    max_tokens: int = 4000,
    client: API = None,
    db_path: str = 'tmp/chromadb.db',
    collection_name: str = 'all-my-documents',
    get_or_create: bool = False,
    chunk_mode: str = 'multi_lines',
    must_break_at_empty_line: bool = True,
    embedding_model: str = 'all-MiniLM-L6-v2',
    embedding_function: Callable = None,
    custom_text_split_function: Callable = None,
    custom_text_types: list[str] = ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx'],
    recursive: bool = True,
    extra_docs: bool = False
) -> API

Create a vector db from all the files in a given directory. The directory can also be a single file or a URL to a single file. Chromadb-compatible APIs are used to create the vector db. This function is not required if you have prepared your own vector db.
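
A minimal usage sketch follows. The wrapper name `index_directory` and the value `max_tokens=2000` are illustrative, not part of the library. The import path is an assumption, taken from the `autogen.retrieve_utils` module named in the `custom_text_split_function` entry of this page.

```python
from typing import Any


def index_directory(
    dir_path: str,
    db_path: str = "tmp/chromadb.db",
    collection_name: str = "all-my-documents",
) -> Any:
    """Index the files under dir_path and return the chromadb client."""
    # Imported lazily so this module still loads where autogen is not installed.
    # Assumption: the function lives in autogen.retrieve_utils.
    from autogen.retrieve_utils import create_vector_db_from_dir

    return create_vector_db_from_dir(
        dir_path=dir_path,
        max_tokens=2000,  # illustrative value; the documented default is 4000
        db_path=db_path,
        collection_name=collection_name,
        get_or_create=True,  # return an existing collection instead of raising ValueError
        recursive=True,
    )
```

Passing `get_or_create=True` keeps the call from failing with ValueError when the collection already exists, which is convenient for scripts that are run more than once.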

Parameters:
Name: Description
dir_path: The path to a directory, a file, a URL, or a list of them.

Type: str | list[str]
max_tokens: The maximum number of tokens per chunk.

Type: int

Default: 4000
client: The chromadb client.

Type: API

Default: None
db_path: The path to the chromadb.

The default was /tmp/chromadb.db for version <=0.2.24.

Type: str

Default: 'tmp/chromadb.db'
collection_name: The name of the collection.

Type: str

Default: 'all-my-documents'
get_or_create: Whether to get the collection if it already exists instead of failing.

If True, an existing collection is returned.

If False and the collection already exists, ValueError is raised.

Type: bool

Default: False
chunk_mode: The chunk mode.

Type: str

Default: 'multi_lines'
must_break_at_empty_line: Whether chunks must break at an empty line.

Type: bool

Default: True
embedding_model: The embedding model to use.

Ignored if embedding_function is not None.

Type: str

Default: 'all-MiniLM-L6-v2'
embedding_function: The embedding function to use.

If None, a SentenceTransformer with the given embedding_model is used.

To use OpenAI, Cohere, HuggingFace, or other embedding functions, pass one here; see the examples at https://docs.trychroma.com/embeddings.

Type: Callable

Default: None
custom_text_split_function: A custom function that splits a string into a list of strings.

If None, the default function autogen.retrieve_utils.split_text_to_chunks is used.

Type: Callable

Default: None
custom_text_types: A list of file types to be processed.

Defaults to TEXT_FORMATS.

Type: list[str]

Default: ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx']
recursive: Whether to search for documents recursively in dir_path.

Type: bool

Default: True
extra_docs: Whether to add more documents to the collection.

Type: bool

Default: False
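
The `custom_text_split_function` parameter is described above as a function that splits a string into a list of strings. The sketch below shows one such function that splits at blank lines. The name `split_on_blank_lines` is hypothetical, and the only contract assumed is the one stated above: one string in, a list of strings out.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split text into paragraph chunks at blank lines, dropping empty chunks."""
    # A "blank line" here is a newline, optional whitespace, then another newline.
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical usage, assuming autogen is installed:
# create_vector_db_from_dir("docs", custom_text_split_function=split_on_blank_lines)
```

A splitter like this keeps each paragraph intact as one chunk, which may suit prose better than a token-count split. It does not enforce any maximum chunk size.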
Returns:
Type: Description
API: The chromadb client.
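
As a sketch of what can be done with the returned client, the helper below counts the chunks stored in a collection. The helper name `collection_size` is hypothetical. It assumes the client exposes Chroma's `get_collection` and that the collection exposes `count`; it does not query by similarity, so no embedding function is involved.

```python
from typing import Any


def collection_size(client: Any, collection_name: str = "all-my-documents") -> int:
    """Return the number of chunks stored in the named collection."""
    # Assumption: client is the chromadb client returned by create_vector_db_from_dir.
    collection = client.get_collection(collection_name)
    return collection.count()
```

A count of zero after indexing would suggest that no files under `dir_path` matched `custom_text_types`.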