retrieve_utils
autogen.retrieve_utils.create_vector_db_from_dir
create_vector_db_from_dir
Create a vector db from all the files in a given directory. The directory can also be a single file or a URL to
a single file. chromadb-compatible APIs are supported for creating the vector db. This function is not required if
you have already prepared your own vector db.
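A minimal usage sketch follows, assuming `autogen` and `chromadb` are installed. The keyword arguments are the parameters documented on this page. The helper name `build_docs_db`, the collection name `project-docs`, and the `./docs` directory are hypothetical and only serve as illustration.

```python
from typing import Any

# Illustrative values for parameters documented on this page.
DB_OPTIONS: dict[str, Any] = {
    "max_tokens": 2000,                   # smaller chunks than the default 4000
    "db_path": "tmp/chromadb.db",         # documented default location
    "collection_name": "project-docs",    # hypothetical collection name
    "get_or_create": True,                # reuse the collection if it exists
    "chunk_mode": "multi_lines",
    "custom_text_types": ["md", "txt"],   # only index these file types
    "recursive": True,
}


def build_docs_db(dir_path, **overrides):
    """Index dir_path and return the chromadb client (hypothetical helper)."""
    # Imported lazily so this module still loads where autogen is absent.
    from autogen.retrieve_utils import create_vector_db_from_dir

    options = {**DB_OPTIONS, **overrides}
    return create_vector_db_from_dir(dir_path=dir_path, **options)


# Example call (hypothetical directory):
# client = build_docs_db("./docs")
```

Setting `get_or_create=True` here avoids the `ValueError` that is raised when the collection already exists, which is convenient when the same script runs more than once.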
Name | Description | Type | Default |
---|---|---|---|
dir_path | The path to the directory, file, or URL, or a list of them. | `str \| list[str]` | |
max_tokens | The maximum number of tokens per chunk. | `int` | `4000` |
client | The chromadb client. | `API` | `None` |
db_path | The path to the chromadb. The default was `/tmp/chromadb.db` for versions <=0.2.24. | `str` | `'tmp/chromadb.db'` |
collection_name | The name of the collection. | `str` | `'all-my-documents'` |
get_or_create | Whether to get the collection if it already exists. If True, the existing collection is returned. If False and the collection already exists, a ValueError is raised. | `bool` | `False` |
chunk_mode | The chunk mode. | `str` | `'multi_lines'` |
must_break_at_empty_line | Whether to break at an empty line. | `bool` | `True` |
embedding_model | The embedding model to use. Ignored if embedding_function is not None. | `str` | `'all-MiniLM-L6-v2'` |
embedding_function | The embedding function to use. If None, SentenceTransformer with the given embedding_model is used. To use OpenAI, Cohere, HuggingFace, or other embedding functions, pass one here, following the examples at https://docs.trychroma.com/embeddings. | `Callable` | `None` |
custom_text_split_function | A custom function that splits a string into a list of strings. If None, the default function in autogen.retrieve_utils.split_text_to_chunks is used. | `Callable` | `None` |
custom_text_types | A list of file types to be processed. Defaults to TEXT_FORMATS. | `list[str]` | `['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx']` |
recursive | Whether to search for documents recursively in dir_path. | `bool` | `True` |
extra_docs | Whether to add more documents to the collection. | `bool` | `False` |
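As an illustration of `custom_text_split_function`, the sketch below splits text into paragraphs at blank lines. It relies only on the documented contract (a string in, a list of strings out). The function name `split_on_blank_lines` is hypothetical, and unlike the default splitter this sketch does not enforce `max_tokens`.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split text into paragraph chunks separated by one or more blank lines."""
    # A newline, optional whitespace, then another newline marks a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Trim surrounding whitespace and drop empty chunks.
    return [p.strip() for p in paragraphs if p.strip()]
```

It could then be passed as `custom_text_split_function=split_on_blank_lines`. A paragraph-based split keeps related sentences together, at the cost of uneven chunk sizes.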
Type | Description |
---|---|
API | The chromadb client. |
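The returned client can be used to query the collection afterwards. The sketch below assumes the standard chromadb client methods `get_collection` and `query`. The helper name `top_chunks` is hypothetical. If the collection was built with a custom `embedding_function`, the same function would likely need to be supplied when retrieving it.

```python
from typing import Any, Optional


def top_chunks(
    client: Any,
    question: str,
    collection_name: str = "all-my-documents",
    n_results: int = 3,
    embedding_function: Optional[Any] = None,
) -> list[str]:
    """Return the text of the chunks most similar to question (hypothetical helper)."""
    kwargs = {}
    # Only forward the embedding function when one was given, so the
    # client's own default applies otherwise.
    if embedding_function is not None:
        kwargs["embedding_function"] = embedding_function
    collection = client.get_collection(name=collection_name, **kwargs)
    results = collection.query(query_texts=[question], n_results=n_results)
    # "documents" holds one list of chunks per query text; a single query was sent.
    return results["documents"][0]
```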