create_qdrant_from_dir

create_qdrant_from_dir(
    dir_path: str,
    max_tokens: int = 4000,
    client: QdrantClient = None,
    collection_name: str = 'all-my-documents',
    chunk_mode: str = 'multi_lines',
    must_break_at_empty_line: bool = True,
    embedding_model: str = 'BAAI/bge-small-en-v1.5',
    custom_text_split_function: Callable = None,
    custom_text_types: list[str] = ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx'],
    recursive: bool = True,
    extra_docs: bool = False,
    parallel: int = 0,
    on_disk: bool = False,
    quantization_config: ForwardRef('QuantizationConfig') | None = None,
    hnsw_config: ForwardRef('HnswConfigDiff') | None = None,
    payload_indexing: bool = False,
    qdrant_client_options: dict[str, Any] | None = {}
) -> 

Create a Qdrant collection from all the files in a given directory. The path can also point to a single file or be a URL to a single file.
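A minimal usage sketch follows. It assumes the function can be imported from `autogen.agentchat.contrib.qdrant_retrieve_user_proxy_agent` (this page does not state the module path) and that `qdrant-client` is installed with its `fastembed` extra. The wrapper name `index_directory`, the `./docs` path, and the `my-docs` collection name are placeholders for illustration.

```python
def index_directory(dir_path: str, collection_name: str = "my-docs"):
    """Index the supported files under dir_path into an in-memory Qdrant collection."""
    # Imports are deferred so this module still loads where the optional
    # dependencies are missing. The autogen import path is an assumption.
    from qdrant_client import QdrantClient
    from autogen.agentchat.contrib.qdrant_retrieve_user_proxy_agent import (
        create_qdrant_from_dir,
    )

    client = QdrantClient(":memory:")  # local, non-persistent instance
    create_qdrant_from_dir(
        dir_path=dir_path,
        client=client,
        collection_name=collection_name,
        max_tokens=2000,                  # smaller chunks than the 4000 default
        custom_text_types=["md", "txt"],  # only index Markdown and plain text
        recursive=True,
    )
    return client


# Example call (requires the optional dependencies and a real directory):
# client = index_directory("./docs")
```

Passing an explicit client keeps the example independent of how the function builds one from `qdrant_client_options`.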

Parameters:
Name: Description
dir_path: the path to the directory, file, or URL.

Type: str
max_tokens: the maximum number of tokens per chunk.

Default is 4000.

Type: int

Default: 4000
client: the QdrantClient instance.

Default is None.

Type: QdrantClient

Default: None
collection_name: the name of the collection.

Default is “all-my-documents”.

Type: str

Default: 'all-my-documents'
chunk_mode: the chunk mode.

Default is “multi_lines”.

Type: str

Default: 'multi_lines'
must_break_at_empty_line: whether to break at an empty line.

Default is True.

Type: bool

Default: True
embedding_model: the embedding model to use.

Default is “BAAI/bge-small-en-v1.5”.

The list of all available models can be found at https://qdrant.github.io/fastembed/examples/Supported_Models/.

Type: str

Default: 'BAAI/bge-small-en-v1.5'
custom_text_split_function: a custom function to split a string into a list of strings.

Default is None, in which case the default function autogen.retrieve_utils.split_text_to_chunks is used.

Type: Callable

Default: None
custom_text_types: a list of file types to be processed.

Default is TEXT_FORMATS.

Type: list[str]

Default: ['txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml', 'pdf', 'mdx']
recursive: whether to search for documents recursively in dir_path.

Default is True.

Type: bool

Default: True
extra_docs: whether to add more documents to the collection.

Default is False.

Type: bool

Default: False
parallel: how many parallel workers to use for embedding.

The default of 0 uses the number of CPU cores.

Type: int

Default: 0
on_disk: whether to store the collection on disk.

Default is False.

Type: bool

Default: False
quantization_config: quantization configuration.

If None, quantization will be disabled.

Ref: https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection

Type: ForwardRef('models.QuantizationConfig') | None

Default: None
hnsw_config: HNSW configuration.

If None, default configuration will be used.

Ref: https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection

Type: ForwardRef('models.HnswConfigDiff') | None

Default: None
payload_indexing: whether to create a payload index for the document field.

Default is False.

Type: bool

Default: False
qdrant_client_options: the options for instantiating the Qdrant client.

Ref: https://github.com/qdrant/qdrant-client/blob/master/qdrant_client/qdrant_client.py#L36-L58.

Type: dict[str, typing.Any] | None

Default: {}
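The custom_text_split_function parameter above is described only as a function that splits a string into a list of strings. The sketch below shows one possible splitter under that contract. The name `split_on_blank_lines` and the `max_chars` budget are hypothetical and not part of the library; the character budget is an illustrative stand-in for a token budget.

```python
import re


def split_on_blank_lines(text: str, max_chars: int = 1000) -> list[str]:
    """Split text at blank lines, then pack paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph longer
    than max_chars is kept whole rather than cut mid-paragraph.
    """
    # A "blank line" here is a newline, optional whitespace, then a newline.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            # Start a new chunk, or keep growing the current one.
            current = candidate
        else:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Hypothetical wiring (requires the library to be installed):
# create_qdrant_from_dir(
#     dir_path="./docs",
#     custom_text_split_function=split_on_blank_lines,
# )
```

Because `max_chars` has a default, the function can be called with the text as its only argument, which matches the string-in, list-out description of the parameter.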