Ingest documents into an embedding store
Currently supports text documents (TXT, HTML, Markdown).
type: "io.kestra.plugin.ai.rag.IngestDocument"
Examples
Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.
id: document_ingestion
namespace: company.ai
tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
Properties
embeddings *Required Non-dynamic
One of: Chroma, Elasticsearch, KestraKVStore, Milvus, MongoDBAtlas, PGVector, Pinecone, Qdrant, Redis, Weaviate
Embedding store provider
provider *Required Non-dynamic
One of: AmazonBedrock, Anthropic, AzureOpenAI, DeepSeek, GoogleGemini, GoogleVertexAI, MistralAI, Ollama, OpenAI
Language model provider
Must be configured with an embedding model.
documentSplitter Non-dynamic IngestDocument-DocumentSplitter
Document splitter
drop boolean or string
Default: false
Drop the store before ingestion (useful for testing)
fromExternalURLs array
List of document URLs from external sources
fromInternalURIs array
List of internal storage URIs for documents
Pebble expression referencing an internal storage URI, e.g. {{ outputs.mytask.uri }}.
fromPath string
Path in the task working directory containing documents to ingest
Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).
metadata object
Additional metadata to add to all ingested documents
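The input and metadata properties above can be combined as in the following sketch. It assumes an upstream `io.kestra.plugin.core.http.Download` task whose `uri` output points at a file in internal storage; the task ids, the flow id, and the `source` metadata key are illustrative and not part of this reference.

```yaml
id: ingest_from_internal_storage   # hypothetical flow id
namespace: company.ai

tasks:
  # Assumed upstream task: downloads a file and exposes it as outputs.download.uri
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md

  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    # Pebble expression resolving to an internal storage URI
    fromInternalURIs:
      - "{{ outputs.download.uri }}"
    # Added to every ingested document; the key and value are illustrative
    metadata:
      source: release-blog
```

`drop` is left at its default here, so the store is not dropped before ingestion.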
Outputs
embeddingStoreOutputs object
Additional outputs from the embedding store
ingestedDocuments integer
Number of ingested documents
inputTokenCount integer
Input token count
outputTokenCount integer
Output token count
totalTokenCount integer
Total token count
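As a rough illustration of how these outputs might be consumed, the fragment below logs the document and token counts from a task with id `ingest` (as in the example at the top of this page). The `report` task id and the message wording are assumptions.

```yaml
tasks:
  # ... the `ingest` task from the example above goes here ...

  # Hypothetical follow-up task that reads the IngestDocument outputs
  - id: report
    type: io.kestra.plugin.core.log.Log
    message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents ({{ outputs.ingest.totalTokenCount }} tokens in total)"
```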
Definitions
Azure OpenAI Model Provider
endpoint *Required string
API endpoint
The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/
modelName *Required string
Model name
type *Required object
apiKey string
API Key
clientId string
Client ID
clientSecret string
Client secret
serviceVersion string
API version
tenantId string
Tenant ID
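A minimal sketch of an Azure OpenAI `provider` block built from the properties above. The fully qualified type name is inferred from the naming pattern of the other providers, and the resource name, model name, and KV key are placeholders.

```yaml
provider:
  # Type name assumed from the io.kestra.plugin.ai.provider.* pattern
  type: io.kestra.plugin.ai.provider.AzureOpenAI
  # Placeholder resource; follows the https://{resource}.openai.azure.com/ format
  endpoint: https://my-resource.openai.azure.com/
  # Placeholder; must refer to an embedding model
  modelName: text-embedding-3-small
  # Hypothetical KV key holding the API key
  apiKey: "{{ kv('AZURE_OPENAI_API_KEY') }}"
```

`clientId`, `clientSecret`, and `tenantId` are optional and are presumably an alternative to `apiKey`; this page does not say how the two interact.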
PGVector Embedding Store
database *Required string
The database name
host *Required string
The database server host
password *Required string
The database password
port *Required integer or string
The database server port
table *Required string
The table to store embeddings in
type *Required object
user *Required string
The database user
useIndex boolean or string
Default: false
Whether to use an IVFFlat index
An IVFFlat index divides vectors into lists and then searches the subset of those lists closest to the query vector. It builds faster and uses less memory than HNSW but has lower query performance (in terms of the speed-recall tradeoff).
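The PGVector properties above could be wired into the `embeddings` block roughly as follows. The type name is assumed from the naming pattern, and the host, database, table, and KV keys are placeholders.

```yaml
embeddings:
  # Type name assumed from the io.kestra.plugin.ai.embeddings.* pattern
  type: io.kestra.plugin.ai.embeddings.PGVector
  host: localhost          # placeholder
  port: 5432               # default PostgreSQL port
  database: postgres       # placeholder
  user: "{{ kv('PG_USER') }}"           # hypothetical KV key
  password: "{{ kv('PG_PASSWORD') }}"   # hypothetical KV key
  table: embeddings        # placeholder table name
  # Opt in to the IVFFlat index described above (defaults to false)
  useIndex: true
```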
Qdrant Embedding Store
apiKey *Required string
The API key
collectionName *Required string
The collection name
host *Required string
The database server host
port *Required integer or string
The database server port
type *Required object
Google VertexAI Model Provider
endpoint *Required string
Endpoint URL
location *Required string
Project location
modelName *Required string
Model name
project *Required string
Project ID
type *Required object
Google Gemini Model Provider
apiKey *Required string
API Key
modelName *Required string
Model name
type *Required object
MongoDB Atlas Embedding Store
collectionName *Required string
The collection name
host *Required string
The host
indexName *Required string
The index name
scheme *Required string
The scheme (e.g., mongodb+srv)
type *Required object
createIndex boolean or string
Create the index
database string
The database
metadataFieldNames array
The metadata field names
options object
The connection string options
password string
The password
username string
The username
Mistral AI Model Provider
apiKey *Required string
API Key
modelName *Required string
Model name
type *Required object
baseUrl string
API base URL
In-memory embedding store that stores data as Kestra KV pairs
type *Required object
kvName string
Default: {{flow.id}}-embedding-store
The name of the KV pair to use
Chroma Embedding Store
baseUrl *Required string
The database base URL
collectionName *Required string
The collection name
type *Required object
Redis Embedding Store
host *Required string
The database server host
port *Required integer or string
The database server port
type *Required object
indexName string
Default: embedding-index
The index name
io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth
password string
Basic authorization password
username string
Basic authorization username
Milvus Embedding Store
token *Required string
Token
Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.
type *Required object
autoFlushOnDelete boolean or string
Auto flush on delete
If true, flush after delete operations.
autoFlushOnInsert boolean or string
Auto flush on insert
If true, flush after insert operations. Setting it to false can improve throughput.
collectionName string
Collection name
Target collection. Created automatically if it does not exist. Default: "default".
consistencyLevel string
Consistency level
Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).
databaseName string
Database name
Logical database to use. If not provided, the default database is used.
host string
Host
Milvus host name (used when uri is not set). Default: "localhost".
idFieldName string
ID field name
Field name for document IDs. Default depends on collection schema.
indexType string
Index type
Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.
metadataFieldName string
Metadata field name
Field name for metadata. Default depends on collection schema.
metricType string
Metric type
Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.
password string
Password
Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md
port integer or string
Port
Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.
retrieveEmbeddingsOnSearch boolean or string
Retrieve embeddings on search
If true, return stored embeddings along with matches. Default: false.
textFieldName string
Text field name
Field name for original text. Default depends on collection schema.
uri string
URI
Connection URI. Use either uri or host/port (not both).
Examples:
- gRPC (typical): "milvus://host:19530"
- HTTP: "http://host:9091"
username string
Username
Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md
vectorFieldName string
Vector field name
Field name for the embedding vector. Must match the index definition and embedding dimensionality.
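A sketch of a Milvus `embeddings` block using the host/port form rather than `uri`. The type name is assumed from the naming pattern; the collection name, KV key, and the index and metric choices are illustrative only.

```yaml
embeddings:
  # Type name assumed from the io.kestra.plugin.ai.embeddings.* pattern
  type: io.kestra.plugin.ai.embeddings.Milvus
  # Hypothetical KV key; token is marked as required in this reference
  token: "{{ kv('MILVUS_TOKEN') }}"
  # Either host/port or uri, not both
  host: localhost
  port: 19530
  # uri: "milvus://localhost:19530"   # alternative to host/port
  collectionName: kestra_docs   # placeholder; created if it does not exist
  indexType: HNSW               # illustrative choice
  metricType: COSINE            # should match the embedding provider's expected metric
  autoFlushOnInsert: false      # may improve throughput, per the description above
```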
DeepSeek Model Provider
apiKey *Required string
API Key
modelName *Required string
Model name
type *Required object
baseUrl string
Default: https://api.deepseek.com/v1
API base URL
Pinecone Embedding Store
apiKey *Required string
The API key
cloud *Required string
The cloud provider
index *Required string
The index
region *Required string
The cloud provider region
type *Required object
namespace string
The namespace (default will be used if not provided)
Anthropic AI Model Provider
apiKey *Required string
API Key
modelName *Required string
Model name
type *Required object
io.kestra.plugin.ai.rag.IngestDocument-DocumentSplitter
maxOverlapSizeInChars *Required integer
Maximum overlap size (characters). Only full sentences are considered for overlap.
maxSegmentSizeInChars *Required integer
Maximum segment size (characters)
splitter string
Default: RECURSIVE
Possible values: RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD
DocumentSplitter type
Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
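For illustration, a `documentSplitter` block for the IngestDocument task might look like the fragment below. The two size values are arbitrary examples, not recommendations from this reference.

```yaml
documentSplitter:
  # RECURSIVE is the default; shown explicitly for clarity
  splitter: RECURSIVE
  # Illustrative sizes, in characters; both properties are required
  maxSegmentSizeInChars: 1000
  maxOverlapSizeInChars: 100
```

With these values, each segment holds at most 1000 characters, and consecutive segments may share up to 100 characters of overlap made of full sentences.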
Weaviate Embedding Store
apiKey *Required string
API key
Weaviate API key. Omit for local deployments without auth.
host *Required string
Host
Cluster host name without protocol, e.g., "abc123.weaviate.network".
type *Required object
avoidDups boolean or string
Avoid duplicates
If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.
consistencyLevel string
Possible values: ONE, QUORUM, ALL
Consistency level
Write consistency: ONE, QUORUM (default), or ALL.
grpcPort integer or string
gRPC port
Port for gRPC if enabled (e.g., 50051).
metadataFieldName string
Metadata field name
Field used to store metadata. Defaults to "_metadata" if not set.
metadataKeys array
Metadata keys
The list of metadata keys to store - if not provided, it will default to an empty list.
objectClass string
Object class
Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.
port integer or string
Port
Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.
scheme string
Scheme
Cluster scheme: "https" (recommended) or "http".
securedGrpc boolean or string
Secure gRPC
Whether the gRPC connection is secured (TLS).
useGrpcForInserts boolean or string
Use gRPC for batch inserts
If true, use gRPC for batch inserts. HTTP remains required for search operations.
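A hedged sketch of a Weaviate `embeddings` block based on the properties above. The type name is assumed from the naming pattern; the host reuses the sample value from the `host` description, and the object class, metadata key, and KV key are placeholders.

```yaml
embeddings:
  # Type name assumed from the io.kestra.plugin.ai.embeddings.* pattern
  type: io.kestra.plugin.ai.embeddings.Weaviate
  apiKey: "{{ kv('WEAVIATE_API_KEY') }}"   # hypothetical KV key
  scheme: https
  host: abc123.weaviate.network   # host name without protocol
  objectClass: Documents          # placeholder; must start with an uppercase letter
  consistencyLevel: QUORUM        # the documented default, shown explicitly
  avoidDups: true                 # hash-based IDs to prevent duplicates
  metadataKeys:
    - source                      # illustrative metadata key
```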
Ollama Model Provider
endpoint *Required string
Model endpoint
modelName *Required string
Model name
type *Required object
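As a sketch, an Ollama `provider` block could look like this. The type name is assumed from the naming pattern, the endpoint assumes a local Ollama server on its usual port, and the model name is only an example of an embedding model.

```yaml
provider:
  # Type name assumed from the io.kestra.plugin.ai.provider.* pattern
  type: io.kestra.plugin.ai.provider.Ollama
  # Assumes Ollama is running locally on its usual port
  endpoint: http://localhost:11434
  # Example embedding model; substitute one that is pulled on your server
  modelName: nomic-embed-text
```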
OpenAI Model Provider
apiKey *Required string
API Key
modelName *Required string
Model name
type *Required object
baseUrl string
API base URL
io.kestra.plugin.ai.rag.IngestDocument-InlineDocument
content *Required string
Document content
metadata object
Document metadata
io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection
hosts *Required array
Min items: 1
List of HTTP Elasticsearch servers
Must be a URI with scheme and port, such as https://example.com:9200
basicAuth Elasticsearch-ElasticsearchConnection-BasicAuth
Basic authorization configuration
headers array
List of HTTP headers to be sent with every request
Each item is a "key: value" string, e.g., Authorization: Token XYZ
pathPrefix string
Path prefix for all HTTP requests
If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.
strictDeprecationMode boolean or string
Treat responses with deprecation warnings as failures
trustAllSsl boolean or string
Trust all SSL CA certificates
Use this if the server uses a self-signed SSL certificate
Elasticsearch Embedding Store
connection *Required Elasticsearch-ElasticsearchConnection
indexName *Required string
The name of the index to store embeddings
type *Required object
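The connection and store definitions above nest as shown in this sketch. The host reuses the sample URI from the `hosts` description; the index name and KV keys are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Elasticsearch
  indexName: embeddings   # placeholder index name
  connection:
    # At least one URI with scheme and port
    hosts:
      - https://example.com:9200
    basicAuth:
      username: "{{ kv('ES_USERNAME') }}"   # hypothetical KV key
      password: "{{ kv('ES_PASSWORD') }}"   # hypothetical KV key
    # Only for servers with a self-signed certificate
    # trustAllSsl: true
```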
Amazon Bedrock Model Provider
accessKeyId *Required string
AWS Access Key ID
modelName *Required string
Model name
secretAccessKey *Required string
AWS Secret Access Key
type *Required object
modelType string
Default: COHERE
Possible values: COHERE, TITAN
Amazon Bedrock Embedding Model Type
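A minimal sketch of an Amazon Bedrock `provider` block. The type name is assumed from the naming pattern, the KV keys are hypothetical, and the model ID is only an example of a Titan embedding model; `modelType` is set to match it because the default is COHERE.

```yaml
provider:
  # Type name assumed from the io.kestra.plugin.ai.provider.* pattern
  type: io.kestra.plugin.ai.provider.AmazonBedrock
  accessKeyId: "{{ kv('AWS_ACCESS_KEY_ID') }}"           # hypothetical KV key
  secretAccessKey: "{{ kv('AWS_SECRET_ACCESS_KEY') }}"   # hypothetical KV key
  # Example Titan embedding model ID; replace with the model enabled in your account
  modelName: amazon.titan-embed-text-v1
  # Defaults to COHERE, so set TITAN explicitly for a Titan model
  modelType: TITAN
```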