IngestDocument

Ingest documents into an embedding store

Currently supports text documents (TXT, HTML, Markdown).

```yaml
type: "io.kestra.plugin.ai.rag.IngestDocument"
```

Examples

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.

```yaml
id: document_ingestion
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
```

Properties

embeddings *Chroma | Elasticsearch | KestraKVStore | Milvus | MongoDBAtlas | PGVector | Pinecone | Qdrant | Redis | Weaviate

Embedding store provider

provider *AmazonBedrock | Anthropic | AzureOpenAI | DeepSeek | GoogleGemini | GoogleVertexAI | MistralAI | Ollama | OpenAI

Language model provider

Must be configured with an embedding model.

documentSplitter IngestDocument-DocumentSplitter

Document splitter

drop boolean | string

Default false

Drop the store before ingestion (useful for testing)

fromDocuments array

SubType IngestDocument-InlineDocument

List of inline documents
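
A minimal sketch of inline ingestion, assuming each item follows the IngestDocument-InlineDocument definition further down (`content` required, `metadata` optional). The flow ID, document text, and metadata key are illustrative placeholders.

```yaml
id: inline_document_ingestion   # hypothetical flow ID
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromDocuments:
      # Each item is one inline document: required content, optional metadata
      - content: "Kestra is an open-source orchestration platform."
        metadata:
          source: inline   # illustrative metadata key
```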

fromExternalURLs array

SubType string

List of document URLs from external sources

fromInternalURIs array

SubType string

List of internal storage URIs for documents

Each item is a Pebble expression that references an internal storage URI, e.g., {{ outputs.mytask.uri }}.
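
As a sketch of how an internal storage URI might be passed in, the flow below assumes a preceding `io.kestra.plugin.core.http.Download` task whose `uri` output points to the downloaded file. The task IDs and flow ID are placeholders.

```yaml
id: internal_uri_ingestion   # hypothetical flow ID
namespace: company.ai

tasks:
  # Assumed upstream task: stores the file in internal storage and exposes outputs.download.uri
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md

  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromInternalURIs:
      - "{{ outputs.download.uri }}"
```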

fromPath string

Path in the task working directory containing documents to ingest

Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).

metadata object

SubType string

Additional metadata to add to all ingested documents
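
A short fragment showing how `metadata` might look inside an IngestDocument task. The keys and values are illustrative; since the SubType is string, the version number is quoted so it is not parsed as a float.

```yaml
# Fragment of an IngestDocument task; keys below are placeholders
metadata:
  team: docs
  release: "0.24"   # quoted to keep the value a string
```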

Outputs

embeddingStoreOutputs object

Additional outputs from the embedding store

ingestedDocuments integer

Number of ingested documents

inputTokenCount integer

Input token count

outputTokenCount integer

Output token count

totalTokenCount integer

Total token count
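
The outputs above can be read by downstream tasks with Pebble expressions. The fragment below is a sketch that assumes the ingestion task has the ID `ingest` (as in the example at the top) and logs two of its outputs with the core `Log` task. Token counts may be empty if the provider does not report them.

```yaml
# Appended to the tasks list, after the task with id "ingest"
  - id: report
    type: io.kestra.plugin.core.log.Log
    message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents ({{ outputs.ingest.inputTokenCount }} input tokens)"
```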

Definitions

Azure OpenAI Model Provider

endpoint *string

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

modelName *string

Model name

type *object

apiKey string

API Key

clientId string

Client ID

clientSecret string

Client secret

serviceVersion string

API version

tenantId string

Tenant ID
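
A possible API-key configuration for this provider, as a sketch. The `type` value is assumed from the naming pattern of the other providers on this page, and the resource name, model name, and KV key are placeholders.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.AzureOpenAI   # assumed from the provider naming pattern
  endpoint: "https://my-resource.openai.azure.com/"   # placeholder resource name
  modelName: text-embedding-3-small                   # placeholder embedding model name
  apiKey: "{{ kv('AZURE_OPENAI_API_KEY') }}"          # hypothetical KV key
```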

PGVector Embedding Store

database *string

The database name

host *string

The database server host

password *string

The database password

port *integer | string

The database server port

table *string

The table to store embeddings in

type *object

user *string

The database user

useIndex boolean | string

Default false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
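
A sketch of a PGVector store configuration using the properties above. The `type` value is assumed from the naming pattern of the other embedding stores, and the host, database, table, and KV keys are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.PGVector   # assumed from the store naming pattern
  host: localhost
  port: 5432
  database: postgres
  user: "{{ kv('PG_USER') }}"           # hypothetical KV key
  password: "{{ kv('PG_PASSWORD') }}"   # hypothetical KV key
  table: embeddings
  useIndex: true   # opt in to the IVFFlat index; the default is false
```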

Qdrant Embedding Store

apiKey *string

The API key

collectionName *string

The collection name

host *string

The database server host

port *integer | string

The database server port

type *object

Google VertexAI Model Provider

endpoint *string

Endpoint URL

location *string

Project location

modelName *string

Model name

project *string

Project ID

type *object

Google Gemini Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

MongoDB Atlas Embedding Store

collectionName *string

The collection name

host *string

The host

indexName *string

The index name

scheme *string

The scheme (e.g., mongodb+srv)

type *object

createIndex boolean | string

Create the index

database string

The database

metadataFieldNames array

SubType string

The metadata field names

options object

The connection string options

password string

The password

username string

The username

Mistral AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

API base URL

In-memory embedding store that stores data as Kestra KV pairs

type *object

kvName string

Default {{flow.id}}-embedding-store

The name of the KV pair to use

Chroma Embedding Store

baseUrl *string

The database base URL

collectionName *string

The collection name

type *object

Redis Embedding Store

host *string

The database server host

port *integer | string

The database server port

type *object

indexName string

Default embedding-index

The index name

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth

password string

Basic authorization password

username string

Basic authorization username

Milvus Embedding Store

token *string

Token

Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.

type *object

autoFlushOnDelete boolean | string

Auto flush on delete

If true, flush after delete operations.

autoFlushOnInsert boolean | string

Auto flush on insert

If true, flush after insert operations. Setting it to false can improve throughput.

collectionName string

Collection name

Target collection. Created automatically if it does not exist. Default: "default".

consistencyLevel string

Consistency level

Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).

databaseName string

Database name

Logical database to use. If not provided, the default database is used.

host string

Host

Milvus host name (used when uri is not set). Default: "localhost".

idFieldName string

ID field name

Field name for document IDs. Default depends on collection schema.

indexType string

Index type

Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.

metadataFieldName string

Metadata field name

Field name for metadata. Default depends on collection schema.

metricType string

Metric type

Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.

password string

Password

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

port integer | string

Port

Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.

retrieveEmbeddingsOnSearch boolean | string

Retrieve embeddings on search

If true, return stored embeddings along with matches. Default: false.

textFieldName string

Text field name

Field name for original text. Default depends on collection schema.

uri string

URI

Connection URI. Use either uri OR host/port (not both). Examples:

gRPC (typical): "milvus://host:19530"
HTTP: "http://host:9091"

username string

Username

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

vectorFieldName string

Vector field name

Field name for the embedding vector. Must match the index definition and embedding dimensionality.
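
A sketch of a Milvus store configured with `host` and `port` rather than `uri`. The `type` value is assumed from the naming pattern of the other embedding stores; the collection name and KV key are placeholders, and the index and metric choices are examples drawn from the values listed above.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Milvus   # assumed from the store naming pattern
  token: "{{ kv('MILVUS_TOKEN') }}"             # hypothetical KV key
  # Set either host/port or uri, not both
  host: localhost
  port: 19530
  collectionName: kestra_docs   # placeholder; created automatically if missing
  indexType: HNSW
  metricType: COSINE
  autoFlushOnInsert: false      # may improve throughput, per the property description
```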

DeepSeek Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://api.deepseek.com/v1

API base URL

Pinecone Embedding Store

apiKey *string

The API key

cloud *string

The cloud provider

index *string

The index

region *string

The cloud provider region

type *object

namespace string

The namespace (default will be used if not provided)

Anthropic AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

io.kestra.plugin.ai.rag.IngestDocument-DocumentSplitter

maxOverlapSizeInChars *integer

Maximum overlap size (characters). Only full sentences are considered for overlap.

maxSegmentSizeInChars *integer

Maximum segment size (characters)

splitter string

Default RECURSIVE

Possible Values

RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD

DocumentSplitter type

Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
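
A fragment showing how the splitter might be set on an IngestDocument task. The sizes are illustrative values, not recommendations from the plugin.

```yaml
# Fragment of an IngestDocument task
documentSplitter:
  splitter: RECURSIVE          # default; paragraphs, then lines, sentences, words, characters
  maxSegmentSizeInChars: 1000  # illustrative segment size
  maxOverlapSizeInChars: 200   # illustrative overlap; only full sentences are used for overlap
```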

Weaviate Embedding Store

apiKey *string

API key

Weaviate API key. Omit for local deployments without auth.

host *string

Host

Cluster host name without protocol, e.g., "abc123.weaviate.network".

type *object

avoidDups boolean | string

Avoid duplicates

If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.

consistencyLevel string

Possible Values

ONE, QUORUM, ALL

Consistency level

Write consistency: ONE, QUORUM (default), or ALL.

grpcPort integer | string

gRPC port

Port for gRPC if enabled (e.g., 50051).

metadataFieldName string

Metadata field name

Field used to store metadata. Defaults to "_metadata" if not set.

metadataKeys array

SubType string

Metadata keys

List of metadata keys to store. Defaults to an empty list if not provided.

objectClass string

Object class

Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.

port integer | string

Port

Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.

scheme string

Scheme

Cluster scheme: "https" (recommended) or "http".

securedGrpc boolean | string

Secure gRPC

Whether the gRPC connection is secured (TLS).

useGrpcForInserts boolean | string

Use gRPC for batch inserts

If true, use gRPC for batch inserts. HTTP remains required for search operations.
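
A sketch of a Weaviate store configuration built from the properties above. The `type` value is assumed from the naming pattern of the other embedding stores; the host reuses the example from the `host` description, and the object class, metadata key, and KV key are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Weaviate   # assumed from the store naming pattern
  apiKey: "{{ kv('WEAVIATE_API_KEY') }}"          # hypothetical KV key
  host: abc123.weaviate.network                   # host name without protocol
  scheme: https
  objectClass: KestraDocs    # placeholder; must start with an uppercase letter
  consistencyLevel: QUORUM
  avoidDups: true            # derive IDs from a hash of each text segment
  metadataKeys:
    - source                 # illustrative metadata key
```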

Ollama Model Provider

endpoint *string

Model endpoint

modelName *string

Model name

type *object
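
A sketch of an Ollama provider pointing at a local server. The `type` value is assumed from the naming pattern of the other providers; the endpoint uses Ollama's usual local port, and the model name is an example embedding model that would need to be pulled beforehand.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.Ollama   # assumed from the provider naming pattern
  endpoint: http://localhost:11434            # typical local Ollama address
  modelName: nomic-embed-text                 # example embedding model
```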

OpenAI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

API base URL

io.kestra.plugin.ai.rag.IngestDocument-InlineDocument

content *string

Document content

metadata object

Document metadata

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection

hosts *array

SubType string

Min items 1

List of HTTP Elasticsearch servers

Must be a URI like https://example.com:9200 with scheme and port

basicAuth Elasticsearch-ElasticsearchConnection-BasicAuth

Basic authorization configuration

headers array

SubType string

List of HTTP headers to be sent with every request

Each item is a key: value string, e.g., Authorization: Token XYZ

pathPrefix string

Path prefix for all HTTP requests

If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.

strictDeprecationMode boolean | string

Treat responses with deprecation warnings as failures

trustAllSsl boolean | string

Trust all SSL CA certificates

Use this if the server uses a self-signed SSL certificate

Elasticsearch Embedding Store

connection *Elasticsearch-ElasticsearchConnection

indexName *string

The name of the index to store embeddings

type *object
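
A sketch that combines the Elasticsearch store with the connection and basic-auth definitions above. The host, index name, and KV keys are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Elasticsearch
  indexName: embeddings          # placeholder index name
  connection:
    hosts:
      - https://localhost:9200   # scheme and port are required
    basicAuth:
      username: "{{ kv('ES_USERNAME') }}"   # hypothetical KV key
      password: "{{ kv('ES_PASSWORD') }}"   # hypothetical KV key
    trustAllSsl: true            # only for servers with a self-signed certificate
```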

Amazon Bedrock Model Provider

accessKeyId *string

AWS Access Key ID

modelName *string

Model name

secretAccessKey *string

AWS Secret Access Key

type *object

modelType string

Default COHERE

Possible Values

COHERETITAN

Amazon Bedrock Embedding Model Type
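
A sketch of a Bedrock provider configured for a Titan embedding model. The `type` value is assumed from the naming pattern of the other providers, and the model ID and KV keys are placeholders to adapt to your account and region.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.AmazonBedrock   # assumed from the provider naming pattern
  accessKeyId: "{{ kv('AWS_ACCESS_KEY_ID') }}"           # hypothetical KV key
  secretAccessKey: "{{ kv('AWS_SECRET_ACCESS_KEY') }}"   # hypothetical KV key
  modelName: "amazon.titan-embed-text-v2:0"              # example Titan embedding model ID
  modelType: TITAN   # overrides the COHERE default to match the model family
```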
