Parse

Parse a document with Tika and extract its content and metadata.

yaml
type: "io.kestra.plugin.tika.Parse"

Examples

Extract text from a file.

yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false

Extract text from an image using OCR.

yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true

Download and extract image metadata using Apache Tika.

yaml
id: parse-image-metadata-using-apache-tika
namespace: company.team

tasks:
  - id: get_image
    type: io.kestra.plugin.core.http.Download
    uri: https://kestra.io/blogs/2023-05-31-beginner-guide-kestra.jpg

  - id: tika
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.get_image.uri }}"
    store: false
    contentType: TEXT
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION

Download a PDF file and extract text from it using Apache Tika.

yaml
id: parse-pdf
namespace: company.team

tasks:
  - id: download_pdf
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/resolve/main/pdf/app_store.pdf

  - id: parse_text
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.download_pdf.uri }}"
    contentType: TEXT
    store: false

  - id: log_extracted_text
    type: io.kestra.plugin.core.log.Log
    message: "{{ outputs.parse_text.result.content }}"

Properties

charactersLimit integer | string

The maximum number of characters to include in the extracted string. Set to -1 (the default) to disable the write limit.
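As a minimal sketch of the limit described above, the flow below caps the extracted text. The flow id and the value 10000 are illustrative assumptions, not recommendations from the plugin.

```yaml
id: tika_parse_limited   # hypothetical flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    contentType: TEXT
    # Illustrative cap: keep at most 10,000 characters of extracted text.
    # Use -1 (the default) to disable the limit.
    charactersLimit: 10000
    store: false
```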

contentType string

Default XHTML

Possible Values

TEXT, XHTML, XHTML_NO_HEADER

The content type of the extracted text

extractEmbedded boolean | string

Default false

Whether to extract embedded documents.

from string

The file to parse

Must be an internal storage URI.

Use a Pebble expression that references an internal storage URI, e.g. {{ outputs.mytask.uri }}.

ocrOptions Parse-OcrOptions

Default

{
  "strategy": "NO_OCR"
}

Custom options for OCR processing

You need to install Tesseract to enable OCR processing.

store boolean | string

Default true

Whether to store the parsed result as an Ion-serialized data file in Kestra internal storage.

Outputs

result Parse-Parsed

uri string

Format uri
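The sketch below shows one way the `uri` output might be consumed when `store` is left at `true`. The flow id, task ids, and log message are hypothetical. It assumes that `uri` points at the stored Ion file, and that the inline `result` output is meant for runs with `store: false`, as in the PDF example above.

```yaml
id: tika_parse_stored   # hypothetical flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # With store enabled, the parsed data is written to internal storage.
    store: true

  - id: log_uri
    type: io.kestra.plugin.core.log.Log
    # Assumption: uri references the stored Ion file for downstream tasks.
    message: "Parsed result stored at {{ outputs.parse.uri }}"
```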

Definitions

io.kestra.plugin.tika.Parse-OcrOptions

enableImagePreprocessing boolean | string

Whether to enable image preprocessing.

When enabled, Apache Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract. This requires the preprocessing dependencies, such as ImageMagick, to be installed.

language string

Language used for OCR.

strategy string

Default NO_OCR

Possible Values

AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION

OCR strategy to use for OCR processing.

You need to install Tesseract, along with the Tesseract language pack for the chosen language, to enable OCR processing.
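The options above could be combined as in the following sketch. The flow id is hypothetical. The value `eng` is an assumption: it is the language code Tesseract conventionally uses for English, and it only works if the matching language pack is installed on the worker.

```yaml
id: tika_parse_ocr_options   # hypothetical flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: ocr
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    contentType: TEXT
    store: false
    ocrOptions:
      # Run OCR only, skipping regular text extraction.
      strategy: OCR_ONLY
      # Assumed Tesseract language code; requires the English language pack.
      language: eng
      # Requires the image preprocessing dependencies (e.g. ImageMagick).
      enableImagePreprocessing: true
```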

io.kestra.plugin.tika.Parse-Parsed

content string

embedded object

SubType string

metadata object
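To illustrate the `Parsed` structure, the sketch below logs the `metadata` object next to the extracted `content`. The flow and task ids are hypothetical. The metadata keys depend on the parsed file, so the example prints the whole object rather than assuming any particular key.

```yaml
id: tika_parse_metadata   # hypothetical flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    contentType: TEXT
    # Keep the result inline so it is available under outputs.parse.result.
    store: false

  - id: log_metadata
    type: io.kestra.plugin.core.log.Log
    message: "Metadata: {{ outputs.parse.result.metadata }}"

  - id: log_content
    type: io.kestra.plugin.core.log.Log
    message: "Content: {{ outputs.parse.result.content }}"
```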
