Parse

yaml
type: "io.kestra.plugin.tika.Parse"

Parse a document and extract its content and metadata.

Examples

Extract text from a file.

yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false

Extract text from an image using OCR.

yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true

Properties

contentType

  • Type: string
  • Dynamic:
  • Required:
  • Default: XHTML
  • Possible Values:
    • TEXT
    • XHTML
    • XHTML_NO_HEADER

The content type of the extracted text.
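
As a minimal sketch, a task that requests plain text rather than the default XHTML might look like the following. The task id and the `file` input name are illustrative, not prescribed by the plugin.

```yaml
tasks:
  - id: parse_as_text
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # One of the possible values listed above; omit to keep the XHTML default
    contentType: TEXT
```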

extractEmbedded

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: false

Whether to extract embedded documents.

from

  • Type: string
  • Dynamic: ✔️
  • Required:

The file to parse.

Must be an internal storage URI.

ocrOptions

Custom options for OCR processing.

You need to install Tesseract to enable OCR processing.

store

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: true

Whether to store the parsed result as an Ion-serialized data file in Kestra internal storage.

Outputs

result

uri

  • Type: string
  • Required:
  • Format: uri
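
A hedged sketch of how a downstream task might read these outputs. It assumes the parsing task id is `parse` and that `result` is populated when `store` is `false`, while `uri` points to the stored file when `store` is `true`. The Log task is used purely for illustration.

```yaml
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    store: false

  - id: log_content
    type: io.kestra.plugin.core.log.Log
    # Assumption: result.content holds the extracted text when store is false
    message: "{{ outputs.parse.result.content }}"
```

With `store: true`, the same pattern would reference `{{ outputs.parse.uri }}` instead.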

Definitions

io.kestra.plugin.tika.Parse-OcrOptions

Properties

enableImagePreprocessing
  • Type: boolean
  • Dynamic:
  • Required:

Whether to enable image preprocessing.

Apache Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract, provided the required dependencies are installed and this option is enabled.

language
  • Type: string
  • Dynamic: ✔️
  • Required:

Language used for OCR.

strategy
  • Type: string
  • Dynamic:
  • Required:
  • Default: NO_OCR
  • Possible Values:
    • AUTO
    • NO_OCR
    • OCR_ONLY
    • OCR_AND_TEXT_EXTRACTION

Strategy to use for OCR processing.

You need to install Tesseract to enable OCR processing, along with the Tesseract language pack for the language you use.
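
The options above can be combined as in the following sketch. It assumes `language` takes a Tesseract language code (here `fra`, which requires the French language pack on the worker) and that ImageMagick is available for preprocessing. The task id and input name are illustrative.

```yaml
tasks:
  - id: parse_scanned
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    ocrOptions:
      # Run OCR only, skipping regular text extraction
      strategy: OCR_ONLY
      # Assumption: a Tesseract language code with its pack installed
      language: fra
      enableImagePreprocessing: true
```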

io.kestra.plugin.tika.Parse-Parsed

Properties

content
  • Type: string
  • Dynamic:
  • Required:

embedded
  • Type: object
  • SubType: string
  • Dynamic:
  • Required:

metadata
  • Type: object
  • Dynamic:
  • Required:
