Pipelines reference - Preparers

This reference documentation for Pipelines preparers includes information on the functions and views available in the aidb extension related to preparers. See Usage and Examples for more details.

Views

aidb.preparers

Also referenceable as aidb.preps, the aidb.preparers view contains information about the configured data preparers. It includes the following columns:

| Column | Type | Description |
|---|---|---|
| id | INTEGER | |
| name | TEXT | Name of the preparer. |
| operation | aidb.DataPreparationOperation | The kind of processing step to perform. |
| destination_schema | TEXT | Schema of the destination table to store the output data. |
| destination_table | TEXT | Name of the destination table to store the output data. |
| destination_key_column | TEXT | Column of the destination table that references the key in source data. |
| destination_data_column | TEXT | Column of the destination table to store the output data. |
| options | JSONB | Configuration options for the data preparation operation. Uses the same API as the data preparation primitives. |
| source_type | TEXT | Type of source data the preparer is working with. Can be either 'Table' or 'Volume'. |
| source_schema | TEXT | Schema of the table with the source data the preparer will process. Applies only to preparers of Table source type. |
| source_table | TEXT | Name of the table with the source data the preparer will process. Applies only to preparers of Table source type. |
| source_data_column | TEXT | Column in the source table with the source data the preparer will process. Applies only to preparers of Table source type. |
| source_key_column | TEXT | Name of the key column in the source table for reference with the output processed data. Applies only to preparers of Table source type. |
| source_volume_name | TEXT | Name of the volume to use as a data source. Applies only to preparers of Volume source type. |
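
As a quick illustration of how the view might be queried, the sketch below lists the configured preparers and then narrows the result to those with a Table source. It uses only the columns listed above; which rows come back depends on the preparers defined in your database.

```sql
-- List every configured preparer with its operation and source/destination.
SELECT name, operation, source_type, source_table, source_volume_name,
       destination_table
FROM aidb.preparers
ORDER BY name;

-- Show only preparers that read from a table (source_type is 'Table' or 'Volume').
SELECT name, source_schema, source_table, source_data_column, source_key_column
FROM aidb.preparers
WHERE source_type = 'Table';
```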

Types

aidb.DataPreparationOperation

The aidb.DataPreparationOperation type is an enum that represents the different types of preprocessing steps that can be performed:

  • ChunkText
  • SummarizeText
  • ParseHtml
  • ParsePdf
  • PerformOcr

Functions

aidb.create_table_preparer

Creates a preparer with a source data table.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | TEXT | Required | Name of the preparer |
| operation | aidb.DataPreparationOperation | Required | Type of data preparation operation |
| source_table | TEXT | Required | Name of the source data table |
| source_data_column | TEXT | Required | Column in the source table containing the raw data |
| destination_table | TEXT | Required | Name of the destination table |
| destination_data_column | TEXT | Required | Column in the destination table for processed data |
| source_key_column | TEXT | 'id' | Unique column in the source table to use as key to reference the rows. |
| destination_key_column | TEXT | 'id' | Key column in the destination table that references the source_key_column |
| options | JSONB | '{}'::JSONB | Configuration options for the data preparation operation. Uses the same API as the data preparation primitives. |
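
A minimal sketch of a call that sets the key columns explicitly rather than relying on the 'id' defaults. The preparer, table, and column names (html_parser, web_pages, raw_html, page_id, parsed_pages, parsed_text) are hypothetical placeholders, and the source table is assumed to exist already.

```sql
-- Hypothetical names throughout; only the parameter names come from the table above.
SELECT aidb.create_table_preparer(
    name => 'html_parser',
    operation => 'ParseHtml',
    source_table => 'web_pages',
    source_data_column => 'raw_html',
    destination_table => 'parsed_pages',
    destination_data_column => 'parsed_text',
    source_key_column => 'page_id',       -- overrides the default 'id'
    destination_key_column => 'page_id'   -- overrides the default 'id'
);
```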

aidb.create_preparer_for_table (deprecated)

Replaced by aidb.create_table_preparer, which takes the same arguments.

aidb.create_volume_preparer

Creates a preparer for a given PGFS volume.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | TEXT | Required | Name of the preparer |
| operation | aidb.DataPreparationOperation | Required | Type of data preparation operation |
| source_volume_name | TEXT | Required | Name of the source volume containing the raw data |
| destination_table | TEXT | Required | Name of the destination table |
| destination_data_column | TEXT | Required | Column in the destination table for processed data |
| destination_key_column | TEXT | 'id' | Key column in the destination table that uniquely identifies the processed data |
| options | JSONB | '{}'::JSONB | Configuration options for the data preparation operation. Uses the same API as the data preparation primitives. |
| invoker_role | TEXT | NULL | Role owning the tables, pipelines, and background job execution. |
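
A sketch of what a volume-based preparer call might look like, using named arguments from the table above. It assumes a PGFS volume called docs_volume has already been created; that name, along with pdf_parser, parsed_pdfs, and parsed_text, is a hypothetical placeholder.

```sql
-- Assumes a PGFS volume named 'docs_volume' already exists (hypothetical name).
SELECT aidb.create_volume_preparer(
    name => 'pdf_parser',
    operation => 'ParsePdf',
    source_volume_name => 'docs_volume',
    destination_table => 'parsed_pdfs',
    destination_data_column => 'parsed_text',
    destination_key_column => 'id'
);
```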

aidb.create_preparer_for_volume (deprecated)

Replaced by aidb.create_volume_preparer, which takes the same arguments.

aidb.bulk_data_preparation

Executes the configured data preparation operation on all data from the specified preparer’s source.

Parameters

| Parameter | Type | Description |
|---|---|---|
| preparer_name | TEXT | Name of the preparer |
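
For illustration, a bulk run over an existing preparer could look like the following. The preparer name my_chunker and the destination table chunked_data are assumptions borrowed from the examples elsewhere in this reference; substitute your own.

```sql
-- Process all source data for the preparer named 'my_chunker' (assumed to exist).
SELECT aidb.bulk_data_preparation('my_chunker');

-- Inspect a few rows of the output in the preparer's destination table.
SELECT * FROM chunked_data LIMIT 5;
```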

aidb.set_auto_preparer

Sets the automatic processing mode for a preparer. Use it to enable or disable automatic data preparation: Live mode enables Postgres trigger-based automation, and Disabled turns off all automation.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| preparer_name | TEXT | | Name of preparer |
| mode | aidb.PipelineAutoProcessingMode | | Desired processing mode |

Example

SELECT aidb.set_auto_preparer('test_preparer', 'Live');
SELECT aidb.set_auto_preparer('test_preparer', 'Disabled');

aidb.set_preparer_auto_processing (deprecated)

Replaced by aidb.set_auto_preparer, which takes the same arguments.

aidb.delete_preparer

Deletes the preparer's configuration.

Parameters

| Parameter | Type | Description |
|---|---|---|
| preparer_name | TEXT | Name of preparer to delete |

Note

This function doesn't delete the destination table or any data in it.
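
A short sketch of a full cleanup, given that the destination table is left in place. The names my_chunker and chunked_data are assumed from the examples elsewhere in this reference; the DROP TABLE step is standard Postgres and is only needed if you no longer want the prepared data.

```sql
-- Remove the preparer's configuration (assumes a preparer named 'my_chunker').
SELECT aidb.delete_preparer('my_chunker');

-- Optional: drop the destination table yourself if the output is no longer needed.
DROP TABLE IF EXISTS chunked_data;
```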

Helper functions

Helper functions simplify the creation of configuration JSON for data preparation operations by providing a structured way to specify options.

aidb.chunk_text_config

Creates a configuration JSON object for the ChunkText operation.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| desired_length | INTEGER | Required | Target chunk size (unit depends on strategy) |
| max_length | INTEGER | NULL | Maximum chunk size (unit depends on strategy) |
| overlap_length | INTEGER | NULL | Amount to overlap between consecutive chunks (unit depends on strategy) |
| strategy | TEXT | NULL | Chunking strategy: 'chars' (default) or 'words'. Determines the unit for all size parameters |

Returns

JSONB configuration object for use with ChunkText operation.

Examples

-- Basic chunking with desired length only
SELECT aidb.chunk_text_config(100);

-- Chunking with max length
SELECT aidb.chunk_text_config(100, 150);

-- Chunking with overlap
SELECT aidb.chunk_text_config(100, 150, 20);

-- Character-based chunking with overlap
SELECT aidb.chunk_text_config(100, 120, 10, 'chars');

-- Use in a preparer
SELECT aidb.create_table_preparer(
    name => 'my_chunker',
    operation => 'ChunkText',
    source_table => 'source_data',
    source_data_column => 'text_content',
    destination_table => 'chunked_data',
    options => aidb.chunk_text_config(120, 150, 15)
);

aidb.summarize_text_config

Creates a configuration JSON object for the SummarizeText operation.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | TEXT | Required | Name of the model to use for summarization |
| chunk_config | JSONB | NULL | Optional chunking configuration (created with chunk_text_config) |
| prompt | TEXT | NULL | Custom prompt to guide the summarization |
| strategy | TEXT | NULL | Summarization strategy: 'append' (default) or 'reduce' |
| reduction_factor | INTEGER | NULL | Used with 'reduce' strategy to control how aggressively text is reduced in each iteration (default: 3) |

Returns

JSONB configuration object for use with the SummarizeText operation or the summarize_text_aggregate function.

Examples

-- Basic summarization with model only
SELECT aidb.summarize_text_config('my_t5_model');

-- Summarization with chunking
SELECT aidb.summarize_text_config(
    'my_t5_model',
    aidb.chunk_text_config(100)
);

-- Summarization with custom prompt and append strategy
SELECT aidb.summarize_text_config(
    'my_t5_model',
    aidb.chunk_text_config(80, 80, 5, 'words'),
    'create a concise summary'
);

-- Summarization with reduce strategy
SELECT aidb.summarize_text_config(
    'my_t5_model',
    aidb.chunk_text_config(100, 100, 5, 'words'),
    'summarize the key points',
    'reduce',
    5
);

-- Use in a preparer
SELECT aidb.create_table_preparer(
    name => 'my_summarizer',
    operation => 'SummarizeText',
    source_table => 'source_data',
    source_data_column => 'text_content',
    destination_table => 'summarized_data',
    options => aidb.summarize_text_config('my_t5_model')
);

-- Use with aggregate function
SELECT
    category,
    aidb.summarize_text_aggregate(
        content,
        aidb.summarize_text_config('my_t5_model')::json ORDER BY id
    ) AS summary
FROM my_table
GROUP BY category;