How-to | Extract specific fields from documents
Learn how to extract specific fields from documents and organize them into a structured dataset without writing custom code.
In the Flow, select a folder containing your documents, such as PDFs, PPTX files, Word documents, or images.
Note
You can store the documents in a folder in Dataiku, or you can connect to external storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. In the latter case, your administrator must have configured the relevant connection to the external storage in Dataiku beforehand.
In the Actions panel, under the Visual Recipes section, click Extract content > Extract fields.
Name the output dataset and create the recipe.
In the Config step:
Select a VLM (Visual Language Model) from your configured connections.
Note
Use models that support structured output (such as those from OpenAI, Azure OpenAI, or Vertex AI).
(Optional) If you enable the Customize extraction prompt option, provide context to the model, such as the document type (e.g., invoices) or how to handle missing values.
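Structured output means the model is constrained to return JSON that matches a declared schema, rather than free-form text, which is what makes field extraction reliable. As a minimal illustration (the reply and field names below are hypothetical, not part of the recipe), this is the kind of check a structured-output pipeline performs on a model's JSON reply using only the standard library:

```python
import json

# Hypothetical raw reply from a VLM asked for structured output
reply = '{"invoice_number": "INV-0042", "total_amount": 199.99}'

# Expected fields and types, mirroring an extraction schema (assumed names)
expected = {"invoice_number": str, "total_amount": float}

record = json.loads(reply)
for field, ftype in expected.items():
    assert field in record, f"missing field: {field}"
    assert isinstance(record[field], ftype), f"type mismatch for {field}"
```

Models without structured-output support can emit replies that fail this kind of validation, which is why the recipe recommends providers that enforce it.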
In the Extraction Schema section of the Config step, define the structure of your output dataset:
Click + Add Field to define a new field.
Enter a field name and select a data type. The field name will become the column header in the output dataset.
(Optional) Provide a description to help the VLM locate the correct value.
Note
For lists or tables in the source document, use the array type and define subfields (for example, item_name or item_quantity).
Test the extraction:
Use the Document preview on the left to select a file from your folder.
Click Test Field Extraction to see a preview of how the VLM interprets your schema for that specific document.
If the extraction quality isn’t satisfactory, refine your field descriptions or try a different VLM available in your configured connections.
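To make the schema concrete, here is an illustrative rendering of an invoice-extraction schema with a scalar field, a date field, and an array field with subfields. The field names are hypothetical, and in Dataiku you define these fields through the recipe UI rather than by editing JSON; this fragment only shows the shape of what you are declaring:

```json
{
  "fields": [
    {"name": "invoice_number", "type": "string",
     "description": "Identifier printed at the top of the invoice"},
    {"name": "invoice_date", "type": "date",
     "description": "Issue date of the invoice"},
    {"name": "line_items", "type": "array", "subfields": [
      {"name": "item_name", "type": "string"},
      {"name": "item_quantity", "type": "integer"}
    ]}
  ]
}
```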
(Optional) In the Output step, configure the output:
Choose an update method.
Set how to handle type mismatches.
Indicate whether to expand arrays into multiple rows in the output dataset.
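To clarify what expanding arrays into multiple rows means: each element of an array field becomes its own output row, with the document's scalar fields repeated on every row. A minimal sketch in plain Python (field names are hypothetical, and this is not the recipe's actual implementation):

```python
def expand_arrays(record, array_field):
    """Yield one flat row per element of record[array_field],
    repeating the scalar fields on each row."""
    scalars = {k: v for k, v in record.items() if k != array_field}
    for item in record.get(array_field) or [{}]:
        row = dict(scalars)
        row.update({f"{array_field}.{k}": v for k, v in item.items()})
        yield row

# One extracted document with two line items...
extracted = {
    "invoice_number": "INV-0042",
    "line_items": [
        {"item_name": "Widget", "item_quantity": 3},
        {"item_name": "Gadget", "item_quantity": 1},
    ],
}

# ...becomes two rows, each repeating invoice_number alongside one item
rows = list(expand_arrays(extracted, "line_items"))
```

Without expansion, the array would instead be kept as a single JSON-valued cell in one row per document.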
Click Run to execute the recipe.
You can now use the output dataset in the Flow to:
Perform further transformations via visual recipes,
Train machine learning models, or
Leverage Generative AI features (such as using the Prompt recipe, classification, or enriching an LLM using the Embed dataset recipe).
This enables you to build a clean end-to-end pipeline from raw documents to actionable insights.
