How-to | Extract full text content into a dataset

Learn how to extract content from documents using the Extract content recipe in Dataiku.

  1. In the Flow, select a managed folder that contains documents of different formats (.pdf, .docx, .pptx, .txt, .md, .html, .png or .jpg).

    Note

    You can store the documents in a folder in Dataiku or connect to external storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. In the latter case, your administrator must first configure the corresponding connection in Dataiku.

  2. In the right panel, open the Actions tab, and click Extract content under the Visual Recipes section.

  3. In the New extract content recipe window, name the output dataset and click Create Recipe.

  4. In the Strategy tab, select an extraction strategy (text-only, visual-first, or custom).

  5. Click Run to execute the recipe.

    Screenshot of the settings page of an Extract content recipe.
  6. Once the recipe has finished running, explore the output dataset. The extracted content is spread across several columns, such as source_file, extracted_content, and content_id.
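To inspect the result outside the Dataiku interface, the rows can be regrouped by document. The following is a minimal sketch, not part of the recipe itself. It assumes the output dataset has been exported to CSV, that rows appear in document order, and that each row holds one piece of a document. The function name `text_by_source_file` and the sample rows are hypothetical; only the column names come from the step above.

```python
import csv
import io
from collections import defaultdict


def text_by_source_file(csv_text):
    """Concatenate extracted_content per source_file, keeping row order."""
    chunks = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        content = (row.get("extracted_content") or "").strip()
        if content:  # skip rows where nothing was extracted
            chunks[row["source_file"]].append(content)
    # Join the pieces of each document with a blank line between them.
    return {name: "\n\n".join(parts) for name, parts in chunks.items()}


# Hypothetical CSV export of the output dataset (made-up sample rows).
sample_export = (
    "source_file,content_id,extracted_content\n"
    "report.pdf,1,Quarterly results\n"
    "report.pdf,2,Revenue grew.\n"
    "notes.txt,1,Meeting notes\n"
)

documents = text_by_source_file(sample_export)
for name, text in documents.items():
    print(name, len(text))
```

Grouping on source_file keeps the example independent of how content_id values are formatted, which may differ from the placeholder values used here.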

You can now use the output dataset in downstream steps of the Flow. This enables you to build a clean end-to-end pipeline from raw documents to actionable insights.
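As an illustration of one possible downstream step, the sketch below computes a per-document summary and flags files for which no text was extracted. It is a hedged example rather than a Dataiku feature: it works on plain dictionaries, the function name `summarize_extraction` is hypothetical, and the sample rows are made up. Only the column names source_file and extracted_content are taken from the output dataset described above.

```python
from collections import Counter


def summarize_extraction(rows):
    """Return row count, character count, and an 'empty' flag per source file."""
    row_counts = Counter()
    char_counts = Counter()
    for row in rows:
        name = row["source_file"]
        text = (row.get("extracted_content") or "").strip()
        row_counts[name] += 1
        char_counts[name] += len(text)
    return {
        name: {
            "rows": row_counts[name],
            "chars": char_counts[name],
            "empty": char_counts[name] == 0,  # nothing usable was extracted
        }
        for name in row_counts
    }


# Made-up rows standing in for the output dataset.
sample_rows = [
    {"source_file": "slides.pptx", "content_id": "a", "extracted_content": "Agenda"},
    {"source_file": "slides.pptx", "content_id": "b", "extracted_content": "Roadmap"},
    {"source_file": "scan.png", "content_id": "c", "extracted_content": ""},
]

summary = summarize_extraction(sample_rows)
needs_review = [name for name, stats in summary.items() if stats["empty"]]
print(needs_review)
```

Files that end up in `needs_review` are candidates for a rerun with a different extraction strategy in the Strategy tab.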