Solution | Interactive Document Intelligence for ESG#

Overview#

Business case#

Financial services firms have large document corpora (both digitized documents and native images) that hold valuable insights and trends within unstructured data. Many organizations rely on individuals to read sections of these documents or to search for relevant materials in an ad hoc manner, with no systematic way of categorizing and understanding the information and trends.

This Solution automatically consolidates unstructured document data into a unified, searchable, and automatically categorized database, with insights accessible via a powerful, easy-to-use dashboard.

The project accepts any document set as input. It sends each document through a modular, reusable pipeline to automatically digitize documents, extract text, and consolidate data. It applies multiple NLP techniques to this data based on themes of interest (in this project: ESG), with additional theme modules available. If you're interested, Dataiku can offer roll-out and customization services on demand.

Installation#

The process to install this Solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data requirements.

Technical requirements#

To leverage this Solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+ instance.

  • A Python 3.6 code environment named solution_document-intelligence with the following required packages:

    PyMuPDF==1.18.19
    regex==2022.10.31
    pyldavis==3.2.2
    dash==2.7.0
    nltk==3.6.7
    torch==1.10.2
    transformers==4.16.2
    weasyprint==54.3
    seaborn==0.11.2
    scikit-learn==0.24.2
    wordcloud==1.8.2.2
    tokenizers==0.10.3
    
  • When building the Python code env, you must run an additional script to provide code environment resources (a sketch showing how a recipe can reuse the cached FinBERT model follows this list):

    from dataiku.code_env_resources import clear_all_env_vars
    from dataiku.code_env_resources import set_env_path
    from dataiku.code_env_resources import set_env_var
    import os
    
    # Clears all environment variables defined by previously run script
    clear_all_env_vars()
    
    ## Hugging Face
    # Set HuggingFace cache directory
    set_env_path("HF_HOME", "huggingface")
    
    import transformers
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    # Download the FinBERT model and tokenizer into the Hugging Face cache
    # (skipped if they are already present under HF_HOME)
    tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
    
    ## NLTK
    # Set NLTK data directory
    set_env_path("NLTK_DATA", "nltk_data")
    
    # Import NLTK
    import nltk
    
    # Download model: automatically managed by NLTK, doesn't download
    # anything if model is already in NLTK_DATA.
    nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])
    
  • Tesseract must be installed on the server or machine running Dataiku.
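
The resources script above pre-downloads the ProsusAI/finbert weights into the code environment's Hugging Face cache. As a minimal sketch (not part of the Solution's Flow), a Python recipe running in the solution_document-intelligence environment could then reuse the cached model to score a piece of text; the sample sentence below is purely illustrative:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    # Loads from the cache populated by the code env resources script
    tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
    model.eval()
    
    text = "The company reduced its carbon emissions by 20% year over year."  # illustrative window
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    
    # Map probabilities to the model's own label names (positive/negative/neutral)
    print({model.config.id2label[i]: round(p, 3) for i, p in enumerate(probs.tolist())})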

Data requirements#

Note

This Solution uses SEC data pulled via the EDGAR API but isn't endorsed or certified by the SEC. By using this Solution, you agree to abide by the terms set forth by this data source.

The project initially includes SEC data from EDGAR to demonstrate functionality. You can replace or supplement it with your document source of choice by changing the Flow. Dataiku can offer roll-out and customization services on demand.
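
For reference, the snippet below is a hedged, standalone sketch of pulling filing metadata from the SEC EDGAR submissions API with the requests library; the CIK, User-Agent value, and form filter are illustrative assumptions and may differ from what the SEC Data Pull Flow zone actually does:

    import requests
    
    # The SEC asks for a descriptive User-Agent identifying the requester
    headers = {"User-Agent": "Your Name your.email@example.com"}
    cik = "0000320193"  # illustrative CIK (Apple Inc.); must be zero-padded to 10 digits
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    recent = response.json()["filings"]["recent"]
    
    # List recent 10-K filings and their accession numbers
    for form, accession in zip(recent["form"], recent["accessionNumber"]):
        if form == "10-K":
            print(form, accession)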

Workflow overview#

You can follow along with the sample project in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The Solution has the following high level steps:

  1. Pull public data and process documents.

  2. Extract windows from documents and apply sentiment analysis.

  3. Unveil abstract “topics” occurring in the document corpus.

  4. Enable business users with interactive mode.

  5. Conduct real-world analysis with demonstration mode.

Walkthrough#

Note

In addition to reading this document, it's recommended to read the project wiki before beginning, to gain a deeper technical understanding of how this Solution was created and for more detailed explanations of Solution-specific vocabulary.

Digitize, analyze and consolidate thousands of documents#

To tailor the project to your own corpus of documents, you should use the Project Setup feature of Dataiku. You can find it by navigating to the project’s homepage and clicking the Project Setup button. Here you can upload your corpus of documents, document metadata, and keyword list before running the full Flow and updating the dashboard.

Dataiku screenshot of the project setup interface for uploading your own data.

Four Flow zones comprise the underlying Flow upon which the interactive dashboards are built. You can customize the full Flow and its individual Flow zones based on your specific needs. The project wiki explains how to change the Flow and the impact of possible changes. Short, illustrative code sketches for several of these zones follow the list below.

  • SEC Data Pull: Used to get sample documents from the EDGAR database to run through the Flow. If you have your own corpus of documents to process, you can substitute it for the SEC data via the Project Setup. The output of this zone, example_documents, is passed along to the next Flow zone for analysis.

  • document pre-processing: The processed folder of documents is classified based on whether each document is digital, a native image, or unable to be processed. Text is extracted from digital documents using PyMuPDF, while the Tesseract Plugin performs optical character recognition (OCR) on native images. You can change this part of the Flow to use your preferred OCR tool instead.

  • window extraction and sentiment analysis: A unified data repository, documents_processed_join, is passed from the previous Flow zone to this one. Here you can search through the documents to extract windows based on a keyword list or a category list before applying FinBERT to analyze the sentiment of each window. This project has been designed with keywords pertaining to environmental, social, and governance themes.

  • topic modeling: Unveils abstract topics that occur in the document corpus. This particular Flow zone has been built to run LDA topic modeling. View this final Flow zone as an example of an NLP downstream analysis that can be performed on a document database.
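
A minimal sketch of the kind of text extraction the document pre-processing zone performs on digital documents with PyMuPDF (imported as fitz); the file name is a placeholder, and native-image documents would instead go through Tesseract OCR:

    import fitz  # PyMuPDF
    
    doc = fitz.open("example_filing.pdf")  # hypothetical local path
    pages = [page.get_text("text") for page in doc]  # plain-text extraction, one string per page
    doc.close()
    
    full_text = "\n".join(pages)
    print(f"Extracted {len(pages)} pages, {len(full_text)} characters")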
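The window extraction step can be pictured as slicing fixed-size snippets of text around keyword hits, which are then scored with FinBERT as in the sketch under Technical requirements. The keyword list and window size below are illustrative assumptions, not the Solution's exact logic:

    import re
    
    ESG_KEYWORDS = ["emissions", "diversity", "governance"]  # illustrative category keywords
    
    def extract_windows(text, keywords, window_chars=300):
        """Return snippets of +/- window_chars characters around each keyword hit."""
        windows = []
        for keyword in keywords:
            for match in re.finditer(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE):
                start = max(0, match.start() - window_chars)
                end = min(len(text), match.end() + window_chars)
                windows.append({"keyword": keyword, "window": text[start:end]})
        return windows
    
    sample = "Our governance framework strengthened board diversity while cutting emissions."
    for w in extract_windows(sample, ESG_KEYWORDS, window_chars=40):
        print(w["keyword"], "->", w["window"])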
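As a sketch of the LDA topic modeling the last zone runs, the snippet below uses scikit-learn (pinned in the code environment); the tiny corpus, topic count, and vectorizer settings are placeholders:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    
    corpus = [
        "carbon emissions reduction targets and renewable energy",
        "board diversity and employee wellbeing programs",
        "audit committee independence and executive compensation",
    ]
    
    # Build a document-term matrix, then fit a two-topic LDA model
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
    
    # Print the top words for each discovered topic
    terms = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn
    for idx, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"Topic {idx}: {', '.join(top)}")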

Interactive search and insight generation for business users#

Business users can interactively consume the NLP module results with the pre-built Interactive Document Intelligence Dashboards, which consist of three main features.

Users interested in analyzing high-level trends and aggregated insights will benefit from the Interactive Dashboard: FinBERT Sentiment Analysis tab. Here, you can search for a company name and select multiple categories from the initial category list; the provided webapp then shows the resulting sentiment analysis of each document in the corpus.

You can also drill down further into subcategories to view the extracted windows (on which sentiment analysis was applied) and their associated sentiment scores. The Solution also provides a document viewer in this tab so that you can refer back to the originating document when viewing sentiment scores.

Dataiku screenshot of a sample side-by-side comparison of a digitized document and its Social sentiment score

The Time Series Frequency Analysis tab delivers several charts with which you can track the frequency of keywords and sentiment over time. The charts include all companies, so you can compare them to one another by adjusting the filters.

Track keywords and sentiment over time throughout the entire history of your document corpus.

Lastly, the Topic Modeling tab makes several important visualizations available. The topic modeling Flow zone generates a word cloud for each topic found by the LDA model. Each cloud visualizes the most common words for its topic. Below the word clouds, you can interactively visualize the top 30 words found per topic and conduct exploratory analysis in an easy-to-understand visual manner.

Visualize the most common words per topic to identify additional relevant keywords in your document corpus.
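
As a small illustrative sketch (not the Solution's webapp code), a per-topic word cloud can be rendered with the wordcloud package pinned in the code environment; the word weights below are placeholders:

    from wordcloud import WordCloud
    
    # Placeholder topic-word weights, e.g. taken from an LDA topic's top terms
    topic_word_weights = {"emissions": 0.9, "carbon": 0.7, "renewable": 0.5, "energy": 0.4}
    
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(topic_word_weights)
    cloud.to_file("topic_0_wordcloud.png")  # hypothetical output file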

Reproducing these processes with minimal effort for your data#

The intent of this project is to enable financial experts to understand how they can use Dataiku to leverage ESG insights in decision making. By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, you can design smarter and more holistic strategies to track ESG trends over time, inform ESG-integrated judgment, and scan through large corpuses of documents for ESG insights.

This documentation has provided several suggestions on how to derive value from this Solution. Ultimately, however, the best approach will depend on your specific needs and data. If you're interested in adapting this project to the specific goals and needs of your organization, Dataiku offers roll-out and customization services on demand.