Solution | Interactive Document Intelligence for ESG#

Overview#

Business Case#

Financial services firms hold large document corpora (both digitized documents and native images), and the unstructured data in them contains valuable insights and trends. Many organizations rely on individuals to read sections of these documents or to search for relevant material in an ad hoc manner, with no systematic way of categorizing and understanding the information and trends.

This solution automatically consolidates unstructured document data into a unified, searchable, and automatically categorized database, with insights accessible via a powerful and easy-to-use dashboard. The project accepts any document set as input and sends each document through a modular, reusable pipeline that digitizes documents, extracts text, and consolidates data. Multiple NLP techniques are then applied to this data based on the theme of interest (in this project, ESG), with additional theme modules available. Roll-out and customization services can be offered on demand.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

This solution is not available on Dataiku Cloud. Although you may try to import the zip file found in the self-managed instructions onto a Cloud instance, Dataiku offers no support in this case.

After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:

  1. On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Interactive Document Intelligence for ESG.

  2. Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+* instance.

  • A Python 3.6 code environment named solution_document-intelligence with the following required packages:

    PyMuPDF==1.18.19
    regex==2022.10.31
    pyldavis==3.2.2
    dash==2.7.0
    nltk==3.6.7
    torch==1.10.2
    transformers==4.16.2
    weasyprint==54.3
    seaborn==0.11.2
    scikit-learn==0.24.2
    wordcloud==1.8.2.2
    tokenizers==0.10.3
    
  • When building the Python code env, an additional script must be run to provide code environment resources:

    from dataiku.code_env_resources import clear_all_env_vars
    from dataiku.code_env_resources import set_env_path
    from dataiku.code_env_resources import set_env_var
    import os
    
    # Clears all environment variables defined by previously run script
    clear_all_env_vars()
    
    ## Hugging Face
    # Set the Hugging Face cache directory inside the code env resources folder
    set_env_path("HF_HOME", "huggingface")
    
    # Import transformers after HF_HOME has been set
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    # Download the FinBERT tokenizer and model into the cache directory
    tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
    
    ## NLTK
    # Set the NLTK data directory inside the code env resources folder
    set_env_path("NLTK_DATA", "nltk_data")
    
    import nltk
    
    # Download the punkt tokenizer: automatically managed by NLTK, does not
    # download anything if it is already in NLTK_DATA.
    nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])
    
  • Tesseract must be installed on the server or machine running Dataiku.
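Before running the Flow, it can help to confirm that the Tesseract binary is visible to the machine running Dataiku. The following is a minimal sketch, assuming the executable is named `tesseract` and is on the `PATH`; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of the Solution.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(path: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Some releases print the version banner to stderr rather than stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it before running the Flow.")
    else:
        print("Found {}: {}".format(path, tesseract_version(path)))
```

Run the script as the same system user that runs Dataiku, since the `PATH` can differ between users.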

Data Requirements#

Note

This solution uses SEC data pulled via the EDGAR API but is not endorsed or certified by the SEC. By using this solution, you agree to abide by the terms set forth by this data source.

The project initially includes SEC data from EDGAR to demonstrate functionality. This can be replaced or supplemented by the document source of choice by changing the Flow. Roll-out and customization services can be offered on demand.
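As a rough illustration of what pulling filings metadata from EDGAR can look like, the sketch below builds the URL of the SEC's public submissions endpoint for a company and fetches it with the standard library. This is an assumption for illustration only: the Solution's own recipe may call a different endpoint, the helper names are hypothetical, and the `user_agent` value must be replaced with your own contact string, since the SEC asks automated clients to identify themselves.

```python
import json
import urllib.request

# Public EDGAR endpoint that returns a company's filing history as JSON.
SEC_SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik}.json"


def submissions_url(cik: int) -> str:
    """Build the submissions URL; EDGAR expects the CIK zero-padded to 10 digits."""
    if cik < 0 or cik > 9999999999:
        raise ValueError("CIK must be a non-negative number of at most 10 digits")
    return SEC_SUBMISSIONS_URL.format(cik="{:010d}".format(cik))


def fetch_submissions(cik: int, user_agent: str) -> dict:
    """Download and parse the submissions JSON for one company (network call)."""
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `submissions_url(320193)` produces the URL for CIK 320193; `fetch_submissions` is only needed when you actually want to download the data.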

Workflow Overview#

You can follow along with the sample project in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The solution has the following high-level steps:

  1. Pull public data and process documents.

  2. Extract windows from documents and apply sentiment analysis.

  3. Unveil abstract “topics” occurring in the document corpus.

  4. Enable business users with Interactive Mode.

  5. Conduct real-world analysis with Demonstration Mode.

Walkthrough#

Note

In addition to reading this document, we recommend reading the project wiki before beginning, to gain a deeper technical understanding of how this Solution was created and to find more detailed explanations of Solution-specific vocabulary.

Digitize, analyze and consolidate thousands of documents#

To tailor the project to our own corpus of documents, we can use the Project Setup feature of Dataiku, which can be found by navigating to the project’s homepage and clicking the PROJECT SETUP button. Here we can upload our corpus of documents, document metadata, and keyword list before running the full Flow and updating the dashboard.

Dataiku screenshot of the project setup interface for uploading our own data

The underlying Flow on which the interactive dashboards are built comprises four Flow zones. The full Flow and its individual Flow zones can be customized based on the specific needs of the user. The project wiki explains how to change the Flow and the impact of possible changes.

Each Flow zone is described below.

  • SEC Data Pull: Gets sample documents from the EDGAR database to run through the Flow. If you have your own corpus of documents to process, it can be substituted for the SEC data via Project Setup. The output of this zone, example_documents, is passed along to the next Flow zone for analysis.

  • document pre-processing: Each document in the processed folder is classified based on whether it is digital, a native image, or unable to be processed. Text is extracted from digital documents using PyMuPDF, while the Tesseract Plugin is used to perform optical character recognition (OCR) on native images. This part of the Flow can be changed to use your preferred OCR tool instead.

  • window extraction and sentiment analysis: A unified data repository, documents_processed_join, is passed from the previous Flow zone to this one. Here we search through each document to extract windows based on a keyword list or a category list, and then apply FinBERT to analyze the sentiment of each window. This project has been designed with keywords pertaining to Environmental, Social, and Governance themes.

  • topic modeling: Unveils abstract topics that occur in the document corpus. This particular Flow has been built to run LDA topic modeling. Users of this solution should view this final Flow zone as an example of a downstream NLP analysis that can be performed on a document database.
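The document pre-processing and window extraction steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the Solution's implementation: the function names, the `MIN_CHARS_PER_PAGE` threshold, and the labels are hypothetical, and a simple regular expression stands in for a proper sentence tokenizer such as NLTK's punkt.

```python
import re
from typing import Dict, List

MIN_CHARS_PER_PAGE = 50  # hypothetical threshold; tune it for your corpus


def classify_document(page_texts: List[str]) -> str:
    """Label a document from the text layer extracted from each of its pages."""
    if not page_texts:
        return "unprocessable"  # no pages could be read at all
    extracted = sum(len(text.strip()) for text in page_texts)
    if extracted / len(page_texts) >= MIN_CHARS_PER_PAGE:
        return "digital"  # enough embedded text to use directly
    return "native_image"  # little or no text layer: route to OCR


def extract_windows(text: str, keywords: List[str], width: int = 1) -> List[Dict[str, str]]:
    """For each keyword hit, return the matching sentence plus `width` sentences on each side."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    windows = []
    for index, sentence in enumerate(sentences):
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                start = max(0, index - width)
                end = min(len(sentences), index + width + 1)
                windows.append({"keyword": keyword, "window": " ".join(sentences[start:end])})
    return windows
```

Each extracted window is a short passage centered on a keyword, which is the kind of input a sentence-level sentiment model such as FinBERT is suited to score.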

Interactive search and insight generation for Business users#

Business users can easily and interactively consume the NLP module results with the pre-built Interactive Document Intelligence Dashboards, which consist of three main features.

Users interested in analyzing high-level trends and aggregated insights will benefit from the Interactive Dashboard: FinBERT Sentiment Analysis tab of the dashboard. Here, using the provided webapp, we can search for a company name and select multiple categories from the initial category list to see the resulting sentiment analysis of each document in the corpus. From here we can drill down into subcategories to view the extracted windows (on which sentiment analysis was applied) and their associated sentiment scores. This tab also provides a document viewer so that we can easily refer back to the originating document when viewing sentiment scores.

Dataiku screenshot of a sample side-by-side comparison of a digitized document and its Social sentiment score
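To make the idea of a per-category sentiment score more concrete, the sketch below rolls window-level class probabilities up to one score per category. It rests on assumptions that the source does not confirm: that each window carries probabilities for the labels `positive`, `negative`, and `neutral`, that a window's score is the positive probability minus the negative probability, and that categories are averaged. The function names and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def window_score(probabilities: Dict[str, float]) -> float:
    """Assumed scoring rule: P(positive) - P(negative), ranging from -1 to 1."""
    return probabilities["positive"] - probabilities["negative"]


def category_scores(windows: List[dict]) -> Dict[str, float]:
    """Average the window scores within each category."""
    grouped = defaultdict(list)
    for window in windows:
        grouped[window["category"]].append(window_score(window["probabilities"]))
    return {category: mean(scores) for category, scores in grouped.items()}


if __name__ == "__main__":
    sample = [
        {"category": "Social", "probabilities": {"positive": 0.7, "negative": 0.1, "neutral": 0.2}},
        {"category": "Social", "probabilities": {"positive": 0.2, "negative": 0.6, "neutral": 0.2}},
    ]
    print(category_scores(sample))
```

Other aggregation rules (for example, weighting windows by length) are equally plausible; the point is only that window-level outputs can be summarized per document and category.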

The Time Series Frequency Analysis tab delivers several charts with which we can track the frequency of keywords and sentiment over time. The charts include all companies, so companies can be compared with one another by adjusting the filters.

Track keywords and sentiment over time throughout the entire history of your document corpus.
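The keyword frequency tracking described above amounts to counting keyword occurrences per time period. Here is a minimal sketch, assuming each document is available as a `(year, text)` pair and that matching is case-insensitive on whole words; the function name and input layout are hypothetical rather than taken from the Solution.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def keyword_frequency_by_year(
    documents: Iterable[Tuple[int, str]], keywords: List[str]
) -> Dict[Tuple[int, str], int]:
    """Count whole-word, case-insensitive keyword occurrences per (year, keyword)."""
    counts = Counter()
    for year, text in documents:
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            counts[(year, keyword)] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return dict(counts)


if __name__ == "__main__":
    docs = [
        (2020, "Emissions rose. Emissions targets were missed."),
        (2021, "Diversity improved; emissions fell."),
    ]
    for key, value in sorted(keyword_frequency_by_year(docs, ["emissions", "diversity"]).items()):
        print(key, value)
```

The resulting counts can be plotted per company and year to produce the kind of time series chart shown in this tab.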

Lastly, several important visualizations are made available in the Topic Modeling tab. From the topic modeling Flow zone, a word cloud is generated for each topic found by the LDA model. With this cloud we can visualize the most common words per topic. Below the word clouds we can interactively visualize the top 30 words found per topic and conduct exploratory analysis in an easy-to-understand visual manner.

Visualize the most common words per topic to identify additional relevant keywords in your document corpus that can be used in future sentiment analysis
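Selecting the top words per topic, as visualized in this tab, reduces to ranking the vocabulary by each topic's word weights. The sketch below assumes the topic model exposes one row of weights per topic aligned with a vocabulary list (for example, the `components_` attribute of scikit-learn's `LatentDirichletAllocation`); the function name is hypothetical and the Solution may compute this differently.

```python
from typing import List, Sequence


def top_words_per_topic(
    topic_word_weights: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    n: int = 30,
) -> List[List[str]]:
    """Return the n highest-weighted vocabulary words for each topic."""
    top_words = []
    for weights in topic_word_weights:
        if len(weights) != len(vocabulary):
            raise ValueError("Each topic needs exactly one weight per vocabulary word")
        # Rank word indices by descending weight, then keep the first n.
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        top_words.append([vocabulary[i] for i in ranked[:n]])
    return top_words


if __name__ == "__main__":
    vocab = ["carbon", "board", "audit", "emissions"]
    weights = [[0.5, 0.1, 0.05, 0.35], [0.05, 0.5, 0.4, 0.05]]
    print(top_words_per_topic(weights, vocab, n=2))
```

Words that rank highly for a topic but are missing from the keyword list are natural candidates to add before the next sentiment analysis run.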

Reproducing these Processes With Minimal Effort For Your Data#

The intent of this project is to enable financial experts to understand how Dataiku can be used to leverage ESG insights in decision making. A single solution that benefits and informs the decisions of a variety of teams within an organization makes it possible to design smarter and more holistic strategies to track ESG trends over time, inform ESG-integrated judgment, and scan large document corpora for ESG insights.

We’ve provided several suggestions on how to use public SEC data, but ultimately the best approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.