Solution | News Sentiment Stock Alert System#

Overview#

Business Case#

Traders, equity analysts, and portfolio managers must leverage an ever-growing stream of information to fuel their company analysis. It is vital to know which stocks are most likely to move based on current news sentiment, which news events are driving volatility for a specific ticker, and what historical insights can be gained through systematic analysis of past news events.

Automatic anomaly detection removes the need for costly or small-scale labeled datasets, avoids manual review that is unfocused and inefficient, and complements purely automatic trading responses to news sentiment, which may miss important opportunities.

An easy-to-use interface allows for immediate insights, rapid drill-down, and deeper analysis of trends, all with a few clicks. The flexible design allows for enhancement or customization to meet a team's or firm's specific needs.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+ instance.

  • A Python 3.8 code environment named solution_stock-alert-system with the following required packages:

scikit-learn>=1.0,<1.1
dash==2.7
dash_bootstrap_components==1.2.1
tzlocal==4.2
plotly==5.13.0

Data Requirements#

The Dataiku Flow is built using publicly available data on stock prices and the news. The solution itself doesn’t contain any direct connection to an external data source. Rather, input datasets should be separately retrieved and linked to the data sources necessary for the project to work.

  • tickers_information — Contains one row per ticker with information about the sector of the stock.

  • stock_prices_all — Contains historical stock prices for all tickers in tickers_information. One row corresponds to one day for one ticker.

  • news_data — Historical news with the tickers labeled. Each row contains an individual news item. The time period of this dataset should match stock_prices_all.

  • news_today — Contains the latest news, in the same format as the historical news.
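As a rough sketch of what these inputs look like before connecting your own sources, the snippet below builds toy versions of the datasets with pandas. The column names (`ticker`, `sector`, `date`, `close`, `published_at`, `headline`) are illustrative assumptions, not the solution's actual schema; only the row granularities described above are taken from the documentation.

```python
import pandas as pd

# Illustrative only: column names are assumptions; row granularities
# follow the dataset descriptions above.
tickers_information = pd.DataFrame({
    "ticker": ["AAPL", "JPM"],
    "sector": ["Information Technology", "Financials"],
})

stock_prices_all = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "JPM", "JPM"],
    "date": pd.to_datetime(["2022-01-03", "2022-01-04"] * 2),
    "close": [182.0, 179.7, 162.0, 165.4],
})

news_data = pd.DataFrame({
    "ticker": ["AAPL"],
    "published_at": pd.to_datetime(["2022-01-03 14:30"]),
    "headline": ["Example headline about AAPL"],
})

# One row per ticker; one row per ticker per day
assert tickers_information["ticker"].is_unique
assert not stock_prices_all.duplicated(["ticker", "date"]).any()
```

Validating these granularity constraints before rebuilding the Flow helps catch malformed inputs early.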

Workflow Overview#

You can follow along with the sample project in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow Zones.

The project has the following high-level steps:

  1. Input the list of tickers and news source on which the analysis will be done via Project Setup.

  2. Retrieve stock prices and news into partitioned datasets.

  3. Analyze the stock prices to detect anomalies.

  4. Train a model to predict stock price anomalies using the news.

  5. Score real time data to produce risk scores and impact rankings.

  6. Visualize data using a pre-built Webapp and Dashboard insights.

Walkthrough#

Note

In addition to reading this document, we recommend reading the project wiki before beginning, to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Tailor the Alert System to your interests#

By default, the project contains tickers for the S&P 500 stocks and news data from 2022. This default data can be overridden via the built-in Project Setup, accessible from the homepage of the project. The Project Setup enables users to connect their own input data and rebuild the entire Flow with this new data. Real Time Risk Scoring can also be manually run from the Project Setup interface. Changing the connected datasets will impact the Input Data Flow zone.

Dataiku screenshot of the accompanying Project Setup for this solution

Detect Anomalies in Stock Prices#

An anomaly is defined as a move that is peculiar relative to the historical moves of a stock. The Anomaly Detection section of the wiki details what we consider peculiar behavior for stocks in this solution, how we cluster stocks, and the anomaly detection algorithm used.

Dataiku screenshot of the Flow zones involved in detecting anomalies in stock prices.

The analysis to detect anomalies comprises four parts:

  • Data Preparation — Processes the stock prices in order to compute the log returns.

  • PCA Construction — Includes a Python recipe that takes the log returns from the previous zone, computes the covariance matrix, and then runs a Principal Component Analysis (PCA). The recipe outputs the coordinates of each stock on the first four principal components.

  • Stock Clustering — Takes the PCA coordinates and clusters the stocks with a K-Means algorithm. The algorithm and cluster count (8) were chosen for their simplicity; other algorithms could be tried for more in-depth cluster analysis.

  • Anomaly Detection — Partitions the initial log-return dataset using the clusters output from the Stock Clustering zone so that anomaly detection runs on each partition independently. Anomaly detection is based on Mahalanobis distance computations run within a Python recipe, with anomalies labeled against a predefined threshold.
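The four steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the solution's actual recipes: the number of days and stocks, the anomaly threshold, and the per-day Mahalanobis computation are all assumptions; only the structural choices (first four principal components, K-Means with 8 clusters, per-cluster Mahalanobis distances against a threshold) come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic prices: 250 days x 40 stocks (sizes are illustrative)
rng = np.random.default_rng(0)
n_days, n_stocks = 250, 40
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (n_days, n_stocks)), axis=0)),
    columns=[f"S{i}" for i in range(n_stocks)],
)

# 1. Data Preparation: daily log returns per stock
log_returns = np.log(prices).diff().dropna()

# 2. PCA Construction: coordinates of each stock (as an observation)
#    on the first four principal components
pca = PCA(n_components=4)
coords = pca.fit_transform(log_returns.T.values)  # shape (n_stocks, 4)

# 3. Stock Clustering: K-Means with 8 clusters on the PCA coordinates
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(coords)

# 4. Anomaly Detection: Mahalanobis distance of each day's return
#    vector, computed independently within each cluster of stocks
def mahalanobis_per_day(returns: pd.DataFrame) -> np.ndarray:
    mu = returns.mean().values
    cov = np.cov(returns.values, rowvar=False)
    inv = np.linalg.pinv(cov)
    diff = returns.values - mu
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(np.einsum("ij,jk,ik->i", diff, inv, diff), 0))

threshold = 3.0  # illustrative cutoff, not the solution's tuned value
for c in np.unique(clusters):
    members = log_returns.columns[clusters == c]
    if len(members) < 2:
        continue  # need at least two stocks for a covariance matrix
    d = mahalanobis_per_day(log_returns[members])
    anomalous_days = log_returns.index[d > threshold]
```

In the actual Flow, step 4 runs on partitioned datasets, one partition per cluster, rather than in a single loop.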

Train predictive models and score real-time data#

The processed news data and cleaned stock pricing data are joined and further cleaned in the Cross Data Analysis zone. The combined dataset is then used to train a logistic regression model to detect anomalies. The final model is used to score real-time data within the Real Time Alert Flow zone to produce a risk score for each stock today. Additionally, individual news events are ranked by the impact they have on the movement of related stocks. Past data is also scored within the Visualization zone so that users can investigate past news events with large impacts on stock movements within the webapp interface.
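The train-and-score step can be sketched as follows. The features here (aggregated daily sentiment and article count per ticker) and the synthetic labels are illustrative assumptions; only the choice of a logistic regression producing a per-stock risk score comes from the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic training data: hypothetical news-derived features
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(0, 1, n),   # mean daily news sentiment for the ticker
    rng.poisson(3, n),     # number of news articles that day
])
# Synthetic anomaly labels drawn from a logistic generative model
p = 1 / (1 + np.exp(-(1.5 * X[:, 0] + 0.3 * X[:, 1] - 2)))
y = (rng.random(n) < p).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Risk score for today's (hypothetical) feature vector: the predicted
# probability that the stock's move is anomalous
today = np.array([[1.8, 7]])
risk_score = model.predict_proba(today)[0, 1]
```

Because logistic regression outputs a probability, the risk score is directly comparable across stocks, which is what makes the per-stock ranking in the Real Time Alert zone possible.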

Dataiku screenshot showing the visual analysis of the logistic regression model used to detect anomalies.

Two scenarios have been created to automate the Flow and keep it up to date with real-time data. The Overnight Batch scenario adds the previous day’s data and updates the models. Real Time Risk Scoring retrieves the most recent news, then processes and scores it to feed real-time investigation of stocks from the WebApp. These scenarios can additionally be configured to send reports.

Investigate the impact of news on stock prices#

The solution dashboard allows users to consume the results of the analysis. The WebApp, contained in the first page of the dashboard, is made up of four tabs:

  • Real Time News Scoring — Gives, in real time, the volatility score per stock and allows users to browse through the news of the day. Each row of the first table of stocks is selectable, filtering a second table to the news articles that impact that stock. The whole view resets at midnight UTC and re-populates throughout the day.

  • Case Study — Makes it possible to navigate through past anomalies detected by the algorithm and visualize the price evolution and the news around each anomaly. Once again, the first table consists of selectable rows that update a graph of the stock’s price around the anomalous event and a table showing news leading up to and following the event.

  • Historical Prices Anomaly Detection — Presents a visualization of the historical prices of a given stock over an adjustable time frame.

  • Historical News Scoring — Enables users to browse through the full news dataset that has been processed in the project for each stock.

Dataiku screenshot of the final project WebApp that can be used to observe changes in stock prices and news events.

Dataiku screenshot of one of the insights available via the project's prebuilt dashboard.

The dashboard contains four additional tabs (Alerts, Model, Anomalies, and Clusters) that allow users to visualize the real-time view of the scores by stock, a report on the News Scoring Model, insights into the anomalies detected, and a visual cluster analysis.

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to enable traders, equity analysts, and portfolio managers to understand how Dataiku can be used to leverage an ever-growing stream of information: to know which stocks are most likely to move based on current news sentiment, which news events are driving volatility for a specific ticker, and what historical insights can be gained through systematic analysis of past news events.

By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, smarter and more holistic strategies can be designed in order to reduce costs, avoid unfocused manual review, and work alongside automatic trading responses.

We’ve provided several suggestions on how public stock and news data can be used to detect news-sentiment-driven stock anomalies, but ultimately the “best” approach will depend on your specific needs and your data.

If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.