#7 USAspending.gov Analytics — Time Series Forecasting with Facebook’s Open Source Prophet Package

Gaining an Analytics Edge Using Federal Spending Open Data Series

Context

Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.

Introduction

“…If you can look into the seeds of time, and say which grain will grow and which will not, speak then to me…” Banquo, Act I, Scene III, Macbeth, William Shakespeare

Loading The USASpending.gov Data

The examples that follow use the data collected in the previous blog posts in this series. To get started with downloading and organizing the data, see post #3, Simplified Download of USAspending.gov GFY Archives.

Time Series Analytics — First Pass

Now that we’ve loaded the data, let’s dig into the time-series analytics. The first step is to load the Facebook Prophet packages. If you are setting this up on your local machine, I highly recommend using venv or conda to create a Python virtual environment dedicated to this work and installing the latest required packages there. A dedicated environment avoids conflicts with other packages and lets you use the latest version of Prophet.

  • Execute a convenience function I created called Prepare_df_USAspending_for_Prophet. This function normalizes the data using the numpy natural log function np.log, since the dollar obligation values range from thousands (1e3) to tens of billions (1e10). I later convert those values back to the proper units using np.exp.
  • I construct the Prophet model using Build_Prophet_Model_forecast.
  • yhat: the estimated value expected for that date (yhat, or ŷ, is the symbol statisticians use for a value predicted by a model).
  • yhat_lower: the lower bound of the 95% uncertainty interval. For a normal distribution, that is about 1.96 standard deviations (aka sigma) below the estimate. Prophet’s default interval_width is 0.80, so a 95% interval has to be requested explicitly.
  • yhat_upper: the upper bound of the 95% uncertainty interval, about 1.96 standard deviations above the estimate.
  • You may see a yhat_lower that is less than $0, which may seem odd. There are de-obligations in the data that are negative numbers, and the model accounts for that. Some of those de-obligations correct data entry errors, made within a week or two of the original entry, that may be off by a factor of 10 or 1000.

The graph below plots the forecast vs. actual data from GFY2010 through GFY2022 (using GFY2010–GFY2019 to predict GFY2010–GFY2022). The outliers are visible. I will remove those in the next example. The red lines indicate where the model detected a significant change in the trend. The GFY2015 market bottom and subsequent rise are marked by the rightmost red vertical line, between GFY2015 and GFY2016.
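The preparation and modeling steps described above can be sketched roughly as follows. This is not the code behind Prepare_df_USAspending_for_Prophet or Build_Prophet_Model_forecast. The helpers prepare_for_prophet and build_forecast are hypothetical stand-ins, and the column names action_date and federal_action_obligation are assumptions about the downloaded archive. Dropping days with non-positive totals before np.log is also an assumption, made only to keep the transform defined.

```python
import numpy as np
import pandas as pd


def prepare_for_prophet(df, date_col="action_date",
                        amount_col="federal_action_obligation"):
    """Aggregate transactions to daily totals and log-transform them.

    Column names are assumptions; adjust them to match your download.
    """
    daily = (
        df.assign(ds=pd.to_datetime(df[date_col]))
        .groupby("ds", as_index=False)[amount_col]
        .sum()
        .rename(columns={amount_col: "obligation"})
    )
    # np.log is undefined for zero or negative totals, so this sketch
    # simply drops those days (an assumption, not the author's method).
    daily = daily[daily["obligation"] > 0].copy()
    daily["y"] = np.log(daily["obligation"])
    return daily[["ds", "y"]].reset_index(drop=True)


def build_forecast(train, periods=365, interval_width=0.95):
    """Fit Prophet on a ds/y frame and return the forecast in dollars."""
    # Imported here so prepare_for_prophet works without Prophet installed.
    from prophet import Prophet

    model = Prophet(interval_width=interval_width)
    model.fit(train)
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)
    cols = ["yhat", "yhat_lower", "yhat_upper"]
    forecast[cols] = np.exp(forecast[cols])  # undo the log transform
    return model, forecast[["ds"] + cols]
```

Given a frame of transactions df, calling build_forecast(prepare_for_prophet(df)) would return the fitted model and a forecast whose yhat, yhat_lower, and yhat_upper columns are back in dollars.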

Time Series Analytics — Cleaning Up the Largest Outliers

One problem with using the data as-is is that it contains some large outliers. Some are real, such as multi-year military aircraft sales (US and foreign military sales), but others are data entry errors. We need to remove the erroneous ones to build a model that has general utility for forecasting and time-series predictions.
Unfortunately, the backout transaction for a data entry error does not always match the amount of the error, which makes the pair harder to find when neither dollar value looks like an outlier on its own. Often, the backout is a negative number sized so that the net equals what should have been the correct amount. The backout transactions (negative amounts) usually appear within a week or two of the data entry error.
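One way to look for these error and backout pairs is sketched below. It is an illustration of the pattern described above, not the cleanup used for the charts in this post. The function name, the column names, the $100M size threshold, the 14-day window, and the 0.85 ratio are all assumptions. The ratio reflects that a factor-of-10 error is backed out by roughly 90% of the erroneous amount and a factor-of-1000 error by roughly 99.9%.

```python
import pandas as pd


def flag_error_backout_pairs(tx, date_col="action_date",
                             amount_col="federal_action_obligation",
                             min_amount=1e8, window_days=14, min_ratio=0.85):
    """Mark large obligations followed closely by a near-equal negative entry."""
    tx = tx.copy()
    tx[date_col] = pd.to_datetime(tx[date_col])
    tx["suspect"] = False
    negatives = tx[tx[amount_col] < 0]
    for idx, row in tx[tx[amount_col] >= min_amount].iterrows():
        # Days from the large entry to each remaining negative entry.
        days_later = (negatives[date_col] - row[date_col]).dt.days
        # Size of each negative entry relative to the large entry.
        ratio = negatives[amount_col].abs() / row[amount_col]
        match = negatives[days_later.between(0, window_days)
                          & ratio.between(min_ratio, 1.0)]
        if not match.empty:
            tx.loc[[idx, match.index[0]], "suspect"] = True
            # Each backout is allowed to explain only one error.
            negatives = negatives.drop(match.index[0])
    return tx
```

Rows with suspect set to True could then be reviewed or excluded before fitting. In real data you would likely also require both rows to share an award identifier, which this sketch leaves out.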

Time Series Analytics — Assessing Model Quality

Prophet includes some useful diagnostic tools to assess the quality of a model. One of the most popular tools for time-series model assessment is cross-validation. Prophet makes that simple to do. Below is the code that uses the Dask package.
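As a rough sketch of what that looks like: Prophet's cross-validation fits the model repeatedly on history up to a series of cutoff dates and forecasts a fixed horizon past each one. The rolling_cutoffs helper below is a simplified illustration of how such cutoffs are spaced, not Prophet's internal code. The cross_validate wrapper uses Prophet's cross_validation and performance_metrics with parallel="dask"; the wrapper name and the initial, period, and horizon values are assumptions to adapt to your data.

```python
import pandas as pd


def rolling_cutoffs(dates, initial, period, horizon):
    """Simplified rolling-origin cutoffs, spaced `period` apart."""
    dates = pd.to_datetime(pd.Series(dates))
    initial, period, horizon = (pd.Timedelta(x)
                                for x in (initial, period, horizon))
    cutoff = dates.max() - horizon  # last cutoff leaves one full horizon
    cutoffs = []
    while cutoff >= dates.min() + initial:  # keep enough training history
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)


def cross_validate(model, initial="1825 days", period="180 days",
                   horizon="365 days"):
    """Run Prophet cross-validation on a fitted model using a local Dask cluster."""
    # Imported here so rolling_cutoffs works without Prophet or Dask installed.
    from dask.distributed import Client
    from prophet.diagnostics import cross_validation, performance_metrics

    client = Client()  # parallel="dask" expects an active Dask client
    try:
        df_cv = cross_validation(model, initial=initial, period=period,
                                 horizon=horizon, parallel="dask")
    finally:
        client.close()
    return df_cv, performance_metrics(df_cv)
```

The metrics frame returned by performance_metrics reports error measures by forecast horizon, which is a quick way to see how fast accuracy degrades the further out you predict.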

Time Series Analytics — Focus on a Federal Department (DHS)

We can now use the same tools and process we used above to look at any subset of the data. We can slice by Federal product or service code, geography, Department, Agency, contracting office, set-aside code, etc. In this case, I apply the same techniques to the Department of Homeland Security (DHS). Note that I did NOT remove outliers, nor did I use the np.log normalization for this example. You can try it with those steps and the convenience functions I created earlier to compare the model results with and without normalization. Does normalizing the data make a major difference in the predictions and patterns uncovered?
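Selecting a subset can be as simple as filtering on one or more columns before aggregating by day. The sketch below assumes the download has columns named awarding_agency_name, action_date, and federal_action_obligation; daily_obligations is a hypothetical helper, and it keeps raw dollars (no np.log and no outlier removal) to mirror the DHS example.

```python
import pandas as pd


def daily_obligations(df, filters, date_col="action_date",
                      amount_col="federal_action_obligation"):
    """Filter by exact column matches and return a ds/y frame in raw dollars."""
    mask = pd.Series(True, index=df.index)
    for column, value in filters.items():
        mask &= df[column] == value
    subset = df.loc[mask]
    return (
        subset.assign(ds=pd.to_datetime(subset[date_col]))
        .groupby("ds")[amount_col]
        .sum()          # net obligations per day, de-obligations included
        .rename("y")
        .reset_index()
    )
```

For example, daily_obligations(df, {"awarding_agency_name": "Department of Homeland Security"}) would give a frame ready to pass to Prophet. The same helper should work for other slices, such as a product or service code or a recipient name, by changing the filter dictionary.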

[Figures: Daily $ Obligations Trend Line; Yearly Seasonal Model (Daily Trend Multiplier); Daily Model (Daily Trend Multiplier)]

Time Series Analytics — Monte Carlo Forecasting

Once you have the Prophet forecast data, you can feed it into other forecasting tools. Below is an example of creating a simple Monte Carlo simulation to look at the range of outcomes for the DHS budget in GFY2020.
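A minimal sketch of such a simulation is shown here, under loudly stated assumptions: each day's value is drawn from a normal distribution centered on yhat, sigma is backed out of a 95% interval as (yhat_upper - yhat_lower) / (2 × 1.96), and days are treated as independent. Independence will understate the spread of the annual total if daily errors are correlated. The function name is hypothetical, and the toy forecast uses made-up numbers, not DHS data.

```python
import numpy as np
import pandas as pd


def simulate_annual_total(forecast, n_sims=10_000, z=1.96, seed=0):
    """Sum simulated daily values into one annual total per simulation."""
    rng = np.random.default_rng(seed)
    mean = forecast["yhat"].to_numpy()
    # Normal approximation: interval half-width = z * sigma (an assumption).
    sigma = (forecast["yhat_upper"] - forecast["yhat_lower"]).to_numpy() / (2 * z)
    draws = rng.normal(loc=mean, scale=sigma, size=(n_sims, len(mean)))
    return draws.sum(axis=1)


# Toy forecast for GFY2020 (Oct 1, 2019 through Sep 30, 2020); values are invented.
days = pd.date_range("2019-10-01", "2020-09-30", freq="D")
toy = pd.DataFrame({"ds": days, "yhat": 1.0e8})
toy["yhat_lower"] = toy["yhat"] - 1.96 * 2.0e7
toy["yhat_upper"] = toy["yhat"] + 1.96 * 2.0e7

totals = simulate_annual_total(toy)
p5, p50, p95 = np.percentile(totals, [5, 50, 95])
print(f"GFY2020 total: 5th={p5:,.0f}  median={p50:,.0f}  95th={p95:,.0f}")
```

With a real Prophet forecast, you would pass the rows for the fiscal year of interest instead of the toy frame. Prophet's own predictive_samples method is another source of simulated paths that does not rely on the normal approximation.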

Time Series Analytics — Focus on Contractors

In this section, I use the same techniques as above to look at the patterns Prophet discovers for specific contractors. I chose two large contractors to illustrate how the model results change with the subset selected.

Conclusion

With a few lines of Python code, you can construct time-series models to look for trend changes and patterns that may indicate new growth opportunities or spending areas that have likely peaked.

Coming Attractions

I will explore more analytical and market topics in future blog posts — including market-share trend analysis and NLP techniques using USAspending.gov data and mashups with other open data sources.

Strategy, emerging technology, innovation, and management advisor https://www.primehookllc.com/about-us.html, https://www.american.edu/kogod/faculty/ulstrup.cfm
