#7 USAspending.gov Analytics — Time Series Forecasting with Facebook’s Open Source Prophet Package

Leif Ulstrup
15 min read · Oct 24, 2020

Gaining an Analytics Edge Using Federal Spending Open Data Series

Context

Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.

Purpose: Demonstrate to executives how easy it is to access and analyze free, open-data sources and motivate them to expect more insight and context from their team when making crucial go/no-go investment decisions on bids and new offerings.

Scope: US Federal Contracting Market.

TL;DR: Analyze GBs of US Federal spending data with a few lines of Python code. Uncover competitive insights from information and analytics.

Previous Blog Post in this series: #6 USAspending.gov Analytics — Exploring Top Executive Compensation Patterns

Introduction

“…If you can look into the seeds of time, and say which grain will grow and which will not, speak then unto me…” Banquo, Act I, Scene III, Macbeth, William Shakespeare

Given that it’s impossible to predict the future, is there value in studying how the Federal market unfolds over time and building models to forecast results? I think the answer is YES.

Most business leaders are searching for the next opportunity for outsize market growth with a chance to enter the market early or shift investment away from an area likely to decline in revenue and margins. Is it possible to detect opportunity and threat signals in the Federal market earlier than competitors by studying how USAspending.gov data change over time?

“…It is far better to foresee even without certainty than not to foresee at all…” Henri Poincaré

USAspending.gov contractor obligation data are a collection of date-stamped records. Each fiscal year archive file has 4–7 million records of transactions, which averages ~11k to ~20k per day. It is a rich source of insight into how the market has evolved and helps tee up questions about the market’s future that can lead to entering, exiting, or ignoring market segments. Here is a sample of transaction records with only 7 of 280+ fields of information:

One of the advantages of forecasting Federal contractor obligation data is that the spending stays within the envelope of a Congressional budget, and it usually changes slowly at the Agency level and at the product or service level. Unlike many other markets with higher volatility, we have a good chance of bounding our forecasts with confidence.

I’ve chosen the open-source Facebook Prophet tool to uncover time-varying patterns and construct forecasts in the examples that follow. There are many tools for studying and modeling time series data. The most frequently referenced technique is ARIMA (autoregressive integrated moving average). ARIMA requires setting several parameters and tuning the model, which may turn off a typical business analyst looking for a “good enough” solution that is easy to set up and use. Facebook’s Prophet meets that need and is also scalable: it is designed to handle the vast datasets that Facebook must process to forecast user behavior and system demands.

Here are some references to learn more about how Prophet works:

  • The Prophet documentation: https://facebook.github.io/prophet/
  • Sean J. Taylor and Benjamin Letham, “Forecasting at Scale” (the paper behind Prophet): https://peerj.com/preprints/3190/

For the remainder of this post, I will reference outputs from Python code in Jupyter notebooks. The Jupyter notebooks with the Python code for the examples below can be found here on GitHub.

Loading The USASpending.gov Data

The examples that follow use the data collected in the previous blog posts in this series. If you need to download and organize the data, start with #3 Simplified Download of USAspending.gov GFY Archives.

In the step shown below, I check whether a parquet-format file with the extracted data already exists, to speed up reading. If not, I read from the CSV files and then save the result as a parquet file for use when I return to the notebook for additional analysis. This technique speeds up subsequent visits and compresses the storage needed.
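A minimal sketch of that caching step follows; the file paths and the column list are illustrative assumptions, not the notebook’s exact code:

```python
from pathlib import Path
import pandas as pd

PARQUET_FILE = Path("usaspending_gfy2010_2020.parquet")  # hypothetical path
USECOLS = ["action_date", "federal_action_obligation",   # assumed subset of
           "awarding_agency_name", "funding_office_name",  # the 280+ fields
           "recipient_name"]

if PARQUET_FILE.exists():
    # Fast path: the compressed, typed parquet cache from a prior session.
    df = pd.read_parquet(PARQUET_FILE)
else:
    # Slow path: read every GFY archive CSV, then cache as parquet.
    csv_files = sorted(Path("usaspending_archives").glob("*.csv"))
    df = pd.concat(
        (pd.read_csv(f, usecols=USECOLS, parse_dates=["action_date"])
         for f in csv_files),
        ignore_index=True,
    )
    df.to_parquet(PARQUET_FILE)
```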

I confirm that the pandas DataFrame (a spreadsheet-like table structure) has properly loaded the records by using the pandas df.head() function.

I also check the number of records and the storage size. The Government Fiscal Year (GFY) 2010 through GFY 2020 (year to date through early September 2020) data consist of more than 46 million records. I restricted the columns to just those I need for this time-series analysis, so the dataset takes up only 4.4 GB of RAM and is easy and quick to work with in memory.
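Those sanity checks take only a few lines (a sketch):

```python
df.head()                          # eyeball the first five records
print(f"{len(df):,} records")      # expect ~46 million for GFY2010-GFY2020
df.info(memory_usage="deep")       # reports the in-memory size (~4.4 GB)
```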

I also apply a pandas groupby to the data to compute the total obligations for each year to make sure I have the whole dataset, and I do a quick check against the GAO contractor spending resources to ensure the totals are close. They are: GAO computes $586B (5.86e11) for GFY2019, and my calculation is ~$590B. They published their document on 5/20/2020, and I used data updated on 9/9/2020. That is less than a 1% difference. It is not unusual for accounting adjustments to occur years after the GFY closes.
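The yearly-totals check looks something like this (the Government Fiscal Year runs October through September; column names as assumed above):

```python
# Assign each transaction to its GFY: October-December belong to the
# next calendar year's fiscal year.
df["gfy"] = df["action_date"].dt.year + (df["action_date"].dt.month >= 10)

yearly_totals = df.groupby("gfy")["federal_action_obligation"].sum()
print(yearly_totals.apply(lambda x: f"${x / 1e9:,.0f}B"))  # ~$590B for 2019
```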

For those of us working in the Federal contracting market, the steep decline through GFY2015 was challenging. The rise since the GFY2015 market bottom is equally steep.

How long can that continue at that rate of growth?

Time Series Analytics — First Pass

Now that we’ve loaded the data, let’s dig into the time-series analytics. The first step is to load the Facebook Prophet packages. If you are setting this up on your local machine, I highly recommend a dedicated Python virtual environment: use venv or conda to create one for this project and install the latest packages there. A dedicated virtual environment with the required Python packages avoids the hassle of conflicts with other packages and lets you use Prophet’s latest version.

The latest version of Prophet takes advantage of the power of Dask and multiple cores on your machine (or even multiple machines in a cluster) for the cross-validation steps below to evaluate model performance. This next step checks to see if Dask is installed. It creates a variable to adjust Prophet’s settings to speed processing.
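Something like the following; the variable name is illustrative, and the notebook’s code may differ:

```python
# Use Dask for parallel cross-validation if it is installed; otherwise
# fall back to Python's multiprocessing.
try:
    from dask.distributed import Client
    client = Client()        # local cluster across all available cores
    cv_parallel = "dask"     # passed to Prophet's cross_validation below
except ImportError:
    cv_parallel = "processes"
```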

A simple set of steps is needed to create a Prophet time-series model and a forecast. I have developed a simple convenience function called Build_Prophet_Model_forecast to abstract a few of the settings. That function is short and further simplifies using Prophet in the subsequent examples. The steps below are:

  • Select a subset of the records through GFY2019 and omit GFY2020 since it is incomplete (we are in that GFY, and the DoD records will not be complete until January 2021).
  • Execute a convenience function I created called Prepare_df_USAspending_for_Prophet. This function normalizes the data using the numpy natural log function np.log, since the dollar obligation values range from thousands (1e3) to tens of billions (1e10). I later convert those values back to the proper units using np.exp.
  • Construct the Prophet model using Build_Prophet_Model_forecast (see the sketch after this list).
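Here is a minimal sketch of what those two helpers might look like; the function names come from the post, but their bodies and the column names are assumptions:

```python
import numpy as np
import pandas as pd
from fbprophet import Prophet  # the 2020 package name; now simply `prophet`

def Prepare_df_USAspending_for_Prophet(df):
    # Aggregate obligations to one total per day and rename to the ds/y
    # columns Prophet requires; log-transform to tame the dollar range.
    daily = (df.groupby("action_date")["federal_action_obligation"]
               .sum().reset_index())
    daily.columns = ["ds", "y"]
    daily["y"] = np.log(daily["y"])
    return daily

def Build_Prophet_Model_forecast(df_prophet, periods=730):
    # Fit the model and extend the forecast `periods` days past the data.
    m = Prophet()
    m.add_country_holidays(country_name="US")  # the post models holidays too
    m.fit(df_prophet)
    future = m.make_future_dataframe(periods=periods)
    forecast = m.predict(future)
    return m, forecast
```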

Here is a sample of what Prophet produces:

It has many more fields than the four in the table, but those are the essence of what we need for the subsequent analysis. Those fields are:

  • ds: datestamp
  • yhat: the estimated value expected for that date (yhat is the mathematical symbol statisticians use to represent a predicted value from a model)
  • yhat_lower: the lower bound of the 95% confidence interval, which is ~1.96 standard deviations (aka sigma) below the estimate
  • yhat_upper: the upper bound of the 95% confidence interval, ~1.96 standard deviations above the estimate
  • You may see a yhat_lower that is less than $0, which may seem odd. There are de-obligations in the data that are negative numbers, and the model accounts for that. Some of those de-obligations are corrections, entered within a week or two, for data entry errors that were off by a factor of 10 or 1,000.

The graph below plots the forecast vs. actual data from GFY2010 through GFY2022 (using GFY2010-GFY2019 to fit the model and predicting through GFY2022). The outliers are visible; I will remove those in the next example. The red lines indicate where the model detected a significant change in the trend. The GFY2015 market bottom and subsequent rise can be seen at the far-right red vertical line, between GFY2015 and GFY2016.
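To reproduce a chart like that, Prophet’s plotting helpers are enough (a sketch):

```python
import matplotlib.pyplot as plt
from fbprophet.plot import add_changepoints_to_plot

fig = m.plot(forecast)  # actuals as dots, yhat line, uncertainty band
add_changepoints_to_plot(fig.gca(), m, forecast)  # red trend-change lines
plt.show()
```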

This next chart is produced by Prophet. It plots the primary trend-line pattern in the data. This chart’s Y-axis, with np.log (natural log) values, is unlike the chart above because I normalized the values; they come from the forecast variable’s trend field and can be converted back using np.exp. The light-blue fan from GFY2020 through GFY2022 shows the main trend line and the cone of uncertainty, built from the yhat_upper and yhat_lower variables, which grows with the forecast time horizon. This graph depicts what Prophet’s primary trendline predictor has discovered by analyzing the ~46 million records.
Note that the trend line predicts total daily obligations. For instance, exp(20.4) => ~$724M/day in obligations.

The next graph shows what the Prophet model has detected as the typical yearly spending cycle for Federal obligations (the scale is np.log). Anyone with experience in the Federal market is acutely aware of the end-of-fiscal-year spending surge that runs through 9/30. The Thanksgiving and Christmas holiday declines are evident in late November and late December. Having spent much of my career serving Federal clients, I didn’t expect the sharp decline in October that is clear from the graph: time for a breather after the intense end-of-fiscal-year obligation cycle ends. Note that the scale for this graph is a multiplier of the daily trend shown above. For instance, late-September (end of GFY) spending is exp(1.4) => ~4x the daily trend rate.

Lastly, the model detects the weekly pattern in obligations by day of the week (np.log scale). Looking at this on the full dollar scale makes it easy to see that Thursday is the day with the highest volume across the Federal government. You can see other Agency-specific patterns when you restrict the data to a subset of interest and apply the same modeling technique. Had the datestamps included hour-level detail, Prophet would have modeled that cycle as well. Note that the scale for this graph is a multiplier of the daily trend shown above. For instance, Thursday spending is exp(1.1) => ~3x the daily trend rate, while Monday’s is exp(0.7) => ~2x.
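All three component charts come from a single Prophet call, and np.exp converts the log-scale trend back to dollars:

```python
import numpy as np

# One figure with the trend, holiday, yearly, and weekly components.
fig = m.plot_components(forecast)

# Read the trend in dollars per day, e.g., exp(20.4) => ~$724M/day.
daily_dollars = np.exp(forecast["trend"])
```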

The graphs above depict how the model is constructed from a combination of the main trend line and periodic time components. It models holidays too. I included those in the model but did not show that graph above.

Time Series Analytics — Cleaning Up the Largest Outliers

One problem with using the data as-is is that it contains some large outliers, including outliers that are data entry errors. We need to remove the erroneous entries to build a model with general utility for forecasting and time-series prediction. Some outliers are real, such as multi-year military aircraft purchases (US and foreign military sales), but some are not.
Unfortunately, the backout transaction for a data entry error is not always the same size as the error, which makes the error harder to find when its dollar value alone does not look like an outlier. Often, the backout transaction is a negative number adjusted so that the net amount is correct. The backout transactions (negative amounts) usually appear within a week or two of the data entry error.

To find the high-dollar data entry mistakes, I used the pandas groupby and aggregation functions. I identify cases where the total obligations for the year at the funding_office_name level (the lowest organization level in the source data) are much smaller than the maximum single obligation in that GFY. That means there were one or more sizable de-obligations as well. Caveat: those may not be data entry errors in all cases.
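A sketch of that screen; the threshold and column names are assumptions:

```python
# For each funding office and fiscal year, compare the largest single
# obligation to the office's net annual total.
office_year = df.groupby(["funding_office_name", "gfy"]).agg(
    total_obligated=("federal_action_obligation", "sum"),
    max_obligation=("federal_action_obligation", "max"),
).reset_index()

# Flag offices where one transaction dwarfs the annual net total.
suspects = office_year[
    office_year["max_obligation"] > 100 * office_year["total_obligated"].abs()
]
print(suspects.sort_values("max_obligation", ascending=False).head())
```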

The third (3rd) row in the table below shows a maximum obligation of ~$345 billion (3.45e11), roughly 60% of the entire Federal contracting spend for a GFY and clearly a data entry error. The annual total for that office in FY2013 was only $686M.

The chart below on the left is the same one produced by Prophet with the outliers included. The chart below on the right shows what Prophet models with the outliers removed. Note that the red vertical trend-change lines have not moved, but the graph’s scale for daily obligation totals is smaller by a factor of ten (10).

I have only explored one of many possible sources of issues you may encounter building statistical models when using real data with its usual messiness.

Time Series Analytics — Assessing Model Quality

Prophet includes some useful diagnostic tools for assessing the quality of a model. One of the most popular assessments for time-series models is cross-validation, and Prophet makes it simple to do. Below is the code, which uses the Dask package.
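A sketch of that step; the initial, period, and horizon windows are assumptions, not the notebook’s exact settings:

```python
from fbprophet.diagnostics import cross_validation

# Fit on an initial window, then repeatedly forecast a fixed horizon from
# later cutoff dates; parallel="dask" uses the Client created earlier.
df_cv = cross_validation(
    m,
    initial="1095 days",   # ~3 years of training data
    period="180 days",     # spacing between cutoff dates
    horizon="365 days",    # forecast one year beyond each cutoff
    parallel=cv_parallel,
)
```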

The Prophet performance_metrics function computes a standard set of measures such as mean squared error (mse), root mean squared error (rmse), and ‘coverage’, the fraction of actual values that fall inside the forecast’s uncertainty interval.
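For example:

```python
from fbprophet.diagnostics import performance_metrics

df_metrics = performance_metrics(df_cv)
print(df_metrics[["horizon", "mse", "rmse", "coverage"]].head())
```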

Lastly, I created a simple function to compare the model’s predictions for GFY2019 to the actual values for GFY2019.

Caution: this example violates the rule of excluding test data from training data; I should have built a new model trained only on GFY2010-GFY2018. Since we excluded GFY2020 from the model, we can test how the model performed once we have the final GFY2020 numbers (January 2021). The graph below illustrates how an analyst can display the actual vs. predicted values from the Prophet model forecasts. In this case, I plot the cumulative total obligations (actual vs. predicted) for a GFY.
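One way to build that cumulative comparison (a sketch; it assumes a daily_actuals frame with the log-scale ds/y columns created earlier):

```python
import numpy as np
import matplotlib.pyplot as plt

# GFY2019 runs from 2018-10-01 through 2019-09-30.
mask = (forecast["ds"] >= "2018-10-01") & (forecast["ds"] <= "2019-09-30")
predicted = np.exp(forecast.loc[mask].set_index("ds")["yhat"]).cumsum()

actual = (np.exp(daily_actuals.set_index("ds")["y"])
          .loc["2018-10-01":"2019-09-30"]
          .cumsum())

ax = predicted.plot(label="predicted")
actual.plot(ax=ax, label="actual")
ax.set_ylabel("cumulative obligations ($)")
ax.legend()
plt.show()
```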

Time Series Analytics — Focus on a Federal Department (DHS)

We can now use the same tools and process on ANY subset of the data: by Federal product or service code, geography, Department, Agency, contracting office, set-aside codes, etc. Here, I apply the same techniques to the Department of Homeland Security (DHS). Note that I did NOT remove outliers, nor did I use the np.log normalization transformation for this example. You can try it with those and the previous convenience functions to compare the model results with and without normalization. Does normalizing the data make a major difference in the predictions and patterns uncovered?
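Selecting the subset takes only a few lines; the column name and the agency string are assumptions about the USAspending schema:

```python
# Keep only DHS transactions, aggregate to daily totals (no np.log this
# time), and reuse the model-building helper from earlier.
dhs_daily = (df[df["awarding_agency_name"]
                .str.contains("HOMELAND SECURITY", case=False, na=False)]
             .groupby("action_date")["federal_action_obligation"]
             .sum().reset_index()
             .rename(columns={"action_date": "ds",
                              "federal_action_obligation": "y"}))

dhs_m, dhs_forecast = Build_Prophet_Model_forecast(dhs_daily)
```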

Daily $ Obligations Trend Line
Yearly Seasonal Model (Daily Trend Multiplier)

The DHS obligation pattern differs from the Federal-wide pattern. The peak is Thursday, with Friday a close second, but Monday-Wednesday obligation activity is lower: Thursday is exp(2.1) => ~8.1 vs. Monday’s exp(0.5) => ~1.6, a Thursday/Monday ratio of ~5x.

Daily Model (Daily Trend Multiplier)

Time Series Analytics — Monte Carlo Forecasting

Once you have the Prophet forecasting data, it opens up other forecasting tools. Below is an example of creating a simple Monte Carlo simulation to look at the range of outcomes for the DHS budget in GFY2020.

This Python code uses yhat, yhat_lower, and yhat_upper with functions from the Python random package to simulate a thousand outcomes.
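A sketch of that simulation, assuming the 95% interval maps to roughly ±1.96 sigma on each day’s forecast:

```python
import random
import numpy as np

# Restrict the forecast to GFY2020 (2019-10-01 through 2020-09-30).
gfy20 = dhs_forecast[(dhs_forecast["ds"] >= "2019-10-01") &
                     (dhs_forecast["ds"] <= "2020-09-30")]

annual_totals = []
for _ in range(1000):  # one thousand simulated fiscal years
    total = sum(
        random.gauss(mu, (hi - lo) / (2 * 1.96))  # one draw per day
        for mu, lo, hi in zip(gfy20["yhat"],
                              gfy20["yhat_lower"],
                              gfy20["yhat_upper"])
    )
    annual_totals.append(total)

print(f"mean simulated GFY2020 total: ${np.mean(annual_totals):,.0f}")
```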

This histogram shows the average predicted spending outcome for DHS in GFY2020 as the red vertical line. If history is a guide, the actual contractor spending result could differ from that annual amount by 10% or more.

Time Series Analytics — Focus on Contractors

In this section, I use the same techniques as above to look at the patterns Prophet discovers for specific contractors. I chose two large contractors to illustrate how the model results adjust to the subsets selected.

The model predicts 5% YoY growth for GFY2020 vs GFY2019.

The three (3) year CAGR for this company is ~3.7%.

Here is an example of using the Monte Carlo simulation technique to predict prime obligations for this contractor for GFY2020.

This next example is for another contractor among the ten (10) largest for GFY2019.

This contractor appears to be on a trend line to see a 2x increase in prime obligations! Is that real? Likely? Have I made a coding mistake?

Conclusion

With a few lines of Python code, you can construct time-series models to look for trend changes and patterns that may indicate new growth opportunities or spending areas that have likely peaked.

The art of this analytics work is applying your domain-specific knowledge of the market to the patterns that are revealed. You can test your marketing investment hypotheses to see whether the patterns you think are happening are evident in the data. It is tough to see these patterns without tools like Prophet.

Forecasting the daily or weekly obligation amounts has enormous uncertainty; however, monthly, quarterly, and yearly patterns are more reliable and can help your team understand what is changing in various market segments.

Please share your feedback and insight into how you are using this data and the techniques described here for your market insights.

Coming Attractions

I will explore more analytical and market topics in future blog posts — including market-share trend analysis and NLP techniques using USAspending.gov data and mashups with other open data sources.

The code listed in the examples above and more can be found here:

https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series

MIT Open Source License Copyright 2020 Leif C Ulstrup
