#6 USAspending.gov Analytics — Exploring Top Executive Compensation Patterns

Gaining an Analytics Edge Using Federal Spending Open Data Series

Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.

Purpose: Demonstrate to executives how easy it is to access and analyze free, open-data sources and motivate them to expect more insight and context from their team when making crucial go/no-go investment decisions on bids and new offerings.

Scope: US Federal Contracting Market.

TL;DR — Analyze GB’s of US Federal spending data with a few lines of Python code. Uncover competitive insights from information and analytics.


In this post in my series on extracting analytical insights from USAspending.gov, I will show you how to explore a lesser-known source of analytical insights in the open data source — executive compensation information.

Link to the previous blog post: https://medium.com/@lulstrup/5-is-federal-spending-on-ai-growing-cbb2881cf0d3

Link to the Github repository with Python jupyter notebooks: https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series

Link to jupyter notebook used below: https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series/blob/master/Blog06_Relationship_Between_Executive_Compensation_Obligations.ipynb

Executive Compensation Reporting

USAspending.gov contract obligation archives include fields to capture the top five (5) highest compensated executives for companies meeting specific conditions. Here are the criteria:


Federal Register notice explaining the reporting (see “Disclosure of Executive Compensation”): https://www.federalregister.gov/documents/2012/07/26/2012-17724/federal-acquisition-regulation-reporting-executive-compensation-and-first-tier-subcontract-awards

Another source: https://www.law.cornell.edu/cfr/text/48/52.204-10

The availability of this information has grown over time (see graph below); however, less than 2% of all recipient parent firms contained this information, but more than 10% of all obligation transactions do. This means that the most active and Federal contractors are reporting this information.

Caveat: The data is not rigorously validated (one firm reported $1.9B/year for the top executive when the compensation was closer to $17m/year per SEC.gov reporting). There are data entry errors and variations in how top executives’ names appear.

Though the Federal regulation states that companies publicly traded and report their top five (5) executives to publicly available SEC.gov filings, many of those firms do include this information. Sometimes it is the names and compensation of the corporate officers, and other times it may be the names and compensation of the company’s Federal (or DoD) business unit leaders. When it is for publicly traded firm officers, this is useful in validating the data and identifying common data entry error patterns (e.g., an extra 0 or two).

Loading USAspending.gov Data Into Python Pandas (or Dask)

This analysis’s first step is to load the USAspending.gov Government Fiscal Year (GFY) decompressed (unzipped) CSV files into Python for analysis. I covered the steps to download and read the data in previous Medium blog posts.

In this example, I use pandas and assume you have enough RAM for this. If you do not, I’ve left a Dask version of loading the data in the jupyter notebook commented out.

For this example, I load GFY2010 through GFY2019; however, I am only loading a small subset of the ~280 fields in the CSV files. I am using these fields:

I send the contents of the variable usecols into the usecols argument of pandas read_csv function.

All ten GFY with just these fields takes up ~6.3GB of RAM (~630MB/GFY). If you are memory constrained, you can limit this to only reading one or two GFY by setting the user_select_GFY variable to just the GFY of interest.

Before diving into the executive compensation data, I include some quick calculations to confirm the data matches other sources. The first test is to confirm the reported total annual obligations match external reporting and previous analysis. This chart shows that it does (note units of $100B = 1e11):

I also compute the obligation transaction count trends:

Another quick calculation to see what the trends are for unique recipient_parent_names doing business in the Federal contracting market by GFY:

I was not expecting to see such a dramatic decrease in the number of unique parent recipient prime contractors since GFY2010.

~115k/180k => ~36% decline in the number of prime parent firms in GFY19 vs the previous spending peak in GFY2010.

Exploring the ‘Highly Compensated Officer’ Data

We turn our focus to exploring the highly_compensated_officer data in the USApsending.gov contracting files. From GFY2010 to GFY2019, only about 5% (~95% have $0) of obligation transactions include this data.

Without any data cleanup and removal of the $0.0 entries, this is what the data looks like using the convenient pandas .describe() function:

The count field is the number of obligation transactions and the mean field is the mean for the compensation data in the field highly_compensated_officer_1_amount.

You can see that the vast majority of the fields are $0.0 (the ~95% noted above). You can also see the max of $1.9B/year in compensation in GFY2019, which looks suspicious. Also, you see the same amount as the max every year in the preceding nine (9) years at $306M/year (another large number) — another unusual value to note and investigate. Another quick way to get a sense of the data is to use boxplots:

Note that I included the boxplot function argument showfliers=False. This option removes the outlier values such as the $1.9B/year and $306M/year amounts. It is noticeable that the values become more consistent over time. On a later graph where you can see that reporting of this data has jumped significantly since GFY2010 (Y-axis — obligation record count):

Caveats: There are Errors in the Data

It is easy to use the pandas max() function to find the $1.9B/year amount and the company. A search on that same company and all of their transactions shows most entries at $0.0 (NaN), although some show $19M/year and one $13M/year. A quick lookup of the SEC.gov data for the company indicates that the $13M-$19M/year range is more accurate.

Any analysis that depends on a highly accurate picture of a specific company’s compensation structure for its top five (5) executives needs a thorough scrub of the data. Since GFY18, GFY19, and now GFY20 (excluded from my CSV since the last 90 days of DoD spending is missing) reporting of this data has improved, you can cross-check the more recent reports with earlier reporting for longitudinal studies of compensation trends.

Narrowing the Focus on a Subset of the Market

Given the focus of the earlier blogs on the product_or_service_codes in the “IT and Telecom” and “Professional Services” segments of the Federal market, I will narrow our review to PSC_Cats “D” and “R”.

I will also narrow the reported compensation to reports between $100K/year and $10M/year. Many records show $1/year or minimal amounts since many firms may assume there is little checking of the data.

With those restrictions, I use the pandas describe() function to create this table:

You can see a significant increase in the obligation transaction records that include this field (GFY19 @ ~447k records vs. GFY10 @ ~26k records). The standard deviations (std) are very high due to some compensation that is many multiples higher than the 50% percentile. Other aspects of the table quantities that are strange such as the 50% percentile column having the same amount for five (5) years in a row. That needs further review to understand why that is.

Focusing on GFY2019 alone, there are 1410 unique recipient_parent_firm names that include highly_compensated_officer amounts > $0.0:

A boxplot depiction with the argument showfliers=False to remove outliers:

I wrote a function earlier in the notebook called Scramble_Company_Exec_Names:

I will use that function on the following tables, so real company names and executive names are not shown in the tables that follow. If you are following on your computer, you will see the actual executive names and their companies. Since most are high profile, a Google search will find references to them in press releases, about pages on the company website, and LinkedIn.

Compensation vs Sum of Prime Obligations

For firms that exclusively do work in the Federal market on contracts reported in USAspending.gov, the total of Prime Obligations should closely approximate their revenues (unless they have significant subcontract revenue). To get a rough idea of the relationship, I sum their prime obligations and collect the executive compensation for GFY19 and display it in this table (NOTE: the names are scrambled using the Scramble_Company_Exec_Names function):

I’ve sorted the table by the sum of federal_action_obligation and scrambled the recipient_parent_names. The count field tells you how many obligation transaction records there were for that firm in GFY19 with mean, standard deviation, minimum, and percentile amounts.

Plots of the Annual Compensation (Y-axis, ordinate) and total federal_action_obligation (x-axis, abscissa). All in subset:

Restricted to > $25m in obligations and < $2B:

Restricted to > $25M in obligations and < $150M:

The vast majority of the compensation values fall between $100k/year and $1M/year, but there are some large outliers even with the obligations limited to firms with < $150M in total obligations for GFY19. There are many possible reasons for the outliers, including data entry errors, the companies having large commercial businesses, mostly subcontracts, the company is highly profitable, or something else. More detailed analysis and cleanup are needed to make sure any further analysis uses accurate data.

Note: the names of the companies and officer_1 are scrambled for this table.

Explore the Relative Compensation of the Top 5 Executives

What is the multiple of the top executive (highly_compensated_exec_1) to the other four (4) executives as a multiple (highly_compensated_officer_N_amount/highly_compensated_officer_1_amount):

Histogram representation of the highly_compensated_officer_1_amount:

Closing Thoughts

I’ve presented an incomplete analysis of the executive compensation data. Still, it should give you a sense of the richness of the USAspending.gov open data source and also some of the data cleanup challenges.

What are the analytical insights one can derive from the data?

For instance, can one develop a good model that estimates a company’s profitability using executive compensation data from here and the SEC.gov filings?

What insights can you derive from the structure of the Top five (5) executive compensation structure?

What can a firm learn about the relative competitiveness of its compensation structure?

What can an individual learn about the compensation opportunity of potential promotions and longitudinal studies of executive moves and compensation changes?

Coming Attractions

In future blog posts, I will explore more analytical and market topics — including time-series forecasting and NLP techniques using USAspending.gov data and mashups with other open data sources.

All of the code listed in the example above can be found here:


MIT Open Source License Copyright 2020 Leif C Ulstrup

Strategy, emerging technology, innovation, and management advisor https://www.primehookllc.com/about-us.html, https://www.american.edu/kogod/faculty/ulstrup.cfm

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store