Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.
Purpose: Demonstrate to executives how easy it is to access and analyze free, open-data sources and motivate them to expect more insight and context from their team when making crucial go/no-go investment decisions on bids and new offerings.
Scope: US Federal Contracting Market.
TL;DR — Analyze tens of gigabytes of US Federal spending data with a few lines of Python code
For this fourth blog in my series, I will show you ways to process the tens of gigabytes of USAspending.gov data you downloaded to your computer (or even a cluster of computers on your local network or in the cloud) in the previous blog (#3).
Using this code, you can begin extracting analytical insights to inform your business investment decisions for bids and the development of new market offerings.
In future blogs, I will build on this work to demonstrate the analytical insights you can extract from USAspending.gov data to improve your BD, acquisition, and market offering development decisions.
You can find all of the Python code for this series in this GitHub repository: https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series
The jupyter notebook (*.ipynb) Python code examples for this blog and the next (#5) are here:
I am NOT displaying all of the code used to reproduce the outputs in this blog. Look at the jupyter notebook Python code while you review the steps below.
Working with full USAspending.gov annual archives is compute- and RAM-intensive. An annual archive file typically holds 6–7 million records with 280+ text fields and is about 10GB uncompressed. I am using a 16" MacBook Pro (2019) with an 8-core i9 processor and 64 GB of RAM. The examples I show below use both the pandas package, for ultra-fast in-memory calculations on data that fits in RAM, and the scalable Dask package that rides on top of pandas. Dask enables you to process datasets that far exceed your RAM and to use multiple computers in a cluster. The developers of Dask have made it extremely easy to work with a cluster of machines (including cloud-based instances), in what was once a complicated and time-consuming effort.
Setting up your Single-machine Environment
The first step is to import the Python packages you need (you will need to install these using conda or pip if you don’t have them on your machine already).
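The import cell at the top of the notebook looks roughly like this (a sketch; the notebook may pull in additional packages):

```python
# Standard-library helpers used throughout the notebook
import glob             # find the expanded *.csv files on disk
import multiprocessing  # optional: parallel CSV loading
import os

import pandas as pd

# The notebook also uses psutil (resource monitoring) and Dask
# (out-of-core processing), imported along the lines of:
#   import psutil
#   import dask.dataframe as dd
#   from dask.distributed import Client
```

If any of these are missing, install them with conda or pip (e.g., `pip install pandas "dask[complete]" psutil`).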
I created a small function that uses the psutil package to probe and monitor RAM usage and CPU availability. This function is helpful to have in the Python code since pandas and Dask will need lots of machine resources to process the USAspending.gov annual archives.
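A minimal sketch of such a helper, assuming only the psutil calls named above (the function name and output format are mine, not necessarily the notebook's):

```python
import psutil

def check_resources():
    """Report free RAM (GB) and core counts using psutil."""
    vm = psutil.virtual_memory()
    free_gb = vm.available / 1024**3
    physical = psutil.cpu_count(logical=False)  # may be None on some platforms
    logical = psutil.cpu_count(logical=True)
    print(f"Available RAM: {free_gb:.1f} GB of {vm.total / 1024**3:.1f} GB")
    print(f"CPU cores: {physical} physical / {logical} logical")
    return free_gb, physical, logical

free_gb, physical, logical = check_resources()
```

Calling it before a big load tells you whether a full GFY archive will fit in memory on your machine.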
Reading the USAspending.gov Government Fiscal Year (GFY) Archive Files
Assuming you used the code from the 3rd blog in this Medium series, the next step identifies the GFY archives that you decompressed (unzipped) into a directory on your machine. Each *.csv file holds up to 1 million records. For instance, if a GFY archive has 6.4 million records, you will see 7 CSV files starting with that archive's FY after you decompress the archive file.
The code assumes that there are two folders:
(1) One with the GFY archive files (in my example, it is “USAspending_Archive_Downloads”)
(2) A sub-folder called “Expanded_CSV_Files”
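A sketch of the file-discovery step under those folder-name assumptions (the FY2019 pattern string is illustrative; adjust it to the year you expanded):

```python
import glob
import os

archive_dir = "USAspending_Archive_Downloads"
csv_dir = os.path.join(archive_dir, "Expanded_CSV_Files")

# Each GFY archive expands into chunks of up to 1M records,
# whose filenames start with that fiscal year.
fy_csv_files = sorted(glob.glob(os.path.join(csv_dir, "FY2019*.csv")))
print(f"Found {len(fy_csv_files)} CSV files for FY2019")
```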
Functions to Handle Limitations in the Files
The quality of data in USAspending.gov GFY archive files has improved over time; however, there are known limitations you need to address, and there are some convenient new columns that I like to add to streamline later analysis. The jupyter notebook has functions to deal with missing data (aka NaN values) by replacing them with “UNSPECIFIED”, and with cases where the recipient_parent_name should have changed after an acquisition but has not. You may find others that are useful to you as you become familiar with the data. I recommend that you consolidate those transformations in the Enhance_Spending_File function so you have a single place for them and can be sure the transformations have been applied.
One of the most significant caveats is to make sure you replace NaN values with something. If not, pandas (and Dask) may return unexpected results on groupby transformations of the data. For instance, when computing a groupby sum of federal_action_obligation (the $ value of an obligation record), pandas silently drops any row whose groupby key is NaN, so the dollar totals come up short. (Newer pandas versions also offer groupby(..., dropna=False) to keep NaN keys.) This issue took me a fair amount of time to figure out when I first started using pandas and this dataset several years ago. Replacing the NaN with “UNSPECIFIED” for the groupby field solves it. You can choose terms meaningful to you and use different terms for NaN values in other columns (fields).
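A tiny, self-contained demonstration of the problem and the fix (the company name and dollar values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "recipient_parent_name": ["ACME CORP", np.nan, "ACME CORP"],
    "federal_action_obligation": [100.0, 50.0, 25.0],
})

# Without the fix, rows whose groupby key is NaN are silently dropped:
naive = df.groupby("recipient_parent_name")["federal_action_obligation"].sum()
print(naive.sum())   # 125.0 -- the $50 NaN-key row is missing

# Replace NaN keys first so every dollar is counted:
df["recipient_parent_name"] = df["recipient_parent_name"].fillna("UNSPECIFIED")
fixed = df.groupby("recipient_parent_name")["federal_action_obligation"].sum()
print(fixed.sum())   # 175.0
```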
There are many articles and some key tools for exploring your dataset with pandas. This article from realpython.com is a good place to learn more. One of the most useful pandas tools is the .describe() function, which gives you quick insight into the data in a DataFrame or Series (column). The .value_counts() function is very useful too. Histogram plots, “.isnull().values.any()”, and df.isnull().sum() can give you a quick sense of the data you are working with.
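For example, on a toy DataFrame with the same column names as the archive files:

```python
import pandas as pd

df = pd.DataFrame({
    "federal_action_obligation": [100.0, 250.0, None, 75.0],
    "recipient_state": ["VA", "VA", "MD", None],
})

print(df["federal_action_obligation"].describe())  # count/mean/min/max summary
print(df["recipient_state"].value_counts())        # frequency of each value
print(df.isnull().sum())                           # NaN count per column
print(df.isnull().values.any())                    # True if any NaN anywhere
```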
Loading the Data
The next section of code includes functions to load multiple CSV files into a pandas DataFrame. The Dask tools make it easy to do that using a *.csv wildcard; however, for pandas, you must read each CSV file and append it to the DataFrame. I’ve included two functions to do that, including using the Python multiprocessing package to speed up the process if you are NOT disk-read limited.
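A simplified, single-process sketch of the loader (I use pd.concat here, since repeatedly appending to a DataFrame is slow and DataFrame.append is deprecated in recent pandas; the generated files below are throwaway stand-ins for the archive chunks):

```python
import glob
import os
import tempfile

import pandas as pd

def load_csvs_to_pandas(pattern, dtypes=None):
    """Read every CSV matching pattern and concatenate into one DataFrame."""
    files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f, dtype=dtypes) for f in files]
    return pd.concat(frames, ignore_index=True)

# Tiny demo with two throwaway CSV "chunks":
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        chunk = pd.DataFrame({"federal_action_obligation": [1.0 * i, 2.0 * i]})
        chunk.to_csv(os.path.join(tmp, f"FY2019_chunk{i}.csv"), index=False)
    df = load_csvs_to_pandas(os.path.join(tmp, "FY2019_*.csv"))
    print(len(df))  # 4
```

The multiprocessing variant in the notebook parallelizes the `pd.read_csv` calls across files, which helps only if disk reads are not the bottleneck.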
Note that I have also created a function called Build_DTypes_Dict. It uses pandas to read just the first line of a GFY archive file to get the columns (fields) [see nrows=1]. The dictionary it returns forces pandas and Dask to read the data using the types I have designated rather than inferring them on their own. Except for federal_action_obligation as ‘float64’, I have pandas read everything else as an ‘object’ type, essentially the str (string) type in Python. Declaring data types (dtype) speeds up the pd.read_csv function and reduces problems downstream when querying the data; however, it is NOT the most efficient way to read the data. Many of the fields can be read as ‘category’ variables and some as datetime fields. Also, there are other obligation-related fields that should be float64. Converting the fields with categorical data (e.g., recipient_state, recipient_name, etc.) will dramatically decrease the RAM needed to store the data in memory.
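A sketch of the idea, assuming the two-dtype scheme described above (the function body is my reconstruction, not a copy of the notebook's):

```python
import os
import tempfile

import pandas as pd

def build_dtypes_dict(csv_path):
    """Read only the header (plus one row via nrows=1), then map every
    column to 'object' except federal_action_obligation (float64)."""
    header = pd.read_csv(csv_path, nrows=1)
    dtypes = {col: "object" for col in header.columns}
    dtypes["federal_action_obligation"] = "float64"
    return dtypes

# Demo on a tiny stand-in for an archive chunk:
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "FY2019_sample.csv")
    pd.DataFrame({"action_date_fiscal_year": [2019],
                  "federal_action_obligation": [100.0]}).to_csv(path, index=False)
    dtypes = build_dtypes_dict(path)
    df = pd.read_csv(path, dtype=dtypes)  # fiscal year now read as a string
```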
You can convert columns of a DataFrame to a new type such as ‘category’ using the ‘.astype(“category”)’ function after reading the data.
However, some issues can crop up in pandas queries with groupby on categorical variables. This article at pbpython.com has some handy information on the techniques and performance improvements from using categorical variables. It also includes a “Watch Outs” section on working with categorical variables in pandas, noting “…categorical data can yield some surprising behaviors in real world usage…” Another good reference on pandas groupby usage with categorical variables is at realpython.com.
I recommend extending the Build_DTypes_Dict function to be the central place to capture all of the datatype transformations in your work, so they happen on reading the CSV file. One of the most common problems is confusion over data types when querying pandas DataFrames when the field can be a string (object) or integer. For instance, selecting all FY after 2015 is different if 2015 is stored as a string (object) vs. an integer — the > will work differently. If you are querying data and not seeing the expected results, this can be the problem.
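Here is the gotcha in miniature:

```python
import pandas as pd

fy = pd.Series(["2014", "2016", "2019"])   # fiscal years stored as strings

# Comparing an object (string) column against an int does not quietly
# filter -- it raises, which is at least the safer failure mode:
try:
    fy > 2015
    str_int_comparison_ok = True
except TypeError:
    str_int_comparison_ok = False

# Convert explicitly, then compare:
after_2015 = fy[fy.astype(int) > 2015]
print(list(after_2015))  # ['2016', '2019']
```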
Using Dask to Load the CSV Files
One of the best advances in using pandas with large datasets is the Dask package. The developers have radically simplified the effort to work with data that does not fit in RAM. They have abstracted the complexity of setting up machine clusters, using GPU cards to offload processing, and using cloud-based resources.
Dask has many similarities with the Apache Spark package; however, Dask works in Python in a native mode and does not require one to set up a Java environment or deal with Java-based warning and error messages. Here is a link to resources to learn more about Dask. You can learn more via @dask_dev and https://github.com/dask.
dask_workers = int(psutil.cpu_count(logical=False)/2) # use half of the available cores to be safe
client = Client(n_workers=dask_workers)
Dask’s Diagnostic Dashboard can be watched in a separate browser window using the Dashboard URL that appears in a window like the one above.
The next step sets up the steps in Dask to read the CSV files.
To return the ddf.head() response, Dask only has to read the first few lines of one CSV file and returns that very quickly (~9s). Scanning through a full GFY archive (with 6 million records or more) will take minutes on a fast machine.
The first test to confirm that it is working is to group the data by GFY and compute the total obligations using the federal_action_obligation field. You can Google multiple other sources to ensure you are calculating these correctly. This total is for Federal Contracting (it does not include grants etc.).
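The pattern is a one-line groupby; here it is on a toy DataFrame that reuses the archive's column names (the dollar values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "action_date_fiscal_year": ["2018", "2018", "2019", "2019", "2019"],
    "federal_action_obligation": [300.0, 255.0, 400.0, 100.0, 89.0],
})

# Total obligations by fiscal year -- the same line works on a Dask
# DataFrame if you append .compute() at the end.
totals = (df.groupby("action_date_fiscal_year")["federal_action_obligation"]
            .sum()
            .sort_index())
print(totals)
```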
The US GAO WatchBlog report on GFY19 contractor spending published 5/26/2020 (using this same data source) says “…In FY 2019, the federal government spent more than $586 billion on these contracts, an increase of over $20 billion from FY 2018…”. Our figure is $589B for GFY19 and $555B for GFY18. The difference between $589B and $586B may be some additional spending included in the file archive I used from August 2020.
Working with GFY 2019
To illustrate how you can move from Dask back to pandas, I will read all of the FY2019 records (‘action_date_fiscal_year’ == ‘2019’) into a pandas DataFrame. This operation took almost 12 minutes on my machine; however, once it completes, most subsequent operations across the ~6.4 million GFY19 records run in a second or less.
Should you NOT have enough RAM (16GB or more) to do this, you can alter the Python to NOT include “.compute()” when creating df_subset and instead add “.compute()” after the downstream groupby steps where I compute the top contractors, top product_or_service categories, etc. There is a test in a cell before df_subset is created to see if you have enough free RAM to load it into memory using pandas instead of Dask. You can experiment with this code, adding the “.compute()” suffix to the downstream df_subset operations depending on the value of USE_PANDAS.
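One way to keep the downstream cells agnostic about whether df_subset is pandas or Dask is a tiny helper (the helper name is mine; it relies on the fact that Dask collections expose .compute() while pandas objects do not):

```python
import pandas as pd

def maybe_compute(obj):
    """Return a concrete pandas object whether obj is already pandas or a
    lazy Dask collection."""
    return obj.compute() if hasattr(obj, "compute") else obj

# With enough RAM, df_subset is a plain DataFrame and maybe_compute is a
# no-op; if it were a lazy Dask DataFrame, the same call would trigger
# the computation. Toy data stands in for the GFY19 subset here:
df_subset = pd.DataFrame({"recipient_parent_name": ["A", "A", "B"],
                          "federal_action_obligation": [10.0, 5.0, 7.0]})
top = maybe_compute(
    df_subset.groupby("recipient_parent_name")["federal_action_obligation"].sum())
print(top["A"])  # 15.0
```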
The pandas “.value_counts()” function is useful in exploring the data. Using that function to compute the contractors with the greatest number of obligation transactions in the data yields the following:
You see firms such as AmerisourceBergen (“…As a single source for tens of thousands of products across human and animal health, we are a key link in the pharmaceutical supply chain…”) and W.W. Grainger (“…Grainger is America’s trusted source for MRO supplies and industrial products…”), which supply consumables and many other high-volume products. There are 6,476,878 federal obligation transactions in GFY19, and the #1 firm, at 406,233 transactions, accounts for ~6.3% of them. The top 10 total 1,825,964 transactions and represent ~28% of all obligation transactions in GFY19, a variation on Pareto’s Principle in action. None of the top 10 firms in the transaction-count list show up in the top 50 contractors by $ value; their average $/obligation is very low. I didn’t calculate that in the jupyter notebook, but you can add it easily.
The first analytical view of the GFY19 data is to compute the Top N contractors by total federal_action_obligation value.
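The Top N calculation is a groupby, a sum, and nlargest (the firms and dollar values below are illustrative stand-ins, not real GFY19 figures):

```python
import pandas as pd

df = pd.DataFrame({
    "recipient_parent_name": ["LOCKHEED MARTIN", "BOEING", "LOCKHEED MARTIN",
                              "GENERAL DYNAMICS", "BOEING"],
    "federal_action_obligation": [30.0, 12.0, 18.0, 20.0, 14.0],
})

TOP_N = 3
top_n = (df.groupby("recipient_parent_name")["federal_action_obligation"]
           .sum()
           .nlargest(TOP_N))
print(top_n)
```

On the real data the same chain runs over ~6.4 million rows; remember to fill NaN parent names first so no obligations drop out of the totals.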
You can compare this list to a press release of a report using the same data source to see how close your numbers are.
The numbers are very close to the same in most cases. They are using the same USAspending.gov source for their calculations. Some of the differences are likely due to M&A not reflected in the USAspending.gov data and manual adjustments by the authors based on their knowledge of M&A, or to lingering data issues (for instance, after a merger there are often parent-name records with old names). Rollout of the GSA Unique Entity Identifier (UEI) would mitigate some of these problems.
For instance, the reference above shows the combined L3 + Harris merger at #10, whereas this data has L-3 at #30 with $2.573B, Harris at #37 with $1.936B, and “L3 Technologies” at #41 with $1.654B, for roughly $6.2B (missing ~$1.2B that must be booked under other L3 or Harris entities further down the list). For the very largest firms, it may take extra work to confirm the corporate hierarchies. Most companies in USAspending.gov have a more straightforward structure, and the recipient_parent_name suffices. There may also be differences due to errors (e.g., changes in recipient names) in the USAspending.gov data that must be fixed manually in your Python code. Until GSA rolls out the data’s unique identifiers (UEI), you may have to write some code of your own for entity resolution and tracking parent/child firm relationships.
As the Dask best practices documentation emphasizes, if you can use pandas directly and do in-memory calculations, you should do that. Since my computer has enough RAM, working with GFY19 in pandas is very fast. For instance, pandas performed a groupby and total calculation across 6,476,878 records in under 2 seconds on my machine (CPU times: user 1.5 s, sys: 116 ms, total: 1.61 s; wall time: < 2 s).
Analytical Insights about GFY19
One of the patterns you will observe in the data is a ‘power law’ relationship for market share. You might know this as Pareto’s Law (or distribution, principle) or 80/20 Principle. Here is an example:
As you dive into the Department and Agency (even Office level) data, you will see more ‘power law’ examples. As the scale of the Agency gets smaller, being #1 in market share can be worth two or more times the prime obligations that number two receives — capturing market share matters! I will explore this topic more in later blog posts.
This next table illustrates the distribution of GFY19 spending across the primary Product and Service Codes (PSC) that the US Government uses to define the dominant type of work executed within a contract. Many service contracts are a combination of these categories, but the Government designates one as the primary.
Note that PSC_Cat R “Support Services (Professional, Administrative, Management)” is #1 with almost 15% of contract spending, R&D #2 with ~8%, and IT and Telecommunications services/solutions #3 at ~7.5%.
For a view of the relative size of Department-level spending, I’ve grouped the data by ‘funding_agency_name’.
DoD’s outsize role with 66% of contract spending is evident in this visual.
If you can replicate my results on your machine, you are off to a great start to dive into more sophisticated analytics insights in the data.
In my next blog, I will explore the second half of the jupyter notebook to illustrate other types of analytics possible with the data. We will revisit the CMS data example from the previous posts and look across multiple GFYs to identify trends. We will also examine mentions of “AI”, “machine learning”, and “analytics” in the award_description and major_program fields to see how awards that mention those topics are growing over time.
All of the code listed in the example above can be found here:
MIT Open Source License Copyright 2020 Leif C Ulstrup
About the author:
Leif Ulstrup @lulstrup is the founder of Primehook Technology and an experienced general manager with a proven track record, adapting emerging technology and management innovations to achieve significant business and mission results. His experience spans from the management of large enterprise profit centers to consulting to senior executives faced with critical business and mission transformation challenges.