Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.
Purpose: Demonstrate to executives how easy it is to access and analyze free, open-data sources and motivate them to expect more insight and context from their team when making crucial go/no-go investment decisions on bids and new offerings.
Scope: US Federal Contracting Market.
TL;DR — Analyze gigabytes of US Federal spending data with a few lines of Python code
For this fifth blog in the series, I will show you some examples of the information now at your fingertips and preview some analytics. I will dive deeper into analytical techniques in subsequent posts.
In the previous posts, I demonstrated how to access USAspending.gov data and start processing it on your computer (or cloud-based resources). Links to earlier blogs in this series:
Building on the Blog #4 Jupyter Notebook Python Code
You will need to repeat some of the setup at the start of the notebook from my Medium post #4, USAspending.gov Analytics — Loading Data into Dask and pandas, to access the USAspending.gov data you downloaded and to configure Dask and pandas to process it.
The great thing about Jupyter notebooks is how easy it is to repeat the previous steps and build on your work.
Launch Jupyter Notebook from the terminal application on your computer and then open the file “Analytics_Edge_Blog_Series_4_Processing_USA_Spending_GFY_Archive_Files_V1pt0_github.ipynb”.
Step your way through the notebook until you get to the section entitled: “Begin Medium Blog #5”. We will pick up the analysis from here.
In the first blog in this series (see the section Starting with an Example), we imagined that we are a commercial provider of insurance fraud detection software interested in entering the Federal market. We have a hunch and some insight from a former customer that CMS at HHS may want to expand their fraud detection and “program integrity” investments.
We will continue with that scenario and then demonstrate how to use the same analytical techniques to explore other Federal Government buying patterns for subjects such as “AI,” “machine learning”, “predictive analytics”, etc.
Start of Analysis — Focus on CMS Spending
We will use the Dask DataFrame we created earlier in the notebook to select records across Government Fiscal Years (GFY) for the Centers for Medicare & Medicaid Services (CMS) agency inside the Department of Health and Human Services (HHS).
Since I restricted Dask to read GFY 2017–2019, only those fiscal years will appear for CMS. We can adjust that earlier in the notebook to retrieve more years. The notebook reports:
It took more than 18 minutes to read through the 30GB of data in the three GFY and extract the CMS data. A faster, more manual alternative is to use the Advanced Search feature of USAspending.gov to select that data. Since those files are relatively small (~24MB for the three years), you can read them directly into a pandas DataFrame (using pd.read_csv). I did not do that here because I want to demonstrate the general approach of extracting a subset across GFY for further analysis.
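The selection step can be sketched as follows. This is a minimal illustration on a tiny synthetic frame standing in for the real Dask DataFrame built earlier in the notebook; the column names (awarding_sub_agency_name, fiscal_year, federal_action_obligation) are my assumptions based on the archive layout used in the earlier posts.

```python
import pandas as pd

# Tiny synthetic stand-in for the Dask DataFrame `ddf` built earlier in
# the notebook; the column names are assumptions from the archive layout.
ddf = pd.DataFrame({
    "awarding_sub_agency_name": [
        "CENTERS FOR MEDICARE AND MEDICAID SERVICES",
        "NATIONAL INSTITUTES OF HEALTH",
        "CENTERS FOR MEDICARE AND MEDICAID SERVICES",
    ],
    "fiscal_year": [2017, 2018, 2019],
    "federal_action_obligation": [1_000_000.0, 250_000.0, 2_000_000.0],
})

# Select all CMS transactions; on the real Dask DataFrame, append
# .compute() to materialize the selection as a pandas DataFrame.
mask = ddf["awarding_sub_agency_name"].str.contains("MEDICARE", case=False, na=False)
df_CMS = ddf[mask]
print(len(df_CMS))  # number of CMS transaction rows selected
```

With Dask, the same boolean mask expression is lazy; nothing is read from disk until `.compute()` runs, which is why the full scan takes minutes.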
Now that we have a pandas DataFrame of the CMS spending across the three GFY, we can quickly compute some summary information. For example:
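A sketch of that summary computation, on synthetic rows; the column names are assumptions carried over from the earlier posts, and the figures are invented.

```python
import pandas as pd

# Illustrative CMS subset; real column names are assumptions from the
# archive files used earlier in the series.
df_CMS = pd.DataFrame({
    "fiscal_year": [2017, 2017, 2018, 2019, 2019, 2019],
    "federal_action_obligation": [1.0e6, 2.0e6, 5.0e5, 3.0e6, 1.5e6, 2.5e6],
})

# Transaction counts and total obligated dollars per fiscal year
summary = df_CMS.groupby("fiscal_year").agg(
    transactions=("federal_action_obligation", "size"),
    total_obligated=("federal_action_obligation", "sum"),
)
print(summary)
```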
We can see that the number of transaction-level contract obligations for CMS ranged from 3,302 in GFY2018 to 3,827 in GFY2019, a dataset small enough to analyze in an Excel spreadsheet. This approach might be overkill here, but you will see its advantages as we look across larger cross-sections of ~10GB per GFY.
What were the top 5 types of products and services CMS obligated funds for in GFY19? In GFY2019, it spent ~$7.2B on the following broad categories (just showing the top 5):
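That roll-up is a one-line groupby; a sketch on a tiny invented frame (product_or_service_code_description is an assumed column name from the archive layout, and the dollar figures are made up):

```python
import pandas as pd

# Illustrative GFY19 CMS rows; column names are assumptions from the
# archive files, and the amounts are invented for demonstration.
df_GFY19 = pd.DataFrame({
    "product_or_service_code_description": [
        "IT AND TELECOM", "IT AND TELECOM", "SUPPORT- PROFESSIONAL",
        "MEDICAL SERVICES", "R&D", "OFFICE SUPPLIES",
    ],
    "federal_action_obligation": [4.0e9, 1.0e9, 1.2e9, 0.6e9, 0.3e9, 0.1e9],
})

# Sum obligations per category and keep the 5 largest
top5 = (
    df_GFY19.groupby("product_or_service_code_description")["federal_action_obligation"]
    .sum()
    .nlargest(5)
)
print(top5)
```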
Who were the top 10 contractors at CMS in GFY19?
How have the market share and total obligations changed for the GFY2019 #1 contractor since GFY2017?
You can see that they increased their market share from ~12.7% of total CMS spending to ~14% and grew annual prime contract dollars from $0.863B to $1.022B, an increase of ~$159M/year. Some of that growth came from acquisitions of companies, as reported in the industry press. Knowing that, an analyst could dive deeper and estimate how much growth was organic vs. acquisition-driven.
It is easy to plot the trend:
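A sketch of that plot with matplotlib; the GFY2018 figure here is invented for illustration, and the series values are indexed in $B:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Annual prime obligations for the #1 CMS contractor; the 2018 value
# is an invented placeholder for illustration.
trend = pd.Series({2017: 0.863, 2018: 0.95, 2019: 1.022}, name="Obligations ($B)")

ax = trend.plot(kind="bar", title="CMS #1 Contractor Prime Obligations by GFY")
ax.set_xlabel("Government Fiscal Year")
ax.set_ylabel("$B")
plt.tight_layout()
plt.savefig("cms_top_contractor_trend.png")
```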
The previous blog post discussed how market share patterns in Federal obligations tend to follow a ‘power law’ (also known as Pareto or 80/20). You can see that pattern here:
A quick calculation above shows that being #1 is worth 2.49 times the prime federal obligations of #2, or ~$611M/year in additional funding: 14% market share vs. ~6%. In future blog posts, I will explore how stable market share is over a decade within an agency.
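That calculation can be sketched as follows, on invented contractor totals chosen so the #1-to-#2 comparison mirrors the figures above; the names and the tail of the distribution are made up:

```python
import pandas as pd

# Illustrative GFY19 contractor totals in $B (names and tail invented)
by_contractor = pd.Series(
    {"A": 1.022, "B": 0.411, "C": 0.2, "D": 0.1, "E": 0.05}
).sort_values(ascending=False)

share = by_contractor / by_contractor.sum()
ratio_1_to_2 = by_contractor.iloc[0] / by_contractor.iloc[1]  # how much #1 outweighs #2
dollar_gap = by_contractor.iloc[0] - by_contractor.iloc[1]    # extra $B/year for #1

# Pareto check: what fraction of obligations do the top 2 hold?
cumulative = share.cumsum()
print(round(ratio_1_to_2, 2), round(dollar_gap, 3), round(cumulative.iloc[1], 2))
```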
Fraud Detection and Prevention Spending at CMS
Now that we have a big-picture view of CMS, we shift back to the scenario and the topic of exploring ‘fraud’ prevention. Which firms have been the top 5 contractors for that topic from GFY2017-GFY2019?
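The keyword selection can be sketched with pandas’ str.contains on a tiny invented frame; on the real data, the same mask applies to the Dask DataFrame, followed by .compute(). Column names are assumptions from the archive layout.

```python
import pandas as pd

# Illustrative rows; award_description and the other columns are assumed
# names from the archive files, and all values are invented.
df = pd.DataFrame({
    "recipient_name": ["ALPHA", "BETA", "GAMMA", "ALPHA"],
    "award_description": [
        "MEDICARE FRAUD DETECTION ANALYTICS",
        "PROGRAM INTEGRITY SUPPORT SERVICES",
        "OFFICE FURNITURE",
        "FRAUD PREVENTION SYSTEM O&M",
    ],
    "federal_action_obligation": [5.0e6, 3.0e6, 1.0e5, 2.0e6],
})

# Case-insensitive match on either term in the award description
pattern = "fraud|program integrity"
mask = df["award_description"].str.contains(pattern, case=False, na=False)

# Top contractors on the topic by total obligated dollars
top5 = (
    df[mask].groupby("recipient_name")["federal_action_obligation"].sum().nlargest(5)
)
print(top5)
```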
Which were in the top 10 for this topic in GFY2019:
What can we learn about the full scope of work performed across ALL Federal Agencies from GFY17-GFY19?
Across the entire Federal market from GFY17-GFY19, where else are these contractors doing work in which the terms ‘fraud’ or ‘program integrity’ appear in the ‘award_description’ field?
If you are the provider of insurance fraud prevention technology, where else can you find the term ‘fraud’ in the award_description field of obligations in other Agencies?
From this analysis, you can see that HHS/CMS spending where the words ‘fraud’ or ‘program integrity’ appear is much larger than in all other agencies listed. That does NOT mean there is no fraud prevention work at scale embedded in other programs. The award_description field is concise and usually only includes the major items being purchased. Full-text searches of the beta.sam.gov (formerly fedbizopps.gov) site (awarded and contract opportunities) are necessary to explore a topic like this in depth. A quick search of the term in beta.sam.gov under ‘All Award Data’ shows that many of the opportunities are funded through grants and NOT contracts. For example:
Grant information is available through USAspending.gov but not in the datasets I downloaded for my examples. Instead of selecting the data archive for Award Type “Contracts” on the USAspending.gov Download Center Archives, you need to select Award Type “Financial Assistance”.
Since this example is for a company that wants to offer its services to the Federal Government, it needs to know when some of the existing contracts will come up for recompete. Here is an example of how to begin that analysis:
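One way to begin, assuming the archive’s period_of_performance_current_end_date and award_id_piid columns, is to parse the end dates and filter for contracts ending in an upcoming fiscal year; a sketch on invented rows:

```python
import pandas as pd

# Illustrative contract rows; column names are assumptions from the
# archive files, and the PIIDs and dates are invented.
df = pd.DataFrame({
    "award_id_piid": ["PIID001", "PIID002", "PIID003"],
    "recipient_name": ["ALPHA", "BETA", "GAMMA"],
    "period_of_performance_current_end_date": ["2021-03-31", "2020-06-30", "2022-09-30"],
})

# Parse end dates; errors="coerce" turns malformed dates into NaT
df["pop_end"] = pd.to_datetime(
    df["period_of_performance_current_end_date"], errors="coerce"
)

# Contracts ending in GFY2021 (Oct 2020 through Sep 2021) are candidates
# for near-term recompetes.
recompetes = df[(df["pop_end"] >= "2020-10-01") & (df["pop_end"] <= "2021-09-30")]
print(recompetes[["award_id_piid", "recipient_name", "pop_end"]].sort_values("pop_end"))
```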
Sites such as fpds.gov (which will soon move to beta.sam.gov on 10/17/2020) can help you explore those contracts using the award_id_piid and parent_award_id_piid fields.
I used the convenient pandas df.to_csv(filename, index=False) command to write the data to a file. That CSV file is easily imported into Excel, Google Sheets, or Numbers for further analysis. You can do this at any point and either read the data back in later or switch to another analytical tool you prefer.
Spending on AI Growing Across the Federal Market
Now that we have re-explored this topic over three GFY, let’s pivot to using the same technique to look at a concept and term that is frequently in the headlines — Artificial Intelligence or AI.
I first assemble a list of terms often associated with the concept of AI; the list is not exhaustive, and it will not handle typos or related concepts. We would need a more sophisticated set of Natural Language Processing (NLP) tools to search for a broader set of concepts associated with AI and even handle typos. I’ll save that for a later blog post. Here is the code I used:
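The original notebook code appears here; the following is a hypothetical reconstruction of the approach, with an illustrative term list and tiny synthetic data rather than the exact contents of the notebook. With Dask, the same str.contains mask applies, followed by .compute().

```python
import pandas as pd

# A starter list of AI-related terms (illustrative, not exhaustive)
AI_terms = [
    "artificial intelligence",
    "machine learning",
    "deep learning",
    "neural network",
    "natural language processing",
    "predictive analytics",
]
pattern = "|".join(AI_terms)  # regex alternation: match any term

# Illustrative rows standing in for the three-GFY Dask DataFrame
ddf = pd.DataFrame({
    "award_description": [
        "MACHINE LEARNING MODEL DEVELOPMENT",
        "JANITORIAL SERVICES",
        "ARTIFICIAL INTELLIGENCE RESEARCH SUPPORT",
    ],
    "federal_action_obligation": [2.0e6, 1.0e5, 4.0e6],
})

mask = ddf["award_description"].str.contains(pattern, case=False, na=False, regex=True)
ddf_Contractor_Explore_AI = ddf[mask]  # with Dask, follow with .compute()
print(len(ddf_Contractor_Explore_AI))
```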
The Python code above executes another search across the entire set of three-GFY CSV file subsets we have been working with. Selecting this subset of data took Dask almost 23 minutes on my laptop (~30GB of CSV data scanned). If you have access to a cluster of machines on your local network or in the cloud, Dask can process this data in parallel and significantly shorten the elapsed time. I’ll explore some options for that in future posts if there is interest.
We now have a DataFrame called ddf_Contractor_Explore_AI that we can explore. We run that through a custom function explained in Blog #4 called Create_obligations_DF and see the following (excerpt):
The full table has 417 rows.
What are the spending trends for AI-related obligation awards?
The table and graph show that GFY19 AI-related spending is 50% greater than in GFY17. The GFY19 total in the table is likely only a small fraction of all AI-related spending: AI technology is used in a broad spectrum of programs where it appears in the request for proposal (RFP), statement of work (SOW), or statement of objectives (SOO) but is NOT listed in the short award_description. Understanding the Federal AI market’s full scale requires analysis of other sources; even so, this year-over-year growth is very high relative to overall Federal agency budget growth, which runs in the single-digit percentages.
Even with those caveats, it gives a business developer, account manager, or industry analyst a sense of the trends for increased spending and/or labeling work as AI.
If these growth trends continue, what can we expect for GFY2020? GFY2021?
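A back-of-the-envelope projection, assuming the two-year growth simply compounds; spending is indexed to GFY17 = 1.0 rather than using the real dollar totals:

```python
# Naive constant-growth extrapolation from the GFY17 -> GFY19 trend.
# Index GFY17 spending to 1.0; the text reports GFY19 is ~50% higher.
gfy17, gfy19 = 1.0, 1.5
cagr = (gfy19 / gfy17) ** (1 / 2) - 1   # two-year compound annual growth rate

gfy20 = gfy19 * (1 + cagr)
gfy21 = gfy20 * (1 + cagr)
print(f"CAGR ~{cagr:.1%}; GFY20 index {gfy20:.2f}; GFY21 index {gfy21:.2f}")
```

Under that naive assumption, the index would reach roughly 1.84 in GFY2020 and 2.25 in GFY2021; a real forecast would need the time-series methods planned for later posts.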
Since HHS/CMS shows up as the highest $ value of spending on AI in GFY19, we can dive into the details.
The preceding table shows that ‘AI’ or ‘artificial intelligence’ does NOT show up in the CMS award_description field, but ‘analytics’ and ‘business intelligence’ do. Though those concepts are not as cutting-edge as AI, they are foundations in data management and modeling that often precede AI efforts. Below is a quick graph highlighting the jump from GFY17 to GFY18.
Which are the leading contractors doing the analytics and business intelligence work in GFY19?
I’ve added a Python assert statement to verify that nothing was lost in the various grouping and summation steps and that the totals match. assert statements intermixed in your Python code can help you catch problems long before you launch a lengthy set of computations or a long sequence of data transformations. I’ve learned this the hard way. You don’t have to.
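A minimal illustration of that guard, on invented data (the real check compares the groupby totals against the pre-grouping sum in the notebook):

```python
import pandas as pd

# Guard against silently dropping rows (e.g., NaN group keys) during a
# groupby/sum pipeline; data and column names are illustrative.
df = pd.DataFrame({
    "recipient_name": ["ALPHA", "BETA", "ALPHA"],
    "federal_action_obligation": [1.0e6, 2.0e6, 3.0e6],
})

total_before = df["federal_action_obligation"].sum()
by_contractor = df.groupby("recipient_name")["federal_action_obligation"].sum()

# Fails loudly, and early, if the transformation lost or double-counted dollars
assert abs(by_contractor.sum() - total_before) < 1e-6, "obligation totals do not match"
print("totals match:", total_before)
```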
I hope the above inspires you to use this Python Jupyter notebook to start your own exploration of USAspending.gov data and develop your own queries and analytical reports.
In future blog posts, I will explore more analytical and market topics — including time-series forecasting and NLP techniques using USAspending.gov data and mashups with other open data sources.
All of the code listed in the example above can be found here:
MIT Open Source License Copyright 2020 Leif C Ulstrup
About the author:
Leif Ulstrup @lulstrup is the founder of Primehook Technology and an experienced general manager with a proven track record, adapting emerging technology and management innovations to achieve significant business and mission results. His experience spans from the management of large enterprise profit centers to consulting to senior executives faced with critical business and mission transformation challenges.