Premise of this Blog Series: Publicly available “open data” sources and open-source technology analytics can provide information to reduce uncertainty when making critical business investment decisions.
Purpose: Demonstrate to executives how easy it is to access and analyze free, open-data sources and motivate them to expect more insight and context from their team when making crucial go/no-go investment decisions on bids and new offerings.
Scope: US Federal Contracting Market.
In my previous post in this series (#2 Automating the Information Gathering Process), I showed you how to download data using the USAspending.gov API features for a small subset of the database.
In this post, I share code I use that simplifies your periodic download of the updated zipped (*.zip) Government Fiscal Year (GFY) archive files available for download here — https://www.usaspending.gov/download_center/award_data_archive
You can manually download these files or use the code and process below to download, decompress, and organize the archive files.
Per the site, the archive is updated on the 15th of each month:
Typically, the number of changes in the files during a monthly archive posting period declines with the GFY age. That is, the current GFY files are updated frequently, the previous year less often, and so forth.
Depending on the type of analysis you are doing, it is unlikely you will need to keep downloading and decompressing old (3+ years) archive files once they are on your machine though you may miss some changes.
Also, since the archive for the current FY is only updated once per month, you will not have the most up to date obligation transactions from the previous day. If you need the previous day’s obligations through the last archive date information, you should build a custom download via “Award Search/Advanced Search” here — https://www.usaspending.gov/search and then download that file for analysis. Note that the columns and column names may differ from the archive file columns (check that). For instance, a download of Sub-Awards has many fields (columns) unique to that type of award. There is a handy Glossary here -
The daily obligation updates can be very timely for Civilian Agencies; however, it is not unusual for reporting from DoD to be 90-days late (by policy). After the GFY ends on 9/30, most Civilian Agency obligation funding in these download files is close to final with the first month. To analyze DoD obligations, you will not get the full picture until early in the next calendar year.
Let’s Start Downloading the Archive Files
All of the Python code for this example can be found here: https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series/blob/master/Analytics_Edge_Blog_Series__3_Download_a_GFY_github.ipynb
You will need to use the terminal (osx/unix/linux) (or powershell on Windows) application on your computer to get set up but will work with the jupyter application in a web browser such as Safari, Chrome, Explorer, or Firefox.
Once you have the terminal application running, you should navigate to your Documents folder (cd Documents). You will then return to — https://github.com/leifulstrup/USAspending_Medium_Blog_Analytics_Series
and select the green Code button.
The drop-down menu opens…
Select the clipboard icon to the right of the https://github.com/…
Return to the terminal application and enter:
Python For Beginners
Welcome! Are you completely new to programming? If not then we presume you will be looking for information about why…
You will also need to make sure these packages are installed in addition to jupyter:
pip install pandas
pip install requests
You can then launch your jupyter notebook via the command line:
Within the web browser with the jupyter notebook running, you can then navigate to the file you cloned from the github.com repository (“repo”) and open the file. It should look like this:
We can now walk through the cells in the notebook…
The customary first step in Python is to import the packages that we will use in our subsequent steps. In Python, you can import these packages immediately before use or at the start as I have done for ease of maintainability. All of the packages listed are part of the standard packages of Python (“batteries included” philosophy) except for pandas and requests. You may need to pip install both if they are not already on your machine.
The next step captures some vital information about the location of files we will need in later stages.
This step takes you to:
USAspending.gov is the new official source of accessible, searchable and reliable spending data for the U.S…
So you can either manually download each archive for each GFY or use the code I wrote below to handle that for you.
The automated code I wrote follows…
Define some utility functions we will use throughout our code.
We now need to find the latest archive date and enter that by copying and pasting the name of the most recent archive file (mine is for “FY2020_All_Contracts_Full_20200807.zip” — August 7, 2020, yours should be within the last month that you are doing this).
Copy the file name from the page that comes up at : https://www.usaspending.gov/download_center/award_data_archive
Paste that file name in the input prompt that appear in the previous step.
You should see “Filename accepted”, if you copied it correctly (and they have not changed their naming conventions).
I define some helper functions to analyze the archive files. This code depends on the naming convention that the team at USAspending.gov is currently using. Should that change, the code will need to change. Also, writing this as a Python Class might make the approach more robust and easier to maintain should the naming convention change in the future. It works for now.
The next chunk of code looks at what you have already downloaded on your system in the directory structure I am using. If it is your first time to execute the code and you’ve not downloaded any files it will not show anything.
I’ve tested this several times and already have files downloaded in my directory, so it returns this:
From the names of the USAspending.gov GFY archive zip files, the code has successfully parsed that into a column for the GFY, Archive_Date, and the original file name.
The next steps create a folder to move older archives to (“OLD_Files_Delete_If_Not_Needed”) and a folder to store the decompressed zip files from the GFY archives (“Expanded_CSV_Files”). Each GFY Archive will have multiple CSV files after the file is “unzipped”. They have the same name as the archive with a “_1.csv”, “_2.csv”, …, “_n.csv” extension. I create a folder in the original working directory to store and organize them (“Expanded_CSV_Files”).
The next set of Python functions organizes the ZIP and CSV files so that the most recent archive date file is represented, older files are put in the “OLD_FILES_..” folders to be deleted if not needed, and corrupted file names are removed.
The second to final step “Cleanup_Zip_CSV_Directories(download_file_path)” will be used multiple times at later stages to make sure the files are organized. The following message will appear.
The next step constructs a list of URL’s we will use to download the USAspending.gov archive files using the Python requests package.
The following code downloads the USAspending.gov GFY archive files. It first checks to see if the most recent archive file date is more recent than those already stored on the machine. If the download archive date is not newer, than the download is skipped. If it is newer and the file names already exists, it downloads the newer archive files for the current GFY, and the previous two GFY’s under the assumption that the older files that have been downloaded are unlikely to change. You can adjust that using the list slice on line 19.
In my case for the last run on my machine, it skipped the downloads since they were already on my machine. With the output below.
I run this function again to make sure the files are organized.
Since the decompressed files are about 10x (~10GB) the size of the Zip archive files (~1GB), I’ve set this up to selectively decide which GFY you want to decompress. I have FY2018, FY2019, FY2020 listed. You can change that to whatever you want. You will want the current GFY and previous GFY in almost all cases since those will change most frequently.
You can see from the computation in the next cell that the CSV files are ~31GB for just the three most current GFY. I run the clean up function again for safety’s sake.
The next step is the very beginning of the analytics we will dive into with gusto on blog #4 and beyond. These steps below are intended to confirm we can access the CSV files and read them.
We use the Python pandas package (pandas is the Python version of the R programming language with the core concept of a “DataFrame” which is akin to Excel worksheet) to quickly read the first CSV file (~2GB). pandas can quickly and easily digest files much larger than can be ingested in Excel or Googlesheets. We will use it extensively in future blog posts focused on the Analytics and visualization of the data.
The next lines of Python code read the first ten (10) lines of the CSV file with the USAspending.gov transaction level data.
This next line sums the ‘federal_action_obligation” for the ten (10) rows. We will do that type of sum in various pivots of the data in later steps.
There are 282 columns (aka fields) in the data we just read. Here are some of those fields:
A typical GFY has 4 million or more transaction records with 282 fields each, thus the ~10GB per GFY. You can find a data dictionary here: https://www.usaspending.gov/download_center/data_dictionary
Now that we have the files downloaded and have confirmed we can read them, the interesting analytics work can begin (with a bit of data tidying at the start).
My next blog in the series will show you how to use the very powerful Python Dask package to process the files using the multiple processors in your computer (and, as easily, use a network cluster of computers) to analyze the ~100 GB of Federal obligations transaction data going back to GFY 2010.
All of the code listed in the example above can be found here:
MIT Open Source License Copyright 2020 Leif C Ulstrup
About the author:
Leif Ulstrup @lulstrup is the founder of Primehook Technology and an experienced general manager with a proven track record, adapting emerging technology and management innovations to achieve significant business and mission results. His experience spans from the management of large enterprise profit centers to consulting to senior executives faced with critical business and mission transformation challenges.