
SIRAD: Secure Infrastructure for Research with Administrative Data


achillesrasquinha/sirad

 
 


Secure Infrastructure for Research with Administrative Data (SIRAD)

sirad is an integration framework for data from administrative systems. It deidentifies administrative data by removing and replacing personally identifiable information (PII) with a global anonymized identifier, allowing researchers to securely join data on an individual from multiple tables without knowing the individual's identity. It is developed by Research Improving People's Lives (RIPL).
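The idea of joining records on an anonymized identifier can be illustrated with a minimal sketch. This is not sirad's actual algorithm: the HMAC-SHA256 keyed hash, the `anonymize` and `deidentify` helpers, the salt value, and the sample records are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would never be shared or committed.
SALT = b"example-secret-salt"

PII_FIELDS = ("first", "last", "dob")


def anonymize(first, last, dob):
    """Map normalized PII fields to a stable, opaque identifier."""
    key = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(SALT, key.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(rows):
    """Replace the PII columns of each row with the anonymized identifier."""
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k not in PII_FIELDS}
        out.append({"id": anonymize(*(row[f] for f in PII_FIELDS)), **rest})
    return out


# Two made-up tables describing the same person, with no shared key column.
tax = [{"first": "Jane", "last": "Doe", "dob": "1980-01-01", "income": 52000}]
credit = [{"first": "jane ", "last": "DOE", "dob": "1980-01-01", "score": 710}]

tax_d = deidentify(tax)
credit_d = deidentify(credit)

# Join on the identifier alone; no names or dates remain in the output.
scores = {row["id"]: row["score"] for row in credit_d}
joined = [{**row, "score": scores[row["id"]]} for row in tax_d if row["id"] in scores]
```

Because the identifier depends only on the normalized PII and the secret salt, the same person maps to the same identifier in every table, while a reader of the joined output sees neither the name nor the date of birth.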

For a worked example using synthetic data, please see sirad-example.

More detailed documentation of the sirad configuration file and layout file formats is available in the wiki.

To learn more about the motivation for creating this package and its potential uses, please see our articles in Communications of the ACM and Software Impacts:

J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. (2019). Unlocking Data to Improve Public Policy. Communications of the ACM 62(10): 48-53. doi:10.1145/3335150

M. Howison, M. Goggins. (2022). SIRAD: Secure Infrastructure for Research with Administrative Data. Software Impacts 12: 100245. doi:10.1016/j.simpa.2022.100245

Installation

Requires Python 3.7 or later.

To install from PyPI using pip:
pip install sirad

To install a development version from the current directory:
pip install -e .

Running

The package includes a single command-line script, sirad.

sirad supports the following arguments:

  • process - split raw data files into data and PII files
  • research - create a versioned set of research files with a unique anonymous identifier
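The two arguments above could be driven from a script, as in the following sketch. It assumes that sirad is installed and on the PATH, that `process` is run before `research`, and that neither needs extra options; the `run_pipeline` helper and its `runner` parameter are hypothetical names introduced here.

```python
import subprocess


def run_pipeline(runner=subprocess.run):
    """Run `sirad process`, then `sirad research`, stopping on the first failure."""
    commands = [["sirad", "process"], ["sirad", "research"]]
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(cmd, check=True)
    return commands
```

Calling `run_pipeline()` from the directory that holds sirad_config.py would run both steps in sequence. The `runner` parameter exists only so the function can be exercised without invoking the real command.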

Configuration

To set configuration options, create a file called sirad_config.py and place it either in the directory where you run the sirad command or elsewhere on your Python path. See _options in config.py for the complete list of options and their default values.

The following options are available:

  • DATA_SALT: secret salt used for hashing data values. Do not share it. A warning is issued if it is not set. Defaults to None.

  • PII_SALT: secret salt used for hashing PII values. Do not share it. A warning is issued if it is not set. Defaults to None.

  • LAYOUTS: directory that contains layout files. Defaults to layouts/.

  • RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths where the original data, the processed files, and the research files are stored.

  • VERSION: the current version number of the processed and research files.
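As an illustration, a sirad_config.py using the options listed above might look like the sketch below. Only the option names and the `layouts/` default come from this README. The directory names, the environment variable names, and the integer form of VERSION are assumptions; config.py has the actual defaults and accepted values.

```python
import os

# Secret salts, read from the environment so they are not committed to
# version control. The environment variable names are assumptions.
DATA_SALT = os.environ.get("SIRAD_DATA_SALT")
PII_SALT = os.environ.get("SIRAD_PII_SALT")

# Directory that contains the YAML layout files (the documented default).
LAYOUTS = "layouts/"

# Illustrative directory names; choose paths that suit your environment.
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

# Version of the processed and research files (an integer is assumed here).
VERSION = 1
```

Reading the salts from the environment is one way to keep them out of a shared repository. If either variable is unset, the value falls back to None, which matches the documented default and triggers the warning described above.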

Layout files

sirad uses YAML files to define the layout, or structure, of raw data files. Each YAML file describes the columns in the incoming data and how each one should be processed. The wiki documents this YAML format in more detail.

The following file formats are supported:

  • csv - set a different delimiter with the delimiter option
  • fixed width
  • xlsx (xls is not currently supported)

Development

Sample test data is randomly generated using Faker; none of the information identifies real individuals.

  • tax.txt - sample tax return data. Includes first name, last name, date of birth (DOB), and SSN.
  • credit_scores.txt - sample credit score information. Includes first name, last name, and DOB, but no SSN.

Run unit tests as:

python -m unittest discover

Contributors

  • Mark Howison
  • Ted Lawless
  • John Ucles
  • Preston White
  • Marcelle Goggins
