Commit
Merge pull request #18 from usnistgov/develop
Develop
garyhowarth authored Jun 19, 2023
2 parents 38f2f58 + fa08a12 commit da1d692
Showing 25 changed files with 326 additions and 202 deletions.
56 changes: 35 additions & 21 deletions .gitignore
@@ -1,21 +1,35 @@
-# include
-!sdnist/
-!sdnist/test/
-!sdnist/test/report/
-!sdnist/test/report/data/
-!sdnist/test/report/data/na2019_1000.csv
-
-# ignore
-report.json
-**.pyc
-**.DS_Store
-
-.ipynb_checkpoints
-toy_synthetic_data/
-dask-worker-space/
-results/
-build/
-sdnist.egg-info/
-
-**.pkl
-build
+# include
+!sdnist/
+!sdnist/test/
+!sdnist/test/report/
+!sdnist/test/report/data/
+!sdnist/test/report/data/na2019_1000.csv
+
+# ignore
+report.json
+**.pyc
+**.DS_Store
+
+.ipynb_checkpoints
+toy_synthetic_data/
+dask-worker-space/
+results/
+build/
+sdnist.egg-info/
+
+**.pkl
+build
+
+**/.idea/
+**/crc_acceleration_bundle_1.0/
+**/crc_n/
+**/crc_notebooks/
+**/create_data/
+**/data/
+**/diverse_communities_data_excerpts/
+**/meta_reports/
+**/reports/
+**/states_puma_geojson/
+**/venv/
+**/workspace/
+
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -4,7 +4,7 @@ abstract: "SDNist provides benchmark data and a suite of both machine- and human
message: >-
If you use this repository or present information about it publicly, please cite us.
type: software
-version: 2.2
+version: 2.3
doi: 10.18434/mds2-2943
date-released: 2023-4-14
contact:
39 changes: 11 additions & 28 deletions README.md
@@ -1,4 +1,4 @@
-# SDNist v2.2: Deidentified Data Report Tool
+# SDNist v2.3: Deidentified Data Report Tool

## [SDNist is the official software package for engaging in the NIST Collaborative Research Cycle](https://pages.nist.gov/privacy_collaborative_research_cycle)

@@ -37,7 +37,7 @@ Setting Up the SDNIST Report Tool

### Brief Setup Instructions

-SDNist requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend installing v2.2 in a virtual environment. v2.2 can be installed via [Release 2.2](https://github.com/usnistgov/SDNist/releases/tag/v2.2.0) or via the Pypi server: `pip install sdnist` or, if you already have a version installed, `pip install --upgrade sdnist`.
+SDNist requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend installing v2.3 in a virtual environment. v2.3 can be installed via [Release 2.3](https://github.com/usnistgov/SDNist/releases/tag/v2.3.0) or via the Pypi server: `pip install sdnist` or, if you already have a version installed, `pip install --upgrade sdnist`.

The NIST Diverse Community Excerpt data will download on the fly.

@@ -61,13 +61,13 @@ The NIST Diverse Community Excerpt data will download on the fly.
```
-4. In the already-opened terminal or powershell window, execute the following command to create a new Python environment. The sdnist library will be installed in this newly created Python environment:
+4. In the already-opened terminal or powershell window, execute the following command to create a new Python environment. The sdnist library will be installed in this newly created Python environment:
```
c:\\sdnist-project> python -m venv venv
```
-6. The new Python environment will be created in the sdnist-project directory, and the files of the environment should be in the venv directory. To check whether a new Python environment was created successfully, use the following command to list all directories in the sdnist-project directory, and make sure the venv directory exists.
+5. The new Python environment will be created in the sdnist-project directory, and the files of the environment should be in the venv directory. To check whether a new Python environment was created successfully, use the following command to list all directories in the sdnist-project directory, and make sure the venv directory exists.
**MAC OS/Linux:**
```
@@ -78,7 +78,7 @@ The NIST Diverse Community Excerpt data will download on the fly.
c:\\sdnist-project> dir
```
-7. Now activate the Python environment and install the sdnist library into it.
+6. Now activate the Python environment and install the sdnist library into it.
**MAC OS/Linux:**
```
@@ -107,27 +107,12 @@ The NIST Diverse Community Excerpt data will download on the fly.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope LocalMachine
```
-8. Per step 5 above, the sdnist-2.2.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
-   **MAC OS/Linux:**
-   ```
-   (venv) sdnist-project> ls
-   ```
-   **Windows:**
-   ```
-   (venv) c:\\sdnist-project> dir
-   ```
-   The sdnist-2.2.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
-9. Install sdnist Python library:
+7. Install sdnist Python library:
```
(venv) c:\\sdnist-project> pip install sdnist
```
-10. Installation is successful if executing the following command outputs a help menu for the sdnist.report package:
+8. Installation is successful if executing the following command outputs a help menu for the sdnist.report package:
```
(venv) c:\\sdnist-project> python -m sdnist.report -h
```
@@ -162,8 +147,7 @@ The NIST Diverse Community Excerpt data will download on the fly.
NATIONAL national2019
```
-11. These instructions install sdnist into a virtual environment. The virtual environment must be activated (step 9) each time a new terminal window is used with sdnist.
+9. These instructions install sdnist into a virtual environment. The virtual environment must be activated (step 9) each time a new terminal window is used with sdnist.
Generate Data Quality Report
@@ -260,7 +244,7 @@ Setup Data for SDNIST Report Tool
4. You can download the toy deidentified datasets from Github [Sdnist Toy Deidentified Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_dataset directory to the sdnist-project directory.
-5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
+5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.3.0/diverse_communities_data_excerpts.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
@@ -287,7 +271,7 @@ by the sdnist.report package to generate a data quality report.
Download Data Manually
----------------------
-1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip).
+1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.3.0/diverse_communities_data_excerpts.zip).
3. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
4. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.
@@ -305,5 +289,4 @@ Credits
- [Christine Task](mailto:[email protected]) - Project technical lead - [email protected]
- [Karan Bhagat](https://github.com/kbtriangulum) - Contributor
- [David Lee](https://www.linkedin.com/in/david-lee-13872922/) - Documentation
-- [Gary Howarth](https://www.nist.gov/people/gary-howarth) - Project PI - [email protected]
+- [Gary Howarth](https://www.nist.gov/people/gary-howarth) - Project PI - [email protected]
7 changes: 4 additions & 3 deletions nist diverse communities data excerpts/data_dictionary.json
@@ -127,12 +127,13 @@
},
"INDP": {
"description": "Industry codes",
"details": "There are a total of 271 possible codes for INDP, 269 of these codes appear in the Diverse Community Data Excerpts (233 in MA, 264 in Texas and National)",
"link": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2019.pdf"
},
"INDP_CAT": {
"description": "Industry categories",
"values": {
"N": "N/A (less than 16 years old/NILF who last worked more than 5 years ago or never worked)",
"N": "N/A (less than 16 years old, or last worked more than 5 years ago, or never worked)",
"0": "AGR: Agriculture, Forestry, Fishing and Hunting",
"1": "EXT: Mining, Quarrying, and Oil and Gas Extraction",
"2": "UTL: Utilities",
@@ -160,7 +161,7 @@
"N": "N/A (less than 3 years old)",
"1": "No schooling completed",
"2": "Nursery school, Preschool, or Kindergarten",
"3": "Grade 4 to grade 8",
"3": "Grade 1 to grade 8",
"4": "Grade 9 to grade 12, no diploma",
"5": "High School diploma",
"6": "GED",
@@ -181,7 +182,7 @@
}
},
"PINCP_DECILE": {
"description": "Person's total income in 10-percentile bins",
"description": "Person's total income rank (with respect to their state) discretized into 10% bins.",
"values": {
"N": "N/A (less than 15 years old",
"9": "90th percentile",
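The revised PINCP_DECILE description says the value is a person's income rank within their state, discretized into 10% bins. A minimal pandas sketch of that idea (hypothetical column names and data, not the actual NIST preprocessing):

```python
import pandas as pd

# Hypothetical illustration: rank incomes within each state, then map the
# percentile rank into ten bins where 9 is the top decile.
df = pd.DataFrame({
    "STATE": ["MA", "MA", "MA", "TX", "TX"],
    "PINCP": [12_000, 45_000, 90_000, 30_000, 70_000],
})
pct_rank = df.groupby("STATE")["PINCP"].rank(pct=True)
df["PINCP_DECILE"] = (pct_rank * 10).astype(int).clip(upper=9)
print(df)
```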
2 changes: 1 addition & 1 deletion sdnist/load.py
@@ -82,7 +82,7 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
if not name.exists():
print(f"{name} does not exist.")
zip_path = Path(root.parent, 'data.zip')
version = "2.2.0"
version = "2.3.0"

version_v = f"v{version}"
sdnist_version = DEFAULT_DATASET
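With the version bump, check_exists will fetch v2.3.0 release assets when data is missing. The README's manual-download links in this same commit suggest the release-asset URL shape; a sketch under that assumption (the actual URL construction in load.py is not shown in this hunk):

```python
# Assumed URL shape, based on the release links elsewhere in this commit;
# the real download logic in sdnist/load.py may differ.
version = "2.3.0"
asset = "diverse_communities_data_excerpts.zip"
url = f"https://github.com/usnistgov/SDNist/releases/download/v{version}/{asset}"
print(url)
```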
6 changes: 3 additions & 3 deletions sdnist/metrics/inconsistency.py
@@ -268,7 +268,7 @@ def compute(self):
'inconsistency_features': ic_data[2],
'inconsistency_violations': int(ic_data[3].split(' ')[0]),
'inconsistent_data_indexes': ic_dict[i[NAME]],
-                 'inconsistent_record_example': relative_path(row_path)}
+                 'inconsistent_record_example': relative_path(row_path, level=3)}
)

# ------- Compute work-based Inconsistencies------------
@@ -298,7 +298,7 @@ def compute(self):
'inconsistency_features': ic_data[2],
'inconsistency_violations': int(ic_data[3].split(' ')[0]),
'inconsistent_data_indexes': ic_dict[i[NAME]],
-                 'inconsistent_record_example': relative_path(row_path)}
+                 'inconsistent_record_example': relative_path(row_path, level=3)}
)

# ------- Compute housing-based Inconsistencies------------
@@ -328,7 +328,7 @@ def compute(self):
'inconsistency_features': ic_data[2],
'inconsistency_violations': int(ic_data[3].split(' ')[0]),
'inconsistent_data_indexes': ic_dict[i[NAME]],
-                 'inconsistent_record_example': relative_path(row_path)}
+                 'inconsistent_record_example': relative_path(row_path, level=3)}
)

# -------- Compute overall stats---------------------
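All three hunks above (and the regression.py hunks below) pass level=3 to relative_path, so example-record paths are written relative to a higher-level report directory rather than as absolute paths. A sketch of what such a helper plausibly does (an assumption; the real sdnist.utils implementation may differ):

```python
from pathlib import Path

def relative_path(path: Path, level: int = 2) -> str:
    # Keep only the last `level` path components, so report JSON references
    # files relative to the report tree instead of by absolute path.
    return str(Path(*Path(path).parts[-level:]))

print(relative_path(Path("/tmp/reports/run_1/inconsistencies/example_row.csv"), level=3))
# run_1/inconsistencies/example_row.csv
```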
24 changes: 15 additions & 9 deletions sdnist/metrics/pca.py
@@ -48,11 +48,10 @@ def compute_pca(self):
t_pca = PCA(n_components=cc)

tdf_v = self.tar.values
-        sdf = self.syn.apply(lambda x: x - x.mean())
-        sdf_v = sdf.values
-
-        tdf_v = StandardScaler().fit_transform(tdf_v)
-        sdf_v = StandardScaler().fit_transform(sdf_v)
+        sdf_v = self.syn.values
+        scaler = StandardScaler().fit(tdf_v)
+        sdf_v = scaler.transform(sdf_v)
+        tdf_v = scaler.transform(tdf_v)

t_pc = t_pca.fit_transform(tdf_v)

@@ -62,7 +61,7 @@
self.t_comp_data = []
for i, comp in enumerate(t_pca.components_):
qc = [[n, round(v, 2)] for n, v in zip(self.tar.columns.tolist(), comp)]
-            qc = sorted(qc, key=lambda x: x[1], reverse=True)
+            qc = sorted(qc, key=lambda x: abs(x[1]), reverse=True)
qc = [f'{v[0]} ({v[1]})' for v in qc]
self.t_comp_data.append({"Principal Component": f"PC-{i}",
"Features Contribution: "
@@ -88,7 +87,9 @@
for c in self.t_pdf.columns:
self.t_pdf_s[c] = min_max_scaling(self.t_pdf[c])
for c in self.s_pdf.columns:
-            self.s_pdf_s[c] = min_max_scaling(self.s_pdf[c])
+            self.s_pdf_s[c] = min_max_scaling(self.s_pdf[c],
+                                              self.t_pdf[c].min(),
+                                              self.t_pdf[c].max())

def plot(self, output_directory: Path) -> Dict[str, any]:
s = time.time()
@@ -152,8 +153,13 @@ def plot(self, output_directory: Path) -> Dict[str, any]:
return plot_paths


-def min_max_scaling(series):
-    return (series - series.min()) / (series.max() - series.min())
+def min_max_scaling(series, min_val=None, max_val=None):
+    if min_val is None:
+        min_val = series.min()
+    if max_val is None:
+        max_val = series.max()
+
+    return (series - min_val) / (max_val - min_val)


def plot_all_components_pairs(title: str,
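The compute_pca change fits a single StandardScaler on the target data and applies it to both datasets, and component loadings are now sorted by absolute contribution so strong negative loadings rank alongside strong positive ones. Scaling each dataset independently (the old behavior) would erase genuine location and scale differences between target and deidentified data; the min_max_scaling update reuses the target column's min/max for the same reason. A small self-contained illustration of the shared-scaler pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
target = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
deid = rng.normal(loc=4.0, scale=2.5, size=(200, 3))

# Fit on the target only, then transform both, mirroring compute_pca() above.
scaler = StandardScaler().fit(target)
target_s = scaler.transform(target)
deid_s = scaler.transform(deid)

print(target_s.mean(axis=0).round(2))  # ~0 by construction
print(deid_s.mean(axis=0).round(2))    # non-zero: the real shift survives scaling
```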
7 changes: 4 additions & 3 deletions sdnist/metrics/regression.py
@@ -216,11 +216,12 @@ def plots(self) -> List[Path]:
self.report_data = {
"target_counts": relative_path(save_data_frame(self.tcm,
self.o_path,
-                                                               'target_counts')),
+                                                               'target_counts'), level=3),
"target_deidentified_counts_difference": relative_path(save_data_frame(self.diff,
self.o_path,
"target_deidentified_counts_difference")),
"target_deidentified_difference_plot": relative_path(file_path),
"target_deidentified_counts_difference"),
level=3),
"target_deidentified_difference_plot": relative_path(file_path, level=3),
"target_regression_slope_and_intercept": (self.t_slope, self.t_intercept),
"deidentified_regression_slope_and_intercept": (self.s_slope, self.s_intercept)
}
7 changes: 4 additions & 3 deletions sdnist/metrics/unique_exact_matches.py
@@ -6,6 +6,7 @@
from sdnist.report.dataset import Dataset
import sdnist.utils as u


def unique_exact_matches(target_data: pd.DataFrame, deidentified_data: pd.DataFrame):
td, dd = target_data, deidentified_data
cols = td.columns.tolist()
@@ -18,21 +19,21 @@ def unique_exact_matches(target_data: pd.DataFrame, deidentified_data: pd.DataFr
perc_t_unique_records = round(t_unique_records/td.shape[0] * 100, 2)

# Keep only one copy of each duplicate row in the deidentified data
-    # and also save the count of each row in the deidentified data
-    dd= dd.drop_duplicates(subset=cols)
+    dd = dd.drop_duplicates(subset=cols)

merged = u_td.merge(dd, how='inner', on=cols)

# number of unique target records that exactly match in deidentified data
t_rec_matched = merged.shape[0]

# percent of unique target records that exactly match in deidentified data
-    perc_t_rec_matched = t_rec_matched/td.shape[0] * 100
+    perc_t_rec_matched = t_rec_matched/t_unique_records * 100

perc_t_rec_matched = round(perc_t_rec_matched, 2)

return t_rec_matched, perc_t_rec_matched, t_unique_records, perc_t_unique_records


if __name__ == '__main__':
THIS_DIR = Path(__file__).parent
s_path = Path(THIS_DIR, '..', '..',
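The key fix above changes the percentage denominator from all target records to unique target records only, so the "percent of unique target records matched" can actually reach 100%. A toy check of the corrected arithmetic (assuming pandas):

```python
import pandas as pd

target = pd.DataFrame({"A": [1, 1, 2, 3], "B": ["x", "x", "y", "z"]})
deid = pd.DataFrame({"A": [2, 2, 9], "B": ["y", "y", "q"]})

cols = target.columns.tolist()
# Records that appear exactly once in the target data
u_td = target.drop_duplicates(subset=cols, keep=False)
t_unique_records = u_td.shape[0]  # 2: (2, y) and (3, z)

# One copy of each deidentified row, then exact-match against unique targets
dd = deid.drop_duplicates(subset=cols)
t_rec_matched = u_td.merge(dd, how="inner", on=cols).shape[0]  # 1: (2, y)

# Corrected denominator: unique target records, not all target records,
# so matching every unique record yields 100% rather than an undercount.
perc_t_rec_matched = round(t_rec_matched / t_unique_records * 100, 2)
print(perc_t_rec_matched)  # 50.0
```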
6 changes: 4 additions & 2 deletions sdnist/report/__main__.py
@@ -19,13 +19,14 @@

from sdnist.load import DEFAULT_DATASET


def run(synthetic_filepath: Path,
output_directory: Path = REPORTS_DIR,
dataset_name: TestDatasetName = TestDatasetName.NONE,
data_root: Path = Path(DEFAULT_DATASET),
labels_dict: Optional[Dict] = None,
download: bool = False,
-        test_mode: bool = False):
+        show_report: bool = True):
outfile = Path(output_directory, 'report.json')
ui_data = ReportUIData(output_directory=output_directory)
report_data = ReportData(output_directory=output_directory)
@@ -60,10 +61,11 @@ def run(synthetic_filepath: Path,
ui_data = json.load(f)
log.end_msg()
# Generate Report
-    generate(ui_data, output_directory, test_mode)
+    generate(ui_data, output_directory, show_report)
log.msg(f'Reports available at path: {output_directory}', level=0, timed=False,
msg_type='important')


def setup():
bundled_datasets = {"MA": TestDatasetName.ma2019,
"TX": TestDatasetName.tx2019,
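With test_mode replaced by show_report (note the flipped default: reports now open by default), a programmatic call might look like the following sketch. File paths are hypothetical, and the import locations are inferred from this diff rather than confirmed by it:

```python
from pathlib import Path

from sdnist.load import TestDatasetName  # assumed home of TestDatasetName
from sdnist.report.__main__ import run

# Hypothetical invocation: build a report for a toy deidentified dataset
# without opening the generated report (show_report=False replaces the
# old test_mode flag, with the opposite default).
run(synthetic_filepath=Path("toy_deidentified_data/syn_tx.csv"),
    output_directory=Path("reports"),
    dataset_name=TestDatasetName.tx2019,
    show_report=False)
```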