Merge pull request #14 from usnistgov/develop
Develop to main
garyhowarth authored Apr 14, 2023
2 parents f8e5976 + 9c56548 commit 6f04c34
Showing 22 changed files with 373 additions and 134 deletions.
28 changes: 15 additions & 13 deletions README.md
@@ -1,8 +1,8 @@
-# SDNist v2.1: Deidentified Data Report Tool
+# SDNist v2.2: Deidentified Data Report Tool

## [SDNist is the offical software package for engaging in the NIST Collaborative Research Cycle](https://pages.nist.gov/privacy_collaborative_research_cycle)

-Welcome! SDNist v2.1 is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partioned, limited feature data set.
+Welcome! SDNist is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partioned, limited feature data set.

The deidentified data report evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deidentified dataset enumerated and illustrated for each utility and privacy metric.

@@ -33,8 +33,10 @@ Setting Up the SDNIST Report Tool

### Brief Setup Instructions

-SDNist v2.1 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v2.1 in a virtual environment. v2.1 can be installed via [Release 2.1](https://github.com/usnistgov/SDNist/releases/tag/v2.1.1). The NIST Diverse Community Exceprt data will download on the fly.
-
+SDNist requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v2.2 in a virtual environment. v2.2 can be installed via [Release 2.2](https://github.com/usnistgov/SDNist/releases/tag/v2.2.0). The NIST Diverse Community Exceprt data will download on the fly.
+```
+pip install sdnist
+```

### Detailed Setup Instructions

@@ -53,10 +55,10 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
c:\\sdnist-project>
```
-4. Download the sdnist installable wheel (sdnist-2.1.1-py3-none-any.whl) from the Github: [Release 2.1](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/sdnist-2.1.1-py3-none-any.whl).
+4. Download the sdnist installable wheel (sdnist-2.2.0-py3-none-any.whl) from the Github: [Release 2.2](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/sdnist-2.2.0-py3-none-any.whl).
-5. Move the downloaded sdnist-2.1.1-py3-none-any.whl file to the sdnist-project directory.
+5. Move the downloaded sdnist-2.2.0-py3-none-any.whl file to the sdnist-project directory.
6. Using the terminal on Mac/Linux or powershell on Windows, navigate to the sdnist-project directory.
@@ -109,7 +111,7 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
```
-10. Per step 5 above, the sdnist-2.1.1-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
+10. Per step 5 above, the sdnist-2.2.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
**MAC OS/Linux:**
```
@@ -119,12 +121,12 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
```
(venv) c:\\sdnist-project> dir
```
-The sdnist-2.1.1-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
+The sdnist-2.2.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
11. Install sdnist Python library:
```
-(venv) c:\\sdnist-project> pip install sdnist-2.1.1-py3-none-any.whl
+(venv) c:\\sdnist-project> pip install sdnist-2.2.0-py3-none-any.whl
```
@@ -202,7 +204,7 @@ Generate Data Quality Report
```
(venv) c:\\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL
```
-6. SDNist 2.1 allow users to add labels for the deidentified dataset used to generate report:
+6. Starting from version 2.1, SDNist allow users to add labels for the deidentified dataset used to generate report:
* To add single string label to the report, use command line option **--labels** followed by a string as given in the following example command:
```
(venv) c:\\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL --labels used_epsilon_1
@@ -243,7 +245,7 @@ Setup Data for SDNIST Report Tool
(venv) c:\\sdnist-project> python -m sdnist.report syn_tx.csv TX
Downloading all SDNist datasets from:
-https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_data_excerpts.zip ...
+https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip ...
...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
```
@@ -261,7 +263,7 @@ Setup Data for SDNIST Report Tool
4. You can download the toy deidentified datasets from Github [Sdnist Toy Deidentified Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_dataset directory to the sdnist-project directory.
-5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
+5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
@@ -288,7 +290,7 @@ by the sdnist.report package to generate a data quality report.
Download Data Manually
----------------------
-1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_data_excerpts.zip).
+1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip).
3. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
4. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.
3 changes: 2 additions & 1 deletion sdnist/load.py
@@ -82,7 +82,7 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
if not name.exists():
print(f"{name} does not exist.")
zip_path = Path(root.parent, 'data.zip')
-version = "2.1.1"
+version = "2.2.0"

version_v = f"v{version}"
sdnist_version = DEFAULT_DATASET
@@ -124,6 +124,7 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
print()
copy_from_path = str(Path(extract_path, sdnist_version))
copy_to_path = str(Path(root))
+print(f"Copying {copy_from_path} to {copy_to_path} ...")
copy_tree(copy_from_path, copy_to_path)
shutil.rmtree(extract_path)
else:
3 changes: 3 additions & 0 deletions sdnist/metrics/apparent_match_dist.py
@@ -9,6 +9,9 @@


def cellchange(df1, df2, quasi, exclude_cols):
+# use drop duplicates with keep argument as False,
+# to retain only those records that occur only
+# once in the data.
uniques1 = df1.drop_duplicates(subset=quasi, keep=False)
uniques2 = df2.drop_duplicates(subset=quasi, keep=False)
matcheduniq = uniques1.merge(uniques2, how='inner', on=quasi)
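
The comment added in this hunk describes how `keep=False` changes `drop_duplicates`. A minimal sketch of the difference, using hypothetical toy data (the column names and values below are illustrative and do not come from SDNist):

```python
import pandas as pd

# Hypothetical toy records; "age" and "sex" play the role of quasi-identifiers.
df = pd.DataFrame({
    "age": [30, 30, 41, 52],
    "sex": ["F", "F", "M", "F"],
    "income": [10, 20, 30, 40],
})
quasi = ["age", "sex"]

# keep=False drops every row whose quasi-identifier combination is repeated,
# so only records that occur exactly once remain.
singletons = df.drop_duplicates(subset=quasi, keep=False)

# The default (keep="first") instead retains one representative per combination.
first_of_each = df.drop_duplicates(subset=quasi)

print(singletons.index.tolist())     # [2, 3]
print(first_of_each.index.tolist())  # [0, 2, 3]
```

Rows 0 and 1 share the same (age, sex) pair, so `keep=False` removes both, while the default keeps row 0.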
6 changes: 4 additions & 2 deletions sdnist/metrics/pca.py
@@ -145,8 +145,10 @@ def plot(self, output_directory: Path) -> Dict[str, any]:
plot_paths[strs.HIGHLIGHTED][(h_type, h_name, h_caption)] = \
[h_tar_path, h_syn_path]

-e = time.time() - s
-# print('PCA TOOK TIME: ', e)
+# clear temporary data from report data
+remove_path(t_cp_o_path)
+remove_path(s_cp_o_path)
+
return plot_paths


12 changes: 6 additions & 6 deletions sdnist/metrics/regression.py
@@ -152,13 +152,13 @@ def compute(self):
self.diff = self.tcm - self.scm

# calculate regression lines for target and synthetic data
-if self.ts.shape[0] > 1:
-self.t_reg = stats.linregress(self.ts[xc], self.ts[yc])
+if self.ts.shape[0] > 1 and len(self.ts[xc].unique()) > 1:
+self.t_reg = stats.linregress(self.ts[xc].astype(float), self.ts[yc].astype(float))
self.t_slope = round(self.t_reg.slope, 2)
self.t_intercept = round(self.t_reg.intercept, 2)

-if self.ss.shape[0] > 1:
-self.s_reg = stats.linregress(self.ss[xc], self.ss[yc])
+if self.ss.shape[0] > 1 and len(self.ss[xc].unique()) > 1:
+self.s_reg = stats.linregress(self.ss[xc].astype(float), self.ss[yc].astype(float))
self.s_slope = round(self.s_reg.slope, 2)
self.s_intercept = round(self.s_reg.intercept, 2)

@@ -186,10 +186,10 @@ def plots(self) -> List[Path]:

r_tx_df = pd.DataFrame([[_ + 0.5, self.t_intercept + self.t_slope * (_ + 0.5)]
for _ in tx], columns=['x', 'y'])
-r_tx_df = r_tx_df[r_tx_df['y'] >= 0]
+r_tx_df = r_tx_df[(r_tx_df['y'] >= 0) & (r_tx_df['y'] <= 10)]
r_sx_df = pd.DataFrame([[_ + 0.5, self.s_intercept + self.s_slope * (_ + 0.5)]
for _ in tx], columns=['x', 'y'])
-r_sx_df = r_sx_df[r_sx_df['y'] >= 0]
+r_sx_df = r_sx_df[(r_sx_df['y'] >= 0) & (r_sx_df['y'] <= 10)]

ax0.plot(r_tx_df['x'],
r_tx_df['y'], color='red', label='Target')
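
The two regression.py hunks add a guard before `stats.linregress` (more than one row and more than one distinct x value, with both columns cast to float) and restrict the plotted regression-line points to 0 <= y <= 10. A rough standalone sketch of both ideas follows; `safe_linregress` and the toy frames are hypothetical names for illustration and are not part of SDNist:

```python
import pandas as pd
from scipy import stats


def safe_linregress(df, xc, yc):
    """Return (slope, intercept) rounded to 2 places, or None if no line can be fit."""
    # A line is undefined with fewer than two rows or when every x is identical.
    if df.shape[0] > 1 and len(df[xc].unique()) > 1:
        reg = stats.linregress(df[xc].astype(float), df[yc].astype(float))
        return round(reg.slope, 2), round(reg.intercept, 2)
    return None


ok = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
flat = pd.DataFrame({"x": [5, 5, 5], "y": [1, 2, 3]})

fit_ok = safe_linregress(ok, "x", "y")      # perfectly linear data: y = 2x
fit_flat = safe_linregress(flat, "x", "y")  # constant x: guard skips the fit

# Keep only the regression-line points inside the 0..10 range, as in plots().
line = pd.DataFrame({"x": [0.5, 1.5, 2.5], "y": [-1.0, 5.0, 12.0]})
visible = line[(line["y"] >= 0) & (line["y"] <= 10)]
```

The parentheses around each comparison matter: `&` binds more tightly than `>=` and `<=` in Python, so the unparenthesized form would not combine the two masks as intended.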
44 changes: 44 additions & 0 deletions sdnist/metrics/unique_exact_matches.py
@@ -0,0 +1,44 @@
import pandas as pd
from pathlib import Path

from sdnist.load import TestDatasetName

from sdnist.report.dataset import Dataset
import sdnist.utils as u

def unique_exact_matches(target_data: pd.DataFrame, deidentified_data: pd.DataFrame):
    td, dd = target_data, deidentified_data
    cols = td.columns.tolist()

    # select rows that are unique in the target data
    u_td = td.loc[td.groupby(by=cols)[cols[0]].transform('count') == 1, :]

    # target unique records
    t_unique_records = u_td.shape[0]
    perc_t_unique_records = round(t_unique_records/td.shape[0] * 100, 2)

    # Keep only one copy of each duplicate row in the deidentified data
    # and also save the count of each row in the deidentified data
    dd= dd.drop_duplicates(subset=cols)

    merged = u_td.merge(dd, how='inner', on=cols)

    # number of unique target records that exactly match in deidentified data
    t_rec_matched = merged.shape[0]

    # percent of unique target records that exactly match in deidentified data
    perc_t_rec_matched = t_rec_matched/td.shape[0] * 100

    perc_t_rec_matched = round(perc_t_rec_matched, 2)

    return t_rec_matched, perc_t_rec_matched, t_unique_records, perc_t_unique_records

if __name__ == '__main__':
    THIS_DIR = Path(__file__).parent
    s_path = Path(THIS_DIR, '..', '..',
                  'toy_synthetic_data/syn/sdcmicro/k_ano_k_6.csv')
    log = u.SimpleLogger()
    dataset_name = TestDatasetName.national2019
    d = Dataset(s_path, log, dataset_name)

    unique_exact_matches(d.c_target_data, d.c_synthetic_data)
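
To make the four return values of the new metric concrete, here is a small pandas-only sketch on hypothetical toy data. It does not import sdnist, and it swaps the `groupby`/`transform('count')` selection for `drop_duplicates(keep=False)`, which should select the same rows when there are no missing values. As in the file above, both percentages are taken over the total number of target rows.

```python
import pandas as pd


def unique_exact_matches_sketch(target, deid):
    cols = target.columns.tolist()
    # Target rows whose full combination of values occurs exactly once.
    unique_target = target.drop_duplicates(subset=cols, keep=False)
    # One copy of each distinct deidentified row, so each match counts once.
    deid_distinct = deid.drop_duplicates(subset=cols)
    matched = unique_target.merge(deid_distinct, how="inner", on=cols)

    total = target.shape[0]
    n_unique = unique_target.shape[0]
    n_matched = matched.shape[0]
    return (n_matched, round(n_matched / total * 100, 2),
            n_unique, round(n_unique / total * 100, 2))


# Hypothetical data: (30, F) is duplicated in the target; (41, M) and (52, F) are unique.
target = pd.DataFrame({"age": [30, 30, 41, 52], "sex": ["F", "F", "M", "F"]})
deid = pd.DataFrame({"age": [41, 41, 60], "sex": ["M", "M", "F"]})

result = unique_exact_matches_sketch(target, deid)
print(result)  # (1, 25.0, 2, 50.0)
```

Two of the four target rows are unique (50%), and one of them, (41, M), also appears in the deidentified data (25% of all target rows).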
28 changes: 15 additions & 13 deletions sdnist/report/README.md
@@ -1,8 +1,8 @@
-# SDNist v2.1: Deidentified Data Report Tool
+# SDNist v2.2: Deidentified Data Report Tool

## [SDNist is the offical software package for engaging in the NIST Collaborative Research Cycle](https://pages.nist.gov/privacy_collaborative_research_cycle)

-Welcome! SDNist v2.1 is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partioned, limited feature data set.
+Welcome! SDNist is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partioned, limited feature data set.

The deidentified data report evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deidentified dataset enumerated and illustrated for each utility and privacy metric.

@@ -27,8 +27,10 @@ Setting Up the SDNIST Report Tool

### Brief Setup Instructions

-SDNist v2.1 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v2.1 in a virtual environment. v2.1 can be installed via [Release 2.1](https://github.com/usnistgov/SDNist/releases/tag/v2.1.1). The NIST Diverse Community Exceprt data will download on the fly.
-
+SDNist requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v2.2 in a virtual environment. v2.2 can be installed via [Release 2.2](https://github.com/usnistgov/SDNist/releases/tag/v2.2.0). The NIST Diverse Community Exceprt data will download on the fly.
+```
+pip install sdnist
+```

### Detailed Setup Instructions

@@ -47,10 +49,10 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
c:\\sdnist-project>
```
-4. Download the sdnist installable wheel (sdnist-2.1.1-py3-none-any.whl) from the Github: [Release 2.1](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/sdnist-2.1.1-py3-none-any.whl).
+4. Download the sdnist installable wheel (sdnist-2.2.0-py3-none-any.whl) from the Github: [Release 2.2](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/sdnist-2.2.0-py3-none-any.whl).
-5. Move the downloaded sdnist-2.1.1-py3-none-any.whl file to the sdnist-project directory.
+5. Move the downloaded sdnist-2.2.0-py3-none-any.whl file to the sdnist-project directory.
6. Using the terminal on Mac/Linux or powershell on Windows, navigate to the sdnist-project directory.
@@ -103,7 +105,7 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
```
-10. Per step 5 above, the sdnist-2.1.1-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
+10. Per step 5 above, the sdnist-2.2.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
**MAC OS/Linux:**
```
@@ -113,12 +115,12 @@ SDNist v2.1 requires Python version 3.7 or greater. If you have installed a prev
```
(venv) c:\\sdnist-project> dir
```
-The sdnist-2.1.1-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
+The sdnist-2.2.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
11. Install sdnist Python library:
```
-(venv) c:\\sdnist-project> pip install sdnist-2.1.1-py3-none-any.whl
+(venv) c:\\sdnist-project> pip install sdnist-2.2.0-py3-none-any.whl
```
@@ -196,7 +198,7 @@ Generate Data Quality Report
```
(venv) c:\\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL
```
-6. SDNist 2.1 allow users to add labels for the deidentified dataset used to generate report:
+6. Starting from version 2.1, SDNist allow users to add labels for the deidentified dataset used to generate report:
* To add single string label to the report, use command line option **--labels** followed by a string as given in the following example command:
```
(venv) c:\\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL --labels used_epsilon_1
@@ -237,7 +239,7 @@ Setup Data for SDNIST Report Tool
(venv) c:\\sdnist-project> python -m sdnist.report syn_tx.csv TX
Downloading all SDNist datasets from:
-https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_data_excerpts.zip ...
+https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip ...
...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
```
@@ -255,7 +257,7 @@ Setup Data for SDNIST Report Tool
4. You can download the toy deidentified datasets from Github [Sdnist Toy Deidentified Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_dataset directory to the sdnist-project directory.
-5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
+5. Each toy deidentified dataset file is generated using the [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
@@ -282,7 +284,7 @@ by the sdnist.report package to generate a data quality report.
Download Data Manually
----------------------
-1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.1.1/diverse_communities_data_excerpts.zip).
+1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Communities Data Excerpts](https://github.com/usnistgov/SDNist/releases/download/v2.2.0/diverse_communities_data_excerpts.zip).
3. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
4. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.