README: Ground Truth Labeling

Files

ground_truth_runner.py

Requires label_ground_truth.py.

Iterates over the databases of past crawls in the /crawl folder and, for each crawl, labels positive, negative, and unknown cookie matching instances by calling label_ground_truth.py.

To run ground_truth_runner.py: ground_truth_runner.py [-h] --par | --no-par [--progress-bar | --no-progress-bar] [-v {0,1,2}]

Typical usage: ground_truth_runner.py --par
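
The usage string above has the shape that Python's argparse prints for boolean on/off flags. As a rough illustration only, a parser along the following lines would accept the same command lines. This is a sketch and not the script's actual code: the function name build_parser, the help strings, and the default values are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts the documented usage string."""
    parser = argparse.ArgumentParser(prog="ground_truth_runner.py")
    # Required on/off switch: the caller must pass --par or --no-par.
    parser.add_argument(
        "--par",
        action=argparse.BooleanOptionalAction,
        required=True,
        help="placeholder help text (assumption)",
    )
    # Optional on/off switch; the default of True is an assumption.
    parser.add_argument(
        "--progress-bar",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="placeholder help text (assumption)",
    )
    # Verbosity restricted to 0, 1, or 2; the default of 0 is an assumption.
    parser.add_argument("-v", type=int, choices=[0, 1, 2], default=0)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse.BooleanOptionalAction requires Python 3.9 or later.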

label_ground_truth.py

Iterates over the given crawl database. Labels each individual redirect row (a graph edge) as positive, negative, or unknown for cookie matching. Returns these labels and their respective domains to ground_truth_runner.py.

This file is not intended to be used directly.

Papadopoulos Cookie Synchronization Method

Paper

  1. Extract all browser cookies that were set, via the OpenWPM javascript_cookies table
    • Filter out session cookies (cookies without expiration date)
    • Parse cookie values using common delimiters (:, &)
  2. Detect possible cookie_id sharing events in the http_redirects table
    • Identify ID-looking strings (more than 10 alphanumeric characters) in:
      • requested redirect parameters
      • requested redirect path
      • requested redirect location header
    • If an ID is seen for the first time, store it in a hash table along with the URL's domain. If the ID has been seen before, consider it a shared ID, and consider the requests carrying it ID-sharing requests.
    • Use entity_map.json to determine which organization owns each domain, in order to distinguish intentional ID leaking from internal ID sharing within one organization (this avoids false positives).
  3. A detected shared ID is considered a cookie sync if the shared ID matches an extracted browser cookie from the first step.
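
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, cookies arrive as (value, expiry) pairs, redirects arrive as (url, location) pairs, entity_map is assumed to be a flat dict from domain to organization (the real layout of entity_map.json may differ), and the URL's hostname stands in for its domain.

```python
import re
from urllib.parse import urlsplit

# "ID-looking" string: more than 10 alphanumeric characters.
ID_PATTERN = re.compile(r"[A-Za-z0-9]{11,}")
# Common delimiters used to split cookie values.
DELIMITERS = re.compile(r"[:&]")


def cookie_ids(cookies):
    """Step 1: collect value parts from persistent (non-session) cookies."""
    ids = set()
    for value, expiry in cookies:
        if expiry is None:  # no expiration date means a session cookie
            continue
        ids.update(part for part in DELIMITERS.split(value) if part)
    return ids


def shared_ids(redirects, entity_map):
    """Step 2: find IDs carried by requests to different organizations."""
    first_seen = {}  # ID -> domain of the first request carrying it
    shared = {}      # shared ID -> set of domains that carried it
    for url, location in redirects:
        parts = urlsplit(url)
        domain = parts.hostname or ""
        # Search the redirect path, parameters, and location header.
        haystack = " ".join((parts.path, parts.query, location or ""))
        for candidate in set(ID_PATTERN.findall(haystack)):
            if candidate not in first_seen:
                first_seen[candidate] = domain
                continue
            origin = first_seen[candidate]
            # Same organization: internal sharing, so skip it.
            if entity_map.get(origin, origin) == entity_map.get(domain, domain):
                continue
            shared.setdefault(candidate, {origin}).add(domain)
    return shared


def cookie_syncs(cookies, redirects, entity_map):
    """Step 3: keep shared IDs that match an extracted cookie value."""
    known = cookie_ids(cookies)
    found = shared_ids(redirects, entity_map)
    return {i: domains for i, domains in found.items() if i in known}
```

A small usage example with made-up data: the ID ABCDEF123456789 is stored in a persistent cookie and appears in requests to two unrelated domains, so it is reported as a cookie sync.

```python
cookies = [("uid:ABCDEF123456789", 1700000000)]
redirects = [
    ("https://a.example/r?x=1", "https://b.example/sync?uid=ABCDEF123456789"),
    ("https://b.example/sync?uid=ABCDEF123456789", "https://b.example/done"),
]
print(cookie_syncs(cookies, redirects, entity_map={}))
```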