Commit

Initial commit
wisepythagoras committed Feb 9, 2020
0 parents commit ceb4f03
Showing 21 changed files with 1,030 additions and 0 deletions.
103 changes: 103 additions & 0 deletions .gitignore
@@ -0,0 +1,103 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
*.pcap

21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 Constantine Apostolou

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
84 changes: 84 additions & 0 deletions README.md
@@ -0,0 +1,84 @@
# Website Fingerprinting

Website fingerprinting is a method of Tor or VPN packet inspection that aims to collect enough features and information from individual sessions to aid in identifying the activity of anonymized users.

## Introduction

For this experiment, Tor and the Lynx browser are required. They can be installed by running the following commands:

``` bash
# For Debian or Ubuntu
sudo apt install tor lynx

# For Fedora
sudo yum install tor lynx
```

By installing Tor we also get a program called `torsocks`; this program will be used to redirect the traffic of common programs through the Tor network. For example, it can be run as follows:

``` bash
# SSH through Tor.
torsocks ssh [email protected]

# curl through Tor.
torsocks curl -L http://httpbin.org/ip

# Etc...
```

### Required Python Modules

``` bash
pip install sklearn dpkt
```

## Data Collection

The data collection process is fairly manual, so two terminal windows in a side-by-side orientation are required. It is also advised to collect the fingerprints in a VM, in order to avoid capturing any unintended traffic. To listen for traffic there is a script, [capture.sh](pcaps/capture.sh), which should be run in one of the terminals:

``` bash
./pcaps/capture.sh duckduckgo.com
```

Once the listener is capturing traffic, run the following in the other terminal:

``` bash
torsocks lynx https://duckduckgo.com
```

Once the website has finished loading, kill the capture process along with the browser session (by hitting the `q` key twice). The process should be repeated several times for each web page so that there is enough data; a quick way to sanity-check each capture is sketched below.
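
Because only the first few dozen packets of each capture end up mattering (see the Machine Learning section below), it is worth checking that every pcap actually contains enough packets before moving on. The helper below is a hypothetical sketch, not part of this repository; it only assumes that `dpkt` (a listed dependency) can read the saved capture, and the example file path is made up:

``` python
import sys

import dpkt


def count_packets(pcap_path):
    """Count the packets stored in a pcap file."""
    with open(pcap_path, "rb") as f:
        return sum(1 for _ in dpkt.pcap.Reader(f))


if __name__ == "__main__":
    # e.g. python check_capture.py pcaps/duckduckgo.com.pcap (hypothetical path)
    path = sys.argv[1]
    n = count_packets(path)
    print(f"{path}: {n} packets")
    if n < 40:
        print("Warning: fewer than 40 packets; consider re-capturing.")
```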

## Machine Learning

[Scikit Learn](http://scikit-learn.org/stable/) was used to write a [k-Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) classifier that reads the pcap files specified in the [config.json](config.json) file. `config.json` can be changed according to which web pages were targeted for training. The training script is [gather_and_train.py](gather_and_train.py); a rough sketch of the overall approach follows below.
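
The exact feature extraction lives in [gather_and_train.py](gather_and_train.py); the snippet below is only a sketch of the general idea, under the assumption that the sizes of the first 40 packets of each capture are used as the feature vector. The per-site pcap layout (`pcaps/<site>-*.pcap`) and the choice of `n_neighbors` are illustrative guesses, not necessarily what the repository does:

``` python
import glob
import json
import pickle

import dpkt
from sklearn.neighbors import KNeighborsClassifier

N_PACKETS = 40  # only the first 40 packets of each capture are used


def packet_sizes(pcap_path, n=N_PACKETS):
    """Return the sizes of the first n packets, zero-padded to length n."""
    sizes = []
    with open(pcap_path, "rb") as f:
        for _, buf in dpkt.pcap.Reader(f):
            sizes.append(len(buf))
            if len(sizes) == n:
                break
    return sizes + [0] * (n - len(sizes))


with open("config.json") as f:
    sites = json.load(f)["pcaps"]

X, y = [], []
for label, site in enumerate(sites):
    # Hypothetical layout: one pcap per capture session, grouped by site.
    for path in glob.glob(f"pcaps/{site}-*.pcap"):
        X.append(packet_sizes(path))
        y.append(label)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

with open("classifier-nb.dmp", "wb") as f:
    pickle.dump(clf, f)
```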

<p align="center">
<img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_classification_0021.png" alt="Scikit Learn kNN" />
</p>

## Classifying Unknown Traffic

Once training is done and `classifier-nb.dmp` has been created, the [predict.py](predict.py) script can be run with a pcap file as its sole argument. The script will load the classifier and attempt to identify which web page the traffic originated from.

It is worth noting that only the first 40 packets of each sample are used, both to train the model and to run new captures through the resulting classifier.
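
Again, [predict.py](predict.py) is the script that actually does this; the following is just a minimal sketch of the prediction step, assuming the same hypothetical first-40-packet-size features as in the training sketch above:

``` python
import pickle
import sys

import dpkt

N_PACKETS = 40


def packet_sizes(pcap_path, n=N_PACKETS):
    """Same hypothetical feature extraction as in the training sketch."""
    sizes = []
    with open(pcap_path, "rb") as f:
        for _, buf in dpkt.pcap.Reader(f):
            sizes.append(len(buf))
            if len(sizes) == n:
                break
    return sizes + [0] * (n - len(sizes))


with open("classifier-nb.dmp", "rb") as f:
    clf = pickle.load(f)

features = packet_sizes(sys.argv[1])
print("Predicted site index:", clf.predict([features])[0])
```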

<p align="center">
<img src="graphs/graph-screenshot.png" alt="Visualizing the patterns" />
</p>

As can be seen in the screenshot above, the packet patterns of each website separate clearly in three dimensions. The classifier works with the data in a similar way and gives us the most accurate result.

An interactive version of this graph can be found in the [graphs](graphs) folder.
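
As an illustration only, a similar 3D scatter could be produced with matplotlib roughly as follows. The feature extraction, the pcap layout, and the choice of axes (packet index, packet size, site) are assumptions carried over from the sketches above, not how the graph in [graphs](graphs) was actually generated:

``` python
import glob
import json

import dpkt
import matplotlib.pyplot as plt

N_PACKETS = 40


def packet_sizes(pcap_path, n=N_PACKETS):
    # Same hypothetical feature extraction as in the earlier sketches.
    sizes = []
    with open(pcap_path, "rb") as f:
        for _, buf in dpkt.pcap.Reader(f):
            sizes.append(len(buf))
            if len(sizes) == n:
                break
    return sizes + [0] * (n - len(sizes))


with open("config.json") as f:
    sites = json.load(f)["pcaps"]

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

for label, site in enumerate(sites):
    for path in glob.glob(f"pcaps/{site}-*.pcap"):  # hypothetical layout
        sizes = packet_sizes(path)
        # x: packet index, y: packet size in bytes, z: site index
        ax.scatter(range(len(sizes)), sizes, [label] * len(sizes), s=5)

ax.set_xlabel("Packet index")
ax.set_ylabel("Packet size (bytes)")
ax.set_zlabel("Site")
plt.show()
```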

## Limitations and Disclaimers

This setup was created in order to research the topic of website fingerprinting and how easy it is to attempt to deanonymize users over Tor or VPNs. Traffic was captured and identified in a private setting and for purely academic purposes; use of this source code is intended for those reasons only.

In reality, traffic is never "clean", as was assumed here for simplicity. However, if an entity has enough resources, the desired anonymized traffic can be isolated and fed into this simple classifier. This means that it is entirely possible to use a method like this to compromise anonymized users.

## References

1. Wang, T. and Goldberg, I. (2017). Website Fingerprinting. [online] Cse.ust.hk. Available at: https://www.cse.ust.hk/~taow/wf/.
2. Wang, T. and Goldberg, I. (2017). Improved Website Fingerprinting on Tor. Cheriton School of Computer Science. Available at: http://www.cypherpunks.ca/~iang/pubs/webfingerprint-wpes.pdf
3. Wang, T. (2015). Website Fingerprinting: Attacks and Defenses. University of Waterloo. Available at: https://uwspace.uwaterloo.ca/bitstream/handle/10012/10123/Wang_Tao.pdf

Binary file added classifier-nb.dmp
Binary file not shown.
11 changes: 11 additions & 0 deletions config.json
@@ -0,0 +1,11 @@
{
    "pcaps": [
        "duckduckgo.com",
        "github.com",
        "jjay.cuny.edu",
        "telegram.org",
        "reddit.com",
        "torproject.org",
        "perdu.com"
    ]
}
97 changes: 97 additions & 0 deletions defense.js
@@ -0,0 +1,97 @@
"use strict";

const TorAgent = require("toragent");
const request = require("request");
const mysources = require("./mysources.json");

let websites = mysources.sources;
let agent = null;

/**
* Gets some random URLs from articles.
* @param {function} callback This is called when this is done.
*/
function getNews(callback) {
let newsUrl = "https://newsapi.org/v2/everything?q=snowden&sortBy=publishedAt&apiKey=";
newsUrl += mysources.newsapi_key;

// Get the articles.
get(newsUrl, (error, results) => {
if (!error) {
// Parse the results.
results = JSON.parse(results);

// Add the articles to the list.
for (let i = 0; i < results.articles.length; i++) {
websites.push(results.articles[i].url);
}
} else {
throw(error);
}

callback();
});
}

/**
* Connects to the Tor network.
* @param {function} callback The callback function that's called once we're
* connected to the network.
*/
function connect(callback) {
console.log("Getting new identity");

TorAgent.create(false, function(error, newAgent) {
if (error) {
// Unable to connect to the Tor network
throw(err);
}

agent = newAgent;
callback();
});
}

/**
* Creates a simple HTTP GET request.
* @param {string} url The URL to get.
* @param {function} callback The function called once the request is done.
*/
function get(url, callback) {
request({
url: url,
agent: agent,
rejectUnauthorized: false,
}, function(err, res, body) {
if(err) {
websites.splice(websites.indexOf(url), 1);
callback(err, false);
} else {
callback(null, body);
}
});
}

/**
* Gets a random website from the list.
* @returns {string} A URL of a page to load.
*/
function getRandomPage() {
let len = websites.length;
return websites[Math.round(Math.random() * len) % len];
}

/**
* Generates random traffic.
*/
function loadRandomUrl() {
let page = getRandomPage();
get(page, (err) => console.error(err ? err : `GET: ${page}`));

setTimeout(loadRandomUrl, Math.round(Math.random() * 800) % 800);
}

// Connect to Tor and begin jamming the session.
connect(() => {
getNews(loadRandomUrl);
});

2 comments on commit ceb4f03

@KritiJethlia

Hi! I have a question about your data collection method.
Why are the websites loaded using Lynx and not just curl?
If they had been loaded using curl, would that have had any impact on the captured data?

@wisepythagoras
Owner Author


> Why are the websites loaded using Lynx and not just curl?

@KritiJethlia lynx here can be replaced by curl. It can also be replaced by Firefox, Chrome, or anything else that can load a web page.

I picked Lynx because it gave me a browser environment in the terminal. This means that it loads - or at least tries to load - most assets on the page, as opposed to just getting the HTML of the landing page. At the end of the day we found that this method gave us the most accurate results.
