This repository contains the source code for the final project for TESI. The following dependencies are required for it to run as expected, assuming that pip is the preferred package manager for Python:
pip install numpy scipy scikit-learn scrapy nltk bottle
This platform is divided into relatively independent modules, which are called in sequence to provide the language transformation functionality. The main script is simplify_web.py:
./simplify_web <url>
Source: classifier/word_classifier.py
Evaluate and train classifiers for different feature extraction setups, generating a file that can be imported from Python as a ready-to-use classifier. This way, the classifier is trained only once and it can be used from other programs as long as the input conforms to its expectations. This trained complex word classifier will be used in step 2.
Source: src/crawler/mashable_crawler.py
Extract plain text from HTML articles on the web, as well as other elements that help keeping the site's experience (e.g. images, styles). Plain text (split into sentences) goes directly to step 2 to be processed, other elements go to step 4 as there is no need to transform or process them.
Source: src/classifier/word_classifier.py
Use the trained complex word classifier with a sentence as input and return a list of complex words on the sentence.
Source: src/synonyms/synonym_replace.py
From a sentence, a list of complex words and their respective start and end positions in the sentence, map each complex word with a simple synonym and then replace each occurrence in the sentence.
Source: src/website/web_generator.py
Create a web page using text output from step 3 and other elements from step 1 to keep the site's experience. Open a web browser tab with this page, using (if possible) the default program for this task.
To install the Google Chrome extension, go to chrome://extensions. Then, enable the developer mode, and select "Install unpackaged extension".
Source: src/web_service
When the file browser opens, navigate to "src/web_service" and click Accept.
Then, for the extension to work, you must keep the following script running - src/server.py, as it will be listening to the browser extension and will be in charge of opening the simplified web once generated.
./server.py
Note that this script is just a wrapper around the command line interface.