RMIT at PAN-CLEF 2020: Profiling Fake News Spreaders on Twitter

Implementation of our system submitted to the "Profiling Fake News Spreaders on Twitter" task at PAN @ CLEF 2020.

Citation

If you use this resource, please cite our paper:

Xinhuan Duan, Elham Naghizade, Damiano Spina, and Xiuzhen Zhang. 2020. RMIT at PAN-CLEF 2020: Profiling Fake News Spreaders on Twitter. In CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020).

BibTeX

@InProceedings{duan2020rmit,
author = {Duan, Xinhuan and Naghizade, Elham and Spina, Damiano and Zhang, Xiuzhen},
title = {{RMIT at PAN-CLEF 2020: Profiling Fake News Spreaders on Twitter}},
booktitle = {{CLEF 2020 Labs and Workshops, Notebook Papers}},
year={2020}
}

Data Preparation

Download the files from GitHub, then change into the repository directory:

cd path-to-your-repository

You are then ready to reproduce our PAN 2020 submission.

  • Note: if you only want to use our software, switch to the software branch, where you can use it directly after completing the installation.
  • Note: your computer should support CUDA and cuDNN. If you haven't installed CUDA or cuDNN, go to the following links:

Installation

python3 -m pip install -r requirement.txt
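
Once the requirements are installed, a quick check like the one below can confirm that a CUDA device is visible. This is a minimal sketch that assumes the project runs on PyTorch (suggested by the `.pt` model file, but not stated in this README); `cuda_available` is a hypothetical helper and is not part of the repository.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    # Assumption: the project uses PyTorch. If it is not installed, report False.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints `False`, training may still run on the CPU, but that depends on how the scripts select their device.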

Build the TLSP model

Data preparation

According to the PAN restrictions, the training data for this task must not be made public. To request access to the data, go to this link: https://pan.webis.de/clef20/pan20-web/author-profiling.html

After downloading the data, copy the en and es folders into the relative path /text_classification/data
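
Before training, it can help to verify that both language folders are where the scripts expect them. The sketch below assumes the path is relative to the repository root; `missing_language_folders` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_language_folders(data_dir, languages=("en", "es")):
    """Return the language folders that are not present under data_dir."""
    root = Path(data_dir)
    return [lang for lang in languages if not (root / lang).is_dir()]


if __name__ == "__main__":
    # Assumption: run from the repository root, so the data path is relative to it.
    missing = missing_language_folders("text_classification/data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Data folders found.")
```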

Train a single model

python3 modelTrainer.py

The folder will then contain a new file, "BERT-model.pt". This is a tweet-level model that predicts whether a user is a fake news spreader.

Reproduce a 10-fold validation

Because this project involves both tweet-level and user-level classification, the tweet-level and user-profile-level models must be trained on the same data in each fold. The 10-fold validation is therefore implemented manually: all users, together with their tweets, are divided into 10 folds, which are saved in 10 CSV files. The train_model() method in modelTrainer.py then runs the 10-fold validation.
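
The idea of splitting by user, so that all tweets of a user land in the same fold, can be sketched as follows. This is an illustration under stated assumptions and not the code used in the repository: the function names, the `fold{k}.csv` file names, and the `user_id`/`tweet`/`label` columns are all hypothetical.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path


def assign_user_folds(user_ids, n_folds=10, seed=0):
    """Map each distinct user id to a fold index in [0, n_folds)."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    # Round-robin assignment keeps fold sizes within one user of each other.
    return {user: i % n_folds for i, user in enumerate(users)}


def write_fold_csvs(rows, out_dir, n_folds=10, seed=0):
    """Write one CSV per fold; rows are (user_id, tweet_text, label) tuples."""
    folds = assign_user_folds([row[0] for row in rows], n_folds, seed)
    grouped = defaultdict(list)
    for row in rows:
        # Every tweet follows its author's fold, so no user spans two folds.
        grouped[folds[row[0]]].append(row)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for k in range(n_folds):
        with open(out / f"fold{k}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "tweet", "label"])
            writer.writerows(grouped[k])
    return folds
```

Because the fold assignment is keyed on the user, the same mapping can be reused for the user-level features, which keeps the two levels aligned.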

Build the profile-level model

python3 main.py

The script produces a confusion matrix for the 10-fold validation result reported in the paper. The user-level features have already been extracted and written to CSV files in the path:

csvs/3rd/user0-user9

In main.py, on lines 56 and 57, you can add or remove entries in the columns list to see how the performance changes:

  columns = ['median_score','mean_score','score_std','median_compound','mean_compound','compound_std','emoji','hash',
               'hash_median','hash_std','url','url_median','url_std']
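
To illustrate how a column list like this might select features from a user-level CSV file, here is a minimal sketch. `select_features` is a hypothetical helper and the rows are made-up toy values; the actual loading code in main.py may differ.

```python
import csv
import io


def select_features(csv_file, columns):
    """Return one list of floats per user, keeping only the named columns."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in CSV: {missing}")
    return [[float(row[c]) for c in columns] for row in reader]


# Toy, made-up rows; the real files contain more columns and users.
toy_csv = "mean_score,score_std,emoji\n0.61,0.12,3\n0.35,0.20,0\n"

full = select_features(io.StringIO(toy_csv), ["mean_score", "score_std", "emoji"])
ablated = select_features(io.StringIO(toy_csv), ["mean_score", "emoji"])
print(len(full[0]), len(ablated[0]))  # 3 2
```

Dropping a name from the list removes that feature from every user's vector, which is what makes this kind of ablation easy to run.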

You can edit the user CSV files and add your own custom values for the evaluation, or use the function on line 58 to add new data:

  columns = assemble(columns, 'trump')

Contact

For more information, please contact the first author Xinhuan Duan: [email protected]