This project is the first attempt at political bias classification of German news.
Check out our paper: Fine-grained Classification of Political Bias in German News: A Data Set and Initial Experiments
We crawled our data from various German news sites using the news-please library. After that, we manually cleaned the data and labeled it using Medienkompass. The dataset was then preprocessed using the HuggingFace NLP library.
Due to copyright issues we cannot publish the data, but we provide the list of URLs you can use to rebuild the dataset on your own. To download all the data, run:
from newsplease import NewsPlease
NewsPlease.from_file('urls/urls.txt')
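If you want to persist the crawled articles for the preprocessing step, here is a minimal sketch. It assumes, as in the news-please documentation, that from_file returns a dict mapping each URL to an article object with .title and .maintext attributes; the output path is only an example:

import json
from newsplease import NewsPlease

# Assumption: from_file returns {url: NewsArticle} with .title / .maintext attributes.
articles = NewsPlease.from_file('urls/urls.txt')
with open('data/articles.json', 'w', encoding='utf-8') as f:
    json.dump({url: {'title': a.title, 'maintext': a.maintext}
               for url, a in articles.items()},
              f, ensure_ascii=False, indent=2)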
Then run the preprocessing script:
python preprocess.py -data_folder='path/to/your/downloaded/data'
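preprocess.py handles this step for you; purely for illustration, loading the cleaned articles into a HuggingFace dataset could look roughly like the sketch below (the HuggingFace NLP library is now published as datasets; the file name and column names are assumptions):

import json
from datasets import Dataset

with open('data/articles.json', encoding='utf-8') as f:
    raw = json.load(f)

dataset = Dataset.from_dict({
    'title': [a['title'] for a in raw.values()],
    'text': [a['maintext'] for a in raw.values()],
})
dataset = dataset.filter(lambda ex: bool(ex['text']))  # drop articles without extracted text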
We evaluated several classification models on the dataset, using Bag-of-Words, TF-IDF, and BERT features. To reproduce the former two, run the BOW_baseline.ipynb and TFIDF_baseline.ipynb notebooks. To train the BERT-based models, fine-tune the HuggingFace implementation of German BERT:
python train.py -data_folder="data" -model_folder="models/BERT" -batch_size=8 -num_epochs=2
After that, run the BERT_baseline.ipynb notebook.
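For orientation, here is a rough sketch of the kind of fine-tuning step train.py performs with the transformers library; the checkpoint name, label count, and placeholder data are assumptions for illustration, not the repository's exact setup:

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = 'bert-base-german-cased'  # German BERT checkpoint on the HuggingFace hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)  # label count is a placeholder

# Placeholder data; in practice this comes from the preprocessed dataset.
train_data = Dataset.from_dict({
    'text': ['Ein politischer Text', 'Noch ein politischer Text'],
    'label': [0, 1],
})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=512)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir='models/BERT',
                         per_device_train_batch_size=8,
                         num_train_epochs=2)
Trainer(model=model, args=args, train_dataset=train_data).train()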
Using our two best models for TF-IDF and BERT features, we implemented a demo system that predicts the political bias of a single arbitrary text and generates a list of the words that push the system toward its decision. The models can be downloaded from here. To use the system, run:
python predict.py -file_path="text_sample.txt" -method="tfidf" -explain=False
or call it from Python:
from BiasPredictor import biasPredictor
predictor = biasPredictor("bert")
prediction = predictor.predict(text="Ein politischer Text", explain=True)
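As a rough, self-contained illustration of how such an explanation can work (this is not the BiasPredictor implementation), a TF-IDF model with a linear classifier can surface the decisive words by ranking each word's contribution, i.e. its TF-IDF value times the weight of the predicted class:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder training data; the real models are trained on the news dataset.
texts = ['Text der ersten Klasse', 'Text der zweiten Klasse', 'Text der dritten Klasse']
labels = [0, 1, 2]

vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

def explain(text, top_k=10):
    x = vectorizer.transform([text])
    pred = clf.predict(x)[0]
    # Per-word contribution to the predicted class: TF-IDF value * class weight.
    contrib = x.toarray()[0] * clf.coef_[list(clf.classes_).index(pred)]
    words = vectorizer.get_feature_names_out()
    top = np.argsort(contrib)[::-1][:top_k]
    return pred, [(words[i], contrib[i]) for i in top if contrib[i] > 0]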