# Fake News Detection on FakeNewsCorpus and LIAR
This repository contains the code for a project on fake news detection using the FakeNewsCorpus and LIAR datasets. Logistic Regression, Naive Bayes, and DistilBERT models are used for classification. The project is implemented in Python and uses libraries such as scikit-learn, pandas, and transformers.

Sample logs from our final run of the scripts are stored in the `archives` directory.
## Requirements
Python 3.12 was used for this project. The code has not been tested with, and is not guaranteed to work on, any other version of Python.
All dependencies are listed in the `requirements.txt` file. You can install them using pip:

```shell
pip install -r requirements.txt
```
NOTE 1: If you are using a CUDA-capable GPU, make sure to install the appropriate version of PyTorch with CUDA support. The `requirements.txt` file includes the CPU version of PyTorch by default; if you want the GPU version, modify `requirements.txt` accordingly. Installation instructions are available on the PyTorch website.
NOTE 2: You should also verify FP16, BF16, and TF32 support for your GPU. The `cuda_checker.py` script is not intended to provide an absolute guarantee of compatibility. More information about CUDA compatibility is available in NVIDIA's CUDA documentation.
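As a rough illustration of what such a precision check involves (the thresholds below are a general rule of thumb, not logic taken from `cuda_checker.py`): FP16 arithmetic is available from compute capability 5.3, while BF16 and TF32 require Ampere (compute capability 8.0) or newer.

```python
# Illustrative mapping from CUDA compute capability to reduced-precision
# support. These thresholds are a rule of thumb (FP16 arithmetic from sm_53,
# BF16 and TF32 from Ampere / sm_80) and are NOT a substitute for
# cuda_checker.py or the official CUDA documentation.

def precision_support(major: int, minor: int) -> dict:
    """Return which reduced-precision modes a GPU with the given
    compute capability is expected to support."""
    cc = major * 10 + minor
    return {
        "fp16": cc >= 53,  # half-precision arithmetic (sm_53 and newer)
        "bf16": cc >= 80,  # bfloat16 needs Ampere or newer
        "tf32": cc >= 80,  # TF32 tensor cores need Ampere or newer
    }

if __name__ == "__main__":
    # With PyTorch installed, the capability of GPU 0 can be read via:
    #   major, minor = torch.cuda.get_device_capability(0)
    print(precision_support(8, 6))  # e.g. an RTX 30-series GPU
```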
## How to run and get results similar to our sample logs
- Clone the repository and navigate to the project directory.
- Install the required dependencies using the command mentioned above.
- Download and extract the datasets into the `data` directory. The directory structure should look like this:

  ```
  data/
  ├── news_cleaned_2018_02_13.csv  # FakeNewsCorpus dataset
  └── liar_test.tsv                # LIAR test dataset
  ```
- Run `fnc1a.py` to process the FakeNewsCorpus dataset and create the `processed_fakenews.parquet` file. This file will be used for training and evaluating our models.
  - The script will also create a `processed_fakenews.csv` file, which is the same as its parquet version but in CSV format, just for reference.
  - The script will also convert the original CSV file to parquet format and save it as `news_cleaned_2018_02_13.parquet`. This avoids loading the original CSV file every time the data needs to be processed; the parquet file is much smaller and faster to read.
  - WARNING: This script takes a long time to run and uses a lot of memory. It is recommended to run it in the background or on a separate machine.
    - The final run took approximately 5 hours on a machine with 32 GB of RAM + 16 GB of swap and 8 CPU cores (`pandarallel` only used 6 of them to prevent out-of-memory errors).
- Run `fnc1b.py` to sample 10% of the data from `processed_fakenews.parquet` and split it into training, validation, and test datasets. They will be saved as `sampled_fakenews_<split>.parquet` files. The splits are:
  - `train`: 80% of the sampled data
  - `valid`: 10% of the sampled data
  - `test`: 10% of the sampled data
  - The script will also create `sampled_fakenews_<split>.csv` files, which are the same as their parquet versions but in CSV format, just for reference.
  - WARNING: This script also takes quite long to run and uses a lot of VRAM. It is recommended to run it in the background or on a separate machine.
    - The final run took approximately 4 hours on a machine with 8 GB of VRAM.
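At a high level, the 10% sample and 80/10/10 split can be sketched with stdlib Python (the function below is a simplified illustration; the actual script operates on the parquet data):

```python
import random

def sample_and_split(rows, frac=0.10, seed=42):
    """Sample `frac` of the rows, then split the sample 80/10/10 into
    train/valid/test -- a simplified sketch of what fnc1b.py produces."""
    rng = random.Random(seed)
    sampled = rng.sample(rows, k=int(len(rows) * frac))
    n_train = int(len(sampled) * 0.8)
    n_valid = int(len(sampled) * 0.1)
    return {
        "train": sampled[:n_train],
        "valid": sampled[n_train:n_train + n_valid],
        "test":  sampled[n_train + n_valid:],
    }

splits = sample_and_split(list(range(10_000)))
print({k: len(v) for k, v in splits.items()})
# {'train': 800, 'valid': 100, 'test': 100}
```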
- Run `fnc4a.py` to process the LIAR test dataset and create the `liar_processed.csv` file. This file will be used for evaluating our models.
- Run `fnc2.py` to train and evaluate the Logistic Regression and Naive Bayes models on the sampled FakeNewsCorpus dataset and the processed LIAR test dataset. The results are displayed in the terminal only. If you want to save the results to a file, run it as follows:

  ```shell
  python -u src/fnc2.py > <output_file_name>.log 2>&1 &
  ```
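As a hedged sketch of this classical baseline setup (the vectorizer settings and the toy data below are illustrative assumptions, not the actual configuration in `fnc2.py`):

```python
# Minimal sketch of a TF-IDF + Logistic Regression / Multinomial Naive Bayes
# baseline. Hyperparameters and data here are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = ["shocking miracle cure revealed", "senate passes budget bill",
         "aliens endorse candidate", "central bank raises interest rates"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = reliable (toy labels)

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(texts)

# fit() returns the estimator, so the fitted models can be collected directly
models = {type(m).__name__: m.fit(X, labels)
          for m in (LogisticRegression(max_iter=1000), MultinomialNB())}
for name, m in models.items():
    print(name, m.score(X, labels))  # training accuracy on the toy data
```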
- Run `fnc3.py` to train and evaluate the DistilBERT model on the sampled FakeNewsCorpus dataset. Training checkpoints are saved to `src/results`. In case the training is interrupted, you can resume it by running the script again with `trainer.train(resume_from_checkpoint=True)`. The trained model is saved to the `fake_news_bert` directory. Again, the results are displayed in the terminal only.
- Run `fnc4b.py` to evaluate the DistilBERT model on the processed LIAR test dataset. Again, the results are displayed in the terminal only.
## Extra scripts
- `cuda_checker.py`: A script to check whether your GPU supports FP16, BF16, and TF32. It will also check if your GPU is compatible with the CUDA version you have installed. Do not rely on this script entirely to check for compatibility; it is just a helper script to give you an idea of your GPU's capabilities.
- `parquet_validator.py`: A script to validate the parquet files created by `fnc1a.py` and `fnc1b.py`. It checks whether the files are valid, whether they can be read correctly, and whether the column data types are correct. You can trust this one.