Fake News Detection
Fake News Detection in Python
In this project, we have used various natural language processing techniques and machine learning algorithms to classify fake news articles using sci-kit libraries from python.
Dataset used
The data source used for this project is LIAR dataset which contains 3 files with .tsv format for test, train and validation. Below is some description about the data files used for this project.
LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION
the original dataset contained 13 variables/columns for train, test and validation sets as follows:
- Column 1: the ID of the statement ([ID].json).
- Column 2: the label. (Label class contains: True, Mostly-true, Half-true, Barely-true, FALSE, Pants-fire)
- Column 3: the statement.
- Column 4: the subject(s).
- Column 5: the speaker.
- Column 6: the speaker’s job title.
- Column 7: the state info.
- Column 8: the party affiliation.
- Column 9-13: the total credit history count, including the current statement.
- 9: barely true counts.
- 10: false counts.
- 11: half true counts.
- 12: mostly true counts.
- 13: pants on fire counts.
- Column 14: the context (venue / location of the speech or statement).
To make things simple we have chosen only 2 variables from this original dataset for this classification. The other variables can be added later to add some more complexity and enhance the features.
Below are the columns used to create 3 datasets that have been in used in this project
- Column 1: Statement (News headline or text).
- Column 2: Label (Label class contains: True, False)
You will see that newly created dataset has only 2 classes as compared to 6 from original classes. Below is method used for reducing the number of classes.
- Original – New
- True – True
- Mostly-true – True
- Half-true – True
- Barely-true – False
- False – False
- Pants-fire – False
The dataset used for this project were in csv format named train.csv, test.csv and valid.csv and can be found in repo. The original datasets are in “liar” folder in tsv format.