Fake news detection is one of the most challenging problems faced by online social media sites. With the number of online users increasing every day, the volume of information on the internet is growing rapidly, and any user can share anything on social media. Often, groups of individuals share manipulated information on these sites, and many people believe their false stories. It has therefore become increasingly important for social media sites to detect fake news content and remove it.
To study this problem in detail, we implemented a machine learning project on fake news detection as part of the Machine Learning course in the Monsoon 2020 semester. This blog describes our project in detail.
Dataset Description
We collected the dataset from Kaggle. It is divided into a training set and a test set, and contains the features id, title, author, text and label. We used the title and text features for classification. The training set contains around 20,000 samples and the test set around 5,000 samples.
Text Preprocessing
- Removing unnecessary columns such as id and author from the dataset.
- Combining the title and text columns.
- Replacing English contractions with their expanded forms.
- Tokenizing the text with the nltk library.
- Removing English stop words from the dataset.
- Stemming and removing single-character words.
- Encoding the text as vectors.
- Padding the encoded text vectors so that every data sample has the same dimension.
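The steps above can be sketched in plain Python. This is only an illustration, not the project's actual code: the contraction table, the stop-word set, the regex tokenizer and the suffix-stripping `crude_stem` are small hypothetical stand-ins for what nltk provides, and the choices of id 0 for padding, id 1 for unknown words and padding at the end are assumptions.

```python
import re

# Tiny illustrative tables (assumptions); a real pipeline would use fuller
# lists, for example the English stop words shipped with nltk.
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "in", "of", "to", "and"}


def expand_contractions(text):
    # Expects lowercased text.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def tokenize(text):
    # Regex stand-in for an nltk tokenizer: keeps lowercase alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    # Very rough suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, text):
    # Combine title and text, expand contractions, tokenize,
    # drop stop words, stem, then drop single-character tokens.
    combined = expand_contractions((title + " " + text).lower())
    tokens = [t for t in tokenize(combined) if t not in STOP_WORDS]
    stems = [crude_stem(t) for t in tokens]
    return [s for s in stems if len(s) > 1]


def build_vocab(docs):
    # Id 0 is reserved for padding and id 1 for unknown words (assumption).
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def encode_and_pad(doc, vocab, max_len):
    # Map tokens to ids, truncate to max_len, then pad with zeros at the end.
    ids = [vocab.get(tok, 1) for tok in doc][:max_len]
    return ids + [0] * (max_len - len(ids))


doc = preprocess("Breaking News",
                 "It's claimed the senators voted in a secret meeting")
vocab = build_vocab([doc])
padded = encode_and_pad(["break", "meet", "hoax"], vocab, max_len=5)
```

Here `doc` is a list of stemmed tokens and `padded` is a fixed-length list of integer ids, which is the shape of input that an embedding layer expects.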
Model Details
Different models were trained using the preprocessed dataset. The details of the models are given below:
- Naive Bayes and logistic regression: We used bag-of-words and TF-IDF vectorization to create vectors from the text data, then trained Naive Bayes and logistic regression models on the transformed dataset.
- Deep learning techniques: We used two neural network architectures: a simple LSTM (Long Short-Term Memory) network, and an LSTM combined with a convolutional neural network (CNN).
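To make the first approach concrete, here is a minimal pure-Python sketch of TF-IDF weighting and a multinomial Naive Bayes classifier trained on bag-of-words counts. It is not the project's implementation, which most likely relied on a library. The smoothed idf formula (without normalization), the Laplace smoothing parameter `alpha`, the toy documents and the convention that label 1 means fake are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (one common variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


class MultinomialNB:
    """Multinomial Naive Bayes on bag-of-words counts with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.log_prior = {}
        self.log_like = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_like[c] = {
                t: math.log((counts[t] + alpha) / total) for t in self.vocab
            }
        return self

    def predict(self, doc):
        # Words never seen in training are ignored.
        def score(c):
            return self.log_prior[c] + sum(
                self.log_like[c][t] for t in doc if t in self.vocab
            )
        return max(self.classes, key=score)


# Toy training data (made up); label 1 = fake, 0 = real (assumption).
train_docs = [
    ["shock", "secret", "cure"],
    ["senat", "pass", "budget"],
    ["secret", "alien", "cure"],
    ["budget", "vote", "senat"],
]
labels = [1, 0, 1, 0]

weights = tfidf(train_docs)
model = MultinomialNB().fit(train_docs, labels)
prediction = model.predict(["secret", "cure", "budget"])
```

Terms that appear in fewer documents receive a larger idf, so TF-IDF downweights words that are common across the whole corpus, while the classifier picks the class whose word distribution best explains the document.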
Neural Network Models
Model 1
We used GloVe embeddings for the embedding layer, followed by a simple LSTM layer and one dense layer.
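A common way to use pretrained GloVe vectors is to build an embedding matrix whose row i holds the vector for the word with integer id i, and to use that matrix as the initial weights of the embedding layer. The sketch below shows that step only, under assumptions: the helper names `load_glove` and `build_embedding_matrix`, the toy 3-dimensional vectors and the word ids are hypothetical, and the input is assumed to follow the usual GloVe text format of one word per line followed by its space-separated components.

```python
import io


def load_glove(handle):
    # Each line: a word followed by its vector components, space separated.
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    # Row i holds the vector for the word whose integer id is i.
    # Row 0 (padding) and words missing from the GloVe file stay all zeros.
    matrix = [[0.0] * dim for _ in range(max(word_index.values()) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix


# Toy 3-dimensional "GloVe" file, made up for illustration.
sample = io.StringIO("news 0.1 0.2 0.3\nfake -0.4 0.5 0.6\n")
vectors = load_glove(sample)

# Hypothetical word ids, as produced by the encoding step.
word_index = {"fake": 1, "news": 2, "hoax": 3}
matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

In this toy case "hoax" has no pretrained vector, so its row stays at zero. Whether such rows are left at zero, initialized randomly or fine-tuned during training is a design choice that the original description does not specify.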
Model 2
In this neural network model we used Word2Vec embeddings for the embedding layer, followed by two CNN layers with max pooling. After that we used one LSTM layer and one dense layer.
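One effect of placing convolution and pooling layers before the LSTM is that the LSTM processes a shorter sequence. The sketch below traces the sequence length through such a stack. Every number in it is an assumption, since the post does not state the hyperparameters: an input length of 300 tokens, a kernel size of 5, a pool size of 2, no padding in the convolutions, and one pooling layer after each convolution.

```python
def conv1d_length(length, kernel_size, stride=1):
    # Output length of a 1D convolution with no padding.
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size):
    # Output length of non-overlapping max pooling (stride equal to pool size).
    return (length - pool_size) // pool_size + 1


def trace_model2(seq_len, kernel_size, pool_size):
    # Returns (layer name, sequence length after that layer) pairs.
    steps = [("embedding", seq_len)]
    length = seq_len
    for i in (1, 2):
        length = conv1d_length(length, kernel_size)
        steps.append((f"conv{i}", length))
        length = maxpool1d_length(length, pool_size)
        steps.append((f"pool{i}", length))
    return steps


# Assumed hyperparameters, for illustration only.
steps = trace_model2(seq_len=300, kernel_size=5, pool_size=2)
lstm_steps = steps[-1][1]
```

Under these assumed settings the LSTM would see 72 time steps instead of 300, each summarizing a local window of words.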
Results and Conclusion
The results obtained for the different models are tabulated below:
The accuracy results show that the neural network models performed better than the Naive Bayes and logistic regression models.
Blog Authors and Contributions
- Biman Giri: Literature Survey, Bag of words vectorization, Naive Bayes, Logistic Regression model implementation, Convolutional Neural Network implementation.
- Vishesh Goel: Dataset Preprocessing, TF-IDF vectorization, Naive Bayes, Logistic Regression model implementation, LSTM Neural Network implementation.
This project was done under the guidance of Dr. Tanmoy Chakraborty, IIIT Delhi, and Teaching Assistant Pragya Srivastava, IIIT Delhi.
References:
- Mykhailo Granik, Volodymyr Mesyura. Fake News Detection Using Naive Bayes Classifier. 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON). 978-1-5090-3006-4/17, IEEE.
- Dimitrios Katsaros, George Stavropoulos, Dimitrios Papakostas. Which machine learning paradigm for fake news detection? 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6934-3/19/10.
- Smitha N., Bharath R. Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020). IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2.
- Rohit Kumar Kaliyar. 2018 4th International Conference on Computing Communication and Automation (ICCCA). 978-1-5386-6947-1/18.
- Ayat Abedalla, Aisha Al-Sadi, Malak Abdullah. A Closer Look at Fake News Detection: A Deep Learning Perspective. ICAAI 2019 (ACM), October 26–28, 2019, Istanbul, Turkey.