Kaggle Multiclass Reddit Classification

One of the best-known problems in ML and NLP, in both academia and industry, is text classification. Here we explored a multi-class version of it: given a dataset of Reddit comments labeled with their subreddits (i.e. the classes), we had to train the best possible classifier over 20 classes. Data cleaning is one of the biggest challenges in most ML problems, and the high variance of natural language in very informal settings such as this one is part of what makes such classification problems so hard. The following is our attempt at this problem:

Our paper: https://drive.google.com/file/d/1PQ20eYt0ywUCgG7qd_r5dn_DSV9kVfRk/view?usp=sharing

Our code: https://github.com/JairParra/Kaggle_Reddit_Multiclass_Classification

Confusion matrix for the hyperparameter-tuned linear support vector machine classifier:
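
For readers less familiar with confusion matrices, here is a minimal sketch of how one can be built from true and predicted labels. It is only an illustration in plain Python, not the code from our repository: the helper names `confusion_matrix` and `per_class_recall`, the three subreddit names, and the toy labels are all hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_class_recall(matrix):
    """Share of each true class predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Hypothetical toy labels, not taken from the competition data.
labels = ["nba", "hockey", "movies"]
y_true = ["nba", "nba", "hockey", "hockey", "movies", "movies"]
y_pred = ["nba", "hockey", "hockey", "hockey", "movies", "nba"]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix)
```

Off-diagonal cells show which subreddits get confused with each other, which is usually the most informative part of the matrix when there are 20 classes.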

Dataset class distribution:
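
A class distribution like this one can be computed by counting labels and normalizing. The snippet below is a small sketch under assumed inputs: the function name `class_distribution` and the example subreddit labels are hypothetical and are not drawn from the actual dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Map each class to its relative frequency, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels for illustration only.
subreddits = ["nba", "hockey", "nba", "movies"]
distribution = class_distribution(subreddits)
```

A roughly uniform result suggests plain accuracy is a reasonable metric, while a skewed one suggests looking at per-class scores as well.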

