Text Classification in NLP

Introduction

Text classification is an important task in natural language processing (NLP) that involves categorizing text into predefined classes or categories. This technique has applications in spam filtering, sentiment analysis, topic categorization, and more. In this blog we will explore common approaches and provide practical insights into implementing text classification in Python.

Understanding Text Classification

Text classification involves training a machine learning model to recognize patterns in text data and assign predefined labels. The process typically consists of the following key steps:

  1. Data Collection and Preparation: Collect a labeled dataset in which each document is associated with its corresponding category. Preprocess the text by cleaning, tokenizing, and so on.
  2. Data Splitting: Split the data into training and test sets.
  3. Feature Extraction: Convert the text into a numerical format suitable for ML algorithms. Common methods include the bag-of-words model, TF-IDF, and word embeddings (see the short sketch after this list).
  4. Model Training: Select a suitable machine learning algorithm and train it on the labeled dataset.
  5. Evaluation: Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1 score. Fine-tune the model parameters if necessary.
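
For example, here is a minimal sketch of the feature extraction step using scikit-learn's TfidfVectorizer; the two example sentences are made up purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents used only to illustrate the idea
docs = [
    "free prize claim your reward now",
    "meeting moved to monday morning",
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(features.shape)                      # (2, number_of_unique_terms)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents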

Approaches to Text Classification

  1. Rule-Based Classification: Rule-based systems use predefined rules and patterns, such as keyword lists, to classify text. They are easy to interpret but often fail to capture complex relationships in language (see the toy example after this list).
  2. Traditional Machine Learning Approaches: Naive Bayes, SVMs, decision trees, random forests, and similar algorithms trained on extracted features.
  3. Deep Learning Approaches: CNNs, RNNs, Transformers, and other neural architectures that learn representations directly from text.
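
To make the first approach concrete, a rule-based classifier can be as simple as keyword matching. The sketch below is purely illustrative; the keyword set and labels are assumptions, not part of any library.

# A toy rule-based spam classifier: flag text that contains any keyword
SPAM_KEYWORDS = {"free", "winner", "prize", "claim"}  # hypothetical rule set

def rule_based_classify(text):
    tokens = set(text.lower().split())
    return "spam" if tokens & SPAM_KEYWORDS else "not_spam"

print(rule_based_classify("Claim your free prize now"))  # spam
print(rule_based_classify("Lunch at noon tomorrow?"))    # not_spam

Rules like these are transparent and fast, but every new pattern has to be written by hand, which is exactly the limitation that machine learning approaches address.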

Python Implementations

  1. ML Approach: Here is an example of text classification using an ML-based approach: a TF-IDF vectorizer combined with a Multinomial Naive Bayes classifier in a scikit-learn Pipeline.
import re
import string
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the labeled training data (expects 'text' and 'target' columns)
train_data = pd.read_csv("train.csv")
train_data.head(2)

# Remove tokens that contain digits, strip punctuation, and lowercase the text
alphanumeric = lambda x: re.sub(r"\w*\d\w*", "", x)
punc_lower = lambda x: re.sub("[%s]" % re.escape(string.punctuation), "", x.lower())
train_data['text'] = train_data['text'].map(alphanumeric).map(punc_lower)

# Hold out 10% of the data as a validation set
train_sent, val_sent, train_label, val_label = train_test_split(
    train_data['text'], train_data['target'], test_size=0.1, random_state=42
)

# TF-IDF features fed into a Multinomial Naive Bayes classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(train_sent, train_label)

# Evaluate on the held-out validation set
prediction = model.predict(val_sent)

print("Accuracy:", accuracy_score(val_label, prediction))

precision, recall, f1_score, _ = precision_recall_fscore_support(
    val_label, prediction, average="weighted"
)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1_score:.3f}")
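
Once the pipeline is trained, classifying new text is a single predict call. The sentences below are hypothetical examples; they are passed through the same alphanumeric and punc_lower cleaning used on the training data, and the predicted labels come from whatever values the 'target' column holds.

# Classify new, unseen sentences with the trained pipeline
new_texts = pd.Series([
    "huge discount click here to claim your free prize",  # hypothetical example
    "see you at the team meeting tomorrow",                # hypothetical example
])
new_texts = new_texts.map(alphanumeric).map(punc_lower)
print(model.predict(new_texts))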

More

NLP – pytechie.com

scikit-learn: machine learning in Python — scikit-learn 1.4.2 documentation
