Introduction
Text classification is an important task in natural language processing (NLP) that involves assigning text to predefined classes or categories. This technique finds applications in spam filtering, sentiment analysis, topic categorization, and more. In this blog, we will explore common approaches and provide practical insights into implementing text classification in Python.
Understanding Text Classification
Text classification involves training a machine learning model to recognize patterns in text data and assign predefined labels. This process typically consists of the following key steps:
- Data Collection and Preparation: Collect a labeled dataset in which each document is associated with its corresponding category. Preprocess the text by cleaning, tokenizing, etc.
- Data Splitting: Split the data into training and test sets.
- Feature Extraction: Convert the text data into a numeric format suitable for ML algorithms. Common methods include the bag-of-words model, TF-IDF, and word embeddings.
- Model Training: Select a suitable machine learning algorithm and train it on the labeled training data.
- Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, and F1 score. Fine-tune the model parameters if necessary.
Approaches to Text Classification
- Rule-Based Classification: Rule-based systems use predefined rules and patterns to classify text. They may lack the ability to capture complex relationships in language.
- Traditional Machine Learning Approaches: Naive Bayes, SVMs, decision trees, random forests, etc.
- Deep Learning Approaches: CNNs, RNNs, Transformers, etc.
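As a rough illustration of the rule-based approach, the sketch below scores a text against a few keyword patterns per label. The labels, the patterns, and the `classify` helper are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical keyword rules: label -> regex patterns that suggest it.
RULES = {
    "spam": [r"\bfree\b", r"\bwinner\b", r"\bclick here\b"],
    "support": [r"\brefund\b", r"\bnot working\b", r"\bhelp\b"],
}


def classify(text, default="other"):
    """Return the label whose patterns match most often, or `default`."""
    text = text.lower()
    scores = {
        label: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for label, patterns in RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(classify("Click here to claim your FREE prize"))       # spam
print(classify("My order is not working, I need a refund"))  # support
print(classify("See you at lunch"))                          # other
```

The limitation mentioned above shows up quickly: `classify("I am free on Friday")` also returns "spam", because the rule sees the word "free" but not its context.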
Python Implementations
- ML Approach: Here is an example of text classification using an ML-based approach, with TF-IDF features feeding a multinomial Naive Bayes classifier.
    import re
    import string

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Load the labeled data; the code below expects "text" and "target" columns.
    train_data = pd.read_csv("train.csv")
    print(train_data.head(2))

    # Drop tokens that contain digits, then lowercase and strip punctuation.
    alphanumeric = lambda x: re.sub(r"\w*\d\w*", "", x)
    punc_lower = lambda x: re.sub("[%s]" % re.escape(string.punctuation), "", x.lower())
    train_data["text"] = train_data["text"].map(alphanumeric).map(punc_lower)

    # Hold out 10% of the data for validation.
    train_sent, val_sent, train_label, val_label = train_test_split(
        train_data["text"], train_data["target"], test_size=0.1, random_state=42
    )

    # TF-IDF features followed by a multinomial Naive Bayes classifier.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", MultinomialNB()),
    ])
    model.fit(train_sent, train_label)

    # Evaluate on the held-out validation set.
    prediction = model.predict(val_sent)
    accuracy = accuracy_score(val_label, prediction)
    precision, recall, f1_score, _ = precision_recall_fscore_support(
        val_label, prediction, average="weighted"
    )
    print(accuracy, precision, recall, f1_score)
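Once a pipeline like this is trained, it can label new, unseen text directly. The sketch below shows that step on a tiny inline dataset so it runs without `train.csv`; the example sentences and the "spam"/"ham" labels are invented, and a dataset this small is only good for demonstrating the API, not for judging model quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented dataset, just enough to fit the pipeline.
texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward",
    "meeting moved to monday",
    "lunch at noon tomorrow",
    "project report due monday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
model.fit(texts, labels)

# Raw strings go in; the pipeline applies TF-IDF before classifying.
new_texts = ["free prize offer", "monday project meeting"]
predicted = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)

for text, label, probs in zip(new_texts, predicted, probabilities):
    print(text, "->", label, dict(zip(model.classes_, probs.round(2))))
```

Because the vectorizer lives inside the pipeline, new text is transformed with the vocabulary learned during training, which avoids accidentally refitting the features on unseen data.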