NLP Preprocessing Techniques

Introduction

Natural Language Processing (NLP) has emerged as a fascinating field at the intersection of computer science, linguistics, and artificial intelligence. NLP techniques enable machines to understand, analyze, and generate human language, offering a myriad of applications, from sentiment analysis and chatbots to machine translation and speech recognition. In this blog, we will learn how to preprocess text and explore the different techniques used to clean text data.

What is Natural Language Processing (NLP)?

Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language. Its primary goal is to enable machines to understand and process human language in a way that is meaningful and useful. NLP techniques bridge the gap between human communication and computational understanding.

The Preprocessing Steps:

Preprocessing involves several crucial steps to clean, structure, and prepare the data for further analysis. Some of the most important preprocessing techniques are as follows:

  • Tokenization: Tokenization is the process of breaking a text into smaller units, or tokens, typically words or subwords. Tokenization allows the NLP model to understand the structure of the text and analyze it at a granular level. Tools like NLTK and spaCy provide tokenization capabilities.
from nltk import word_tokenize, sent_tokenize

# Requires the 'punkt' tokenizer models: nltk.download('punkt')
text = "Hello and welcome to pytechie. Here we will learn nlp preprocessing techniques!"
print(word_tokenize(text))   # word-level tokens
print(sent_tokenize(text))   # sentence-level tokens
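Since the bullet above also mentions spaCy, here is a rough equivalent of the same tokenization in spaCy. This is a sketch that assumes the small English model en_core_web_sm has already been downloaded (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")        # assumes en_core_web_sm is installed
doc = nlp(text)
print([token.text for token in doc])      # word-level tokens
print([sent.text for sent in doc.sents])  # sentence segmentation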
  • Lowercasing: Convert all text data to lowercase. This step reduces the dimensionality of the data and simplifies further processing.
print(text.lower())
  • Stopword Removal: Stopwords are common words that add little meaning to the text. Removing them reduces noise in the data and can improve the efficiency of NLP algorithms.
from nltk.corpus import stopwords

# Requires the stopword list: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Compare lowercased tokens, since NLTK's stopword list is lowercase
print([word for word in word_tokenize(text) if word.lower() not in stop_words])
  • Stemming and Lemmatization: Stemming and lemmatization are techniques used to reduce words to their root or base form. Stemming is a more aggressive, rule-based approach, while lemmatization uses a vocabulary to return a valid dictionary word. For example, “studies” is stemmed to “studi” but lemmatized to “study.”
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
print([porter.stem(word) for word in word_tokenize(text)])

# Requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, 'v') for word in word_tokenize(text)])
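To make the contrast concrete, the snippet below runs both tools on a few hand-picked words (the word list is purely illustrative); the stemmer chops suffixes mechanically, while the lemmatizer returns dictionary base forms:

words = ['studies', 'running', 'leaves']
print([porter.stem(w) for w in words])                # mechanical suffix stripping
print([lemmatizer.lemmatize(w, 'v') for w in words])  # dictionary base forms (as verbs)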
  • Encoding: NLP models typically require input data in a numerical format. Encoding techniques like one-hot encoding or word embeddings are used to convert text into a numerical representation that the model can work with.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sent_tokenize(text))
print(vectorizer.vocabulary_)   # token -> column index mapping
print(vectorizer.transform(sent_tokenize(text)).toarray())
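CountVectorizer above produces raw counts. For the one-hot (presence/absence) style of encoding mentioned in this bullet, one simple option is its binary mode, shown in this sketch; dense word embeddings (e.g., Word2Vec or GloVe) go further and map each word to a learned real-valued vector:

onehot = CountVectorizer(binary=True)   # 1 if the token occurs in the sentence, else 0
print(onehot.fit_transform(sent_tokenize(text)).toarray())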
  • Special character removal: Text data often contains special characters, symbols, and numbers that might not be relevant for analysis. Removing or appropriately handling these elements is vital to improve the quality of the data.
import re, string

# Remove punctuation characters from the text
print(re.sub('[%s]' % re.escape(string.punctuation), '', text))
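Numbers can be handled the same way. A simple regex sketch for stripping digit runs (whether to drop or keep numbers depends on the task):

print(re.sub(r'\d+', '', text))   # remove all digit sequences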
  • POS (Part-Of-Speech) Tagging: POS tagging is a fundamental NLP technique that helps us dissect the structure of language by assigning a grammatical category, or part of speech, to each word in a sentence or text.
from nltk import pos_tag

# Requires the tagger model: nltk.download('averaged_perceptron_tagger')
tags = pos_tag(word_tokenize(text))
print(tags)
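By default, pos_tag uses the Penn Treebank tagset, so tags like NN or VBZ may look cryptic at first. NLTK's built-in help can decode any tag (requires the 'tagsets' data):

import nltk

# nltk.download('tagsets')
nltk.help.upenn_tagset('NN')   # prints the definition and examples for the NN tag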
  • Correct Misspelling: Correct words or sentences that contain spelling mistakes.
from textblob import TextBlob 
tb = TextBlob(text)  
correct_spell = tb.correct() 
print(correct_spell)
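The sample text above is already spelled correctly, so correct() simply returns it unchanged. A string with deliberate typos (purely illustrative) shows the effect better:

print(TextBlob("Natual langage processing").correct())   # prints the corrected string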
  • Chunking (Named Entity Recognition): Named Entity Recognition (NER) is a sub-task of information extraction in Natural Language Processing (NLP) that classifies named entities into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and more.
from nltk import ne_chunk

# Requires: nltk.download('maxent_ne_chunker') and nltk.download('words')
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)   # tree.draw() opens the chunk tree in a GUI window instead
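Putting several of these steps together, a minimal end-to-end cleaning pipeline might look like the sketch below. It reuses the imports and objects defined earlier in this post, and the choice and order of steps is task-dependent, not fixed:

def preprocess(raw):
    # lowercase -> strip punctuation -> tokenize -> drop stopwords -> lemmatize
    cleaned = re.sub('[%s]' % re.escape(string.punctuation), '', raw.lower())
    tokens = word_tokenize(cleaned)
    tokens = [w for w in tokens if w not in stop_words]
    return [lemmatizer.lemmatize(w, 'v') for w in tokens]

print(preprocess(text))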
