Basics : Natural Language Processing with NLTK


Code used here and more examples are available on GitHub

Each language has its own alphabet and set of rules, so NLP has to understand the different language structures in order to represent them accurately by analyzing data with artificial intelligence. This allows us to create a wide range of important and essential applications, for example services dedicated to making appointments, advertisements, chatbots, sentiment analysis, translation, grammar checking, and analyzing social media content.

re module

re.split, re.findall, re.search, re.match

Some useful things you should know before starting with the NLTK library are regular expressions. search = scans the whole string looking for a match. match = requires the string to match, or at least start with, the pattern.

  • \w+ matches a word, e.g. "word"
  • \d+ matches digits, e.g. 9
  • \s+ matches whitespace, e.g. ' '
  • .* is a wildcard that matches any letters/symbols, e.g. [.()]
  • A capital letter negates the class, for example \S = not space.
  • (a-z) = the literal sequence a, -, z ; [A-Za-z]+ = uppercase and lowercase letters ; [0-9] = numbers 0 to 9 ; [A-Za-z-.]+ = uppercase, lowercase, "-" and "." ; (\s+|,) = spaces or a comma
    pattern = r"[.!?]"       # sentence-ending punctuation
    pattern_2 = r"[a-z]\w+"  # lowercase words
    # anything in square brackets
    pattern_3 = r"\[.*\]"
    pattern_4 = r"[A-Z]+:"   # uppercase letters followed by a colon
    re.command(pattern, string)  # where command is split, findall, search or match
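
As a quick illustration, here is a minimal sketch of how re.split, re.findall, re.search and re.match behave with patterns like the ones above (the example string is made up for this purpose):

import re

my_string = "Let's write RegEx! Won't that be fun? I sure think so."

# split on sentence-ending punctuation
print(re.split(r"[.!?]", my_string))
# find all words that start with a capital letter
print(re.findall(r"[A-Z]\w+", my_string))
# search scans the whole string for the first match
print(re.search(r"fun", my_string))
# match only succeeds if the pattern matches at the start of the string
print(re.match(r"Let", my_string))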
    

Natural Language Understanding (NLU)

Mapping the input and analyzing the different aspects of the language to create a useful representation. It means the system is capable of understanding the meaning and intention of a phrase in order to give a response.

Tip to wrap text : pip install textwrap3 ; from textwrap3 import wrap ; x = wrap(text, 30)
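
A small sketch of the wrapping tip (the sample text is made up; textwrap3 mirrors the standard textwrap API, so wrap(text, 30) returns a list of lines of at most 30 characters):

from textwrap3 import wrap

text = ("Natural language processing lets machines read, "
        "understand and derive meaning from human language.")
for line in wrap(text, 30):
    print(line)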

Tokenization

It means splitting a large text into smaller pieces, separating each sentence or word. (word_tokenize = words from a string, sent_tokenize = sentences from a document, regexp_tokenize = tokens based on a regular-expression pattern, TweetTokenizer = handles hashtags and mentions)

Steps to convert strings into tokens:

  1. Break a sentence into words.
  2. Understand the importance of each word.
  3. Produce a structured description.
import nltk

def tokenization():
    # sent_tokenize splits on sentence boundaries (periods), not commas
    text = "This is a sentence to try tokenization, with the library nltk. It separates on periods, not commas"
    tokenized = nltk.sent_tokenize(text)
    print(tokenized)

def wordTokenization():
    # word_tokenize splits a string into individual words and punctuation
    text = "word tokenization, each word will get separated"
    tokens = nltk.word_tokenize(text)
    print(tokens)
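
The other two tokenizers mentioned above can be sketched like this (the example texts and the \w+ pattern are illustrative assumptions):

from nltk.tokenize import regexp_tokenize, TweetTokenizer

# tokenize based on a regular-expression pattern: keep only word characters
print(regexp_tokenize("Separate #hashtags and @mentions too!", r"\w+"))

# TweetTokenizer keeps hashtags and mentions together as single tokens
tweet_tokenizer = TweetTokenizer()
print(tweet_tokenizer.tokenize("Learning #NLP with @nltk_org is fun :)"))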
  • Trigrams : Tokens that contain three consecutive words.
  • Bigrams : Tokens of two consecutive words.
  • Ngrams : Tokens of any number of consecutive words (see the sketch after the code below).
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import bigrams, trigrams, ngrams

text = "Tokens can be grouped into pairs or larger windows"
tokens = nltk.word_tokenize(text)
bigram = list(nltk.bigrams(tokens))
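
Trigrams and ngrams work the same way; a minimal sketch (the sentence is an illustrative assumption, and ngrams takes the window size as a second argument):

import nltk

tokens = nltk.word_tokenize("Tokens can be grouped into pairs or larger windows")
trigram = list(nltk.trigrams(tokens))    # windows of three consecutive words
fourgram = list(nltk.ngrams(tokens, 4))  # windows of four consecutive words
print(trigram)
print(fourgram)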

Stop Words

Stop words are words that do not add much relevant meaning to a sentence, for example the articles, in English (the), or in German (die, der, das). A way to remove these words is the following.

nltk.download('stopwords')

import nltk
from nltk.corpus import stopwords

def stopWords():
    # build the set of English stop words and print only the words not in it
    stops = set(stopwords.words('english'))
    words = ["There","was","a","day","where","i","used","to","study","in","the","park"]
    for word in words:
        if word not in stops:
            print(word)
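
The same idea on a tokenized sentence, written as a list comprehension (the sentence is an illustrative assumption; the tokens are lowercased before comparing, since the stop-word list is lowercase):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "There was a day where I used to study in the park"
stops = set(stopwords.words('english'))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stops]
print(filtered)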

List of words

from nltk.corpus import brown
print(brown.words())

Use a text for example.

nltk.corpus.gutenberg.fileids()
nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')

Tokenize a string and count the frequency of each token.

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text = "The car is red and the other car is blue"
variable = word_tokenize(text)

fdist = FreqDist()
for word in variable:
    fdist[word.lower()] += 1
print(fdist.items())
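
FreqDist also has helpers to inspect the counts; a short sketch continuing the idea above (the sentence is made up):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

fdist = FreqDist(w.lower() for w in word_tokenize("The car is red and the other car is blue"))
print(fdist.most_common(3))  # the three most frequent tokens with their counts
print(fdist['car'])          # frequency of a single token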

Tokenize by blank lines to split a text into paragraphs.

from nltk.tokenize import blankline_tokenize
variable = blankline_tokenize(text)

Stemming

Stemming searches for the root of a word, for example (loudness, louder -> loud)

from nltk.stem import PorterStemmer
pst = PorterStemmer()
words = ["give", "giving", "gave"]
for word in words:
    print(word + ":" + pst.stem(word))
from nltk.stem import LancasterStemmer
pst = LancasterStemmer()
words = ["give", "giving", "gave"]
for word in words:
    print(word + ":" + pst.stem(word))
from nltk.stem import SnowballStemmer
pst = SnowballStemmer('english')
words = ["give", "giving", "gave"]
for word in words:
    print(word + ":" + pst.stem(word))

Lemmatization

Map several words to one (taken, took -> take)

from nltk.stem import WordNetLemmatizer

def lemmatization():
    word_lem = WordNetLemmatizer()
    print(word_lem.lemmatize("corpora"))

word_lem = WordNetLemmatizer()
words_to_stem = ["corpora", "rocks", "geese"]  # example words to lemmatize
for words in words_to_stem:
    print(words + ":" + word_lem.lemmatize(words))
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "The cats were running and the dogs were barking in the parks"
lower_tokens = word_tokenize(text.lower())
english_stops = set(stopwords.words('english'))
alpha_only = [t for t in lower_tokens if t.isalpha()]
no_stops = [t for t in alpha_only if t not in english_stops]
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
bow = Counter(lemmatized)
print(bow.most_common(10))
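
By default lemmatize() treats every word as a noun; passing a part of speech changes the result, which is what makes mappings like (taken, took -> take) work. A minimal sketch:

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("taken"))           # treated as a noun, stays "taken"
print(lem.lemmatize("taken", pos="v"))  # treated as a verb -> "take"
print(lem.lemmatize("took", pos="v"))   # -> "take"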

POS

Parts of Speech are the parts that make up any phrase

For example, the sentence "the guy buys a car" is tagged DET - NOUN - VERB - DET - NOUN.

nltk.help.upenn_tagset("NN")  # prints the description of the NN (noun) tag

import nltk

def pos():
    string = "The car is speeding fast"
    tokens = nltk.word_tokenize(string)

    # tagged_text = nltk.pos_tag(tokens)
    # print(tagged_text)

    # tag each token individually
    for token in tokens:
        print(nltk.pos_tag([token]))
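
Tagging the whole token list at once usually gives better results, since pos_tag can use the surrounding words as context; a short sketch on the example sentence from above (the tags in the comment are the expected output):

import nltk

tokens = nltk.word_tokenize("the guy buys a car")
print(nltk.pos_tag(tokens))
# e.g. [('the', 'DT'), ('guy', 'NN'), ('buys', 'VBZ'), ('a', 'DT'), ('car', 'NN')]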