Advanced Feature Extraction from Text
In the previous article, I discussed basic feature extraction methods like BOW and TFIDF, but these are very sparse in nature. In this tutorial, we will explore word vectors, which give a dense vector for each word. There are many ways to get a dense vector representation for words; below are some of them.
Co-occurrence Matrix and SVD
We can create a co-occurrence matrix from the text and then take a low-rank approximation of that matrix to get a dense feature representation.
To create a co-occurrence matrix, you go through the text setting a window of a fixed size around each word and keep track of which words appear in that window.
Let's create a co-occurrence matrix with the sentences below.

sent_list = ['I like deeplearning.', 'I like NLP.', 'NLP is awesome.']

With a window size of 1, the word 'like' appears in the context of 'i' two times, so that cell gets the count 2. In a similar way, the co-occurrence matrix is filled with the counts of all word pairs (the full matrix is printed by the code below).
Code
I have written a brute-force version of the code below.
import tensorflow as tf
import numpy as np

def cooccurrence_matrix(distance, sentences):
    '''
    Returns the co-occurrence matrix of words within a given distance (window size).
    input:
        distance: distance between words (window size)
        sentences: documents to check (a list)
    output:
        co-occurrence matrix in the order of list_words
        words list
    '''
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(sentences)
    list_words = list(tokenizer.word_index.keys())
    #length of matrix needed
    l = len(list_words)
    #creating a zero matrix
    com = np.zeros((l, l))
    #creating word-to-index dict
    dict_idx = {v: i for i, v in enumerate(list_words)}
    for sentence in sentences:
        sentence = tokenizer.texts_to_sequences([sentence])[0]
        tokens = [tokenizer.index_word[i] for i in sentence]
        for pos, token in enumerate(tokens):
            #if word is in required words
            if token in list_words:
                #start index of the window around the current word
                start = max(0, pos - distance)
                #end index of the window
                end = min(len(tokens), pos + distance + 1)
                for pos2 in range(start, end):
                    #if same position
                    if pos2 == pos:
                        continue
                    #if same word
                    if token == tokens[pos2]:
                        continue
                    #if the word found is in required words
                    if tokens[pos2] in list_words:
                        #index of the center word
                        row = dict_idx[token]
                        #index of the co-occurring word
                        col = dict_idx[tokens[pos2]]
                        #adding the count to that index
                        com[row, col] = com[row, col] + 1
    return com, list_words

coo = cooccurrence_matrix(1, sent_list)
print(coo[1])
print(coo[0])
['i', 'like', 'nlp', 'deeplearning', 'is', 'awesome']
[[0. 2. 0. 0. 0. 0.]
[2. 0. 1. 1. 0. 0.]
[0. 1. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 1.]
[0. 0. 0. 0. 1. 0.]]
Now we can use SVD to get a low-rank approximation of the matrix (this gives a dense representation).
from sklearn.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components=3, n_iter=10, random_state=32)
dense_vector = tsvd.fit_transform(coo[0])
dense_vector
array([[ 1.94649798e+00, 2.73880515e-15, -2.49727487e-01],
[-2.40633313e-15, 2.43040910e+00, -2.56144970e-01],
[ 1.20300191e+00, 1.58665133e-15, 4.04067562e-01],
[ 9.73248989e-01, 1.21830556e-15, -1.24863743e-01],
[ 7.73781546e-16, 5.73741760e-01, 1.08504750e+00],
[ 2.29752921e-01, 3.09635046e-16, 5.28931305e-01]])
print("Vector of ", "'" , coo[1][1], "'", "is ", dense_vector[1])
Vector of ' like ' is [-2.40633313e-15 2.43040910e+00 -2.56144970e-01]
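If you are unsure how many components to keep, the fitted TruncatedSVD exposes the explained variance; this quick check is my addition, not part of the original notebook.

##how much variance the 3 components retain
print(tsvd.explained_variance_ratio_)        ##per-component ratio
print(tsvd.explained_variance_ratio_.sum())  ##total variance retained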
Word2Vec
There are many articles and videos on the mathematics and theory behind Word2Vec, so I will give some links to explore and focus on the code to train a custom Word2Vec model. Please check the resources below.
youtube: https://www.youtube.com/watch?list=PLUOY9Q6mTP21Al_odE-v_lmHDjVMSO9BX&v=SSpSk1Io52w&feature=emb_title
You can read a good blog here
Please watch the above videos or read the above blog before going into the coding part.
Word2Vec using Gensim
We can train word2vec using the gensim module with CBOW or Skip-Gram (with Hierarchical Softmax or Negative Sampling). It is one of the most efficient ways to train word vectors. I am training word vectors using gensim, with IMDB reviews as the corpus. I am not training the best possible word vectors here, only training for 10 iterations.
To train the gensim word2vec module, we can give either a list of sentences or a corpus file in LineSentence format. Here I am creating a list of sentences from my corpus. If you have huge data, please use the LineSentence format to train your word vectors efficiently.
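For reference, here is a minimal sketch of the LineSentence route, assuming the corpus has been saved one sentence per line to a hypothetical file imdb_corpus.txt (my addition, not the notebook's code).

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
##streams sentences from disk instead of holding them all in RAM
w2v_from_file = Word2Vec(LineSentence('imdb_corpus.txt'), size=50, window=4, min_count=1, iter=10)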
##getting sentence wise data
import nltk
##data_imdb is the IMDB reviews dataframe loaded earlier
list_sents = [nltk.word_tokenize(sent) for sent_tok in data_imdb.review for sent in nltk.sent_tokenize(sent_tok)]
Training gensim word2vec as below.
##import gensim Word2Vec
from gensim.models import Word2Vec
##word2vec model ##this may take some time to execute.
word2vec_model = Word2Vec(list_sents,       ##list of sentences; if you don't have all the data in RAM, you can pass a file name to corpus_file instead
                          size=50,          ##output size of the word embedding
                          window=4,         ##window size
                          min_count=1,      ##ignores all words with total frequency lower than this
                          workers=5,        ##number of workers to use
                          sg=1,             ##1 --> skip-gram, 0 --> CBOW
                          hs=0,             ##1 --> hierarchical softmax, 0 --> negative sampling
                          negative=5,       ##how many negative samples
                          alpha=0.03,       ##the initial learning rate
                          min_alpha=0.0001, ##learning rate will linearly drop to min_alpha as training progresses
                          seed=54,          ##random seed
                          iter=10,          ##number of iterations
                          compute_loss=True)
You can get word vectors as below
##getting a word vector
word2vec_model.wv['movie']
You can get most similar positive words for any given word as below
##getting most similar positive words
word2vec_model.wv.most_similar(positive='movie')
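You can also check the cosine similarity between any two words, assuming both made it into the vocabulary.

##cosine similarity between two in-vocabulary words
word2vec_model.wv.similarity('movie', 'film')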
You can save your model as below
##saving the model
word2vec_model.save('w2vmodel/w2vmodel')
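And you can load it back later; loaded_w2v is just a hypothetical name I am using here.

##loading the saved model
from gensim.models import Word2Vec
loaded_w2v = Word2Vec.load('w2vmodel/w2vmodel')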
You can get the complete notebook at the GitHub link below.
github: https://github.com/UdiBhaskar/Natural-Language-Processing/blob/master/Feature%20Extraction%20Methods/Advanced%20feature%20extraction%20-%20W2V/W2V_using_Gensim.ipynb
Word2Vec using Tensorflow ( Skip-Gram, Negative Sampling)
In negative sampling, we take a positive skip-gram pair and, for every positive pair, generate n negative pairs. I used only 10 negative pairs; the paper suggests more for small datasets (roughly 5 to 20). We then train a classifier that differentiates the positive samples from the negative samples, and while doing this we learn the word embeddings. The classifier looks like the image below.
The model takes two inputs, a center word and a context word, and its output is one if those two words occur within the window size, else zero.
Preparing the data
We have to generate the skip-gram pairs and negative samples. We can do that easily using tf.keras.preprocessing.sequence.skipgrams. It also accepts a probability table (sampling table) giving the probability of using each word in negative samples, i.e. we can assign a low probability to the most frequent words and a high probability to the least frequent words while generating negative samples.
All words are converted into number sequences, with indices assigned in descending order of frequency.
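To see what skipgrams actually returns, here is a toy example with a made-up sequence (my addition, not from the original post).

##toy sequence of three word ids
pairs, labels = tf.keras.preprocessing.sequence.skipgrams(sequence=[1, 2, 3],
                                                          vocabulary_size=10,
                                                          window_size=1,
                                                          negative_samples=1.0)
print(pairs)   ##list of [center, context] pairs
print(labels)  ##1 for a real skip-gram pair, 0 for a sampled negative pair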
##to use tf.keras.preprocessing.sequence.skipgrams, we have to encode our sentences to numbers, so we use the Tokenizer class
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(list_sents)
seq_texts = tokenizer.texts_to_sequences(list_sents) ##list of lists
If we create all the samples at once, it may take too much RAM and raise a resource-exhausted error, so I created a generator function which generates the values batch-wise.
##skip-gram with negative sampling generator
##for generating the skip-gram negative samples we can use tf.keras.preprocessing.sequence.skipgrams;
##it internally uses a sampling table, which we can generate with tf.keras.preprocessing.sequence.make_sampling_table
sampling_table_ns = tf.keras.preprocessing.sequence.make_sampling_table(size=len(tokenizer.word_index)+1,
                                                                        sampling_factor=1e-05)
def generate_sgns():
    ##loop through all the sequences
    for seq in seq_texts:
        generated_samples, labels = tf.keras.preprocessing.sequence.skipgrams(sequence=seq,
                                                                              vocabulary_size=len(tokenizer.word_index)+1,
                                                                              window_size=3, negative_samples=10,
                                                                              sampling_table=sampling_table_ns)
        length_samples = len(generated_samples)
        for i in range(length_samples):
            ##center word, context word, label
            yield [generated_samples[i][0]], [generated_samples[i][1]], [labels[i]]

##creating the tf dataset
tfdataset_gen = tf.data.Dataset.from_generator(generate_sgns, output_types=(tf.int64, tf.int64, tf.int64))
tfdataset_gen = tfdataset_gen.repeat().batch(2048).prefetch(tf.data.experimental.AUTOTUNE)
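As a sanity check (my addition), you can pull one batch from the pipeline and look at the shapes.

##each tensor should have shape (2048, 1)
for in_center, in_context, out_label in tfdataset_gen.take(1):
    print(in_center.shape, in_context.shape, out_label.shape)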
Creating Model
import random as rn
from tensorflow.keras.layers import Input, Embedding, Dot, Reshape, Dense
from tensorflow.keras.models import Model

##fixing numpy random seed
np.random.seed(42)
##fixing tensorflow random seed
tf.random.set_seed(32)
##fixing python random seed
rn.seed(12)
tf.keras.backend.clear_session()

##model
def getSGNS():

    center_word_input = Input(shape=(1,), name="center_word_input")
    context_word_input = Input(shape=(1,), name="context_word_input")

    ##I am initializing the embeddings randomly, but you can use predefined embeddings.
    embedd_layer = Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=100,
                             embeddings_initializer=tf.keras.initializers.RandomUniform(seed=45),
                             name="Embedding_layer")

    #center word embedding
    center_wv = embedd_layer(center_word_input)

    #context word embedding
    context_wv = embedd_layer(context_word_input)

    #dot product between the two embeddings
    dot_out = Dot(axes=2, name="dot_between_center_context")([center_wv, context_wv])

    dot_out = Reshape((1,), name="reshaping")(dot_out)

    final_out = Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.glorot_uniform(seed=54),
                      name="output_layer")(dot_out)

    basic_w2v = Model(inputs=[center_word_input, context_word_input], outputs=final_out, name="sgns_w2v")

    return basic_w2v

sgns_w2v = getSGNS()
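It is worth printing the model summary to confirm that the two inputs share a single embedding layer (just a sanity check I am adding).

sgns_w2v.summary()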
Training
##training
##optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)

##train step function
@tf.function
def train_step(input_center, input_context, output_vector, loss_fn):
    with tf.GradientTape() as tape:
        #forward propagation
        output_predicted = sgns_w2v(inputs=[input_center, input_context], training=True)
        #loss
        loss = loss_fn(output_vector, output_predicted)
    #getting gradients
    gradients = tape.gradient(loss, sgns_w2v.trainable_variables)
    #applying gradients
    optimizer.apply_gradients(zip(gradients, sgns_w2v.trainable_variables))
    return loss, gradients

##number of iterations
no_iterations = 100000

##metrics # Even if you use the .fit method, it also calculates batchwise loss/metric and aggregates those.
train_loss = tf.keras.metrics.Mean(name='train_loss')

#tensorboard file writer
wtrain = tf.summary.create_file_writer(logdir='/content/drive/My Drive/word2vec/logs/w2vns/train')

##creating a loss object for this classification problem
loss_function = tf.keras.losses.BinaryCrossentropy(from_logits=False, reduction='auto')

##checkpoint to save the model
checkpoint_path = "/content/drive/My Drive/word2vec/checkpoints/w2vNS/train"
ckpt = tf.train.Checkpoint(optimizer=optimizer, model=sgns_w2v)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

counter = 0
#training loop
for in_center, in_context, out_label in tfdataset_gen:
    #train step
    loss_, gradients = train_step(in_center, in_context, out_label, loss_function)
    #adding loss to train loss
    train_loss(loss_)
    counter = counter + 1

    ##tensorboard logging
    with tf.name_scope('per_step_training'):
        with wtrain.as_default():
            tf.summary.scalar("batch_loss", loss_, step=counter)
    with tf.name_scope("per_batch_gradients"):
        with wtrain.as_default():
            for i in range(len(sgns_w2v.trainable_variables)):
                name_temp = sgns_w2v.trainable_variables[i].name
                tf.summary.histogram(name_temp, gradients[i], step=counter)

    if counter%100 == 0:
        #printing the running loss
        template = '''Done {} iterations, Loss: {:0.6f}'''
        print(template.format(counter, train_loss.result()))
    if counter%200 == 0:
        ckpt_save_path = ckpt_manager.save()
        print('Saving checkpoint for iteration {} at {}'.format(counter+1, ckpt_save_path))
        train_loss.reset_states()
    if counter > no_iterations:
        break
You can check the complete code and results at my GitHub link below.
github: https://github.com/UdiBhaskar/Natural-Language-Processing/blob/master/Feature%20Extraction%20Methods/Advanced%20feature%20extraction%20-%20W2V/W2V_Tensorflow_Negative_Sampling.ipynb
I saved the model in the gensim Word2Vec format and loaded it back as below (save_word2vec_format_dict is a helper from the notebook).

save_word2vec_format_dict(binary=True, fname='w2vns.bin', total_vec=len(word_vectors_dict), vocab=model_gensim.vocab, vectors=model_gensim.vectors)
model_gensim = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('w2vns.bin', binary=True)
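save_word2vec_format_dict and word_vectors_dict are built in the notebook; as a rough sketch, and only my assumption of one way to build such a dictionary, the learned vectors can be pulled out of the trained Keras model like this.

##assumption: rebuild a word -> vector dict from the trained embedding layer
embedding_weights = sgns_w2v.get_layer("Embedding_layer").get_weights()[0]
word_vectors_dict = {word: embedding_weights[idx] for word, idx in tokenizer.word_index.items()}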
Important: Negative Sampling is a simplified version of Noise Contrastive Estimation (NCE). NCE guarantees an approximation to the softmax; Negative Sampling does not. You can read more about this in the paper/blog.
Word2Vec using Tensorflow (Skip-Gram, NCE)
Let's take a model which gives a score to each skip-gram pair. We will try to maximize the difference between the scores of the positive pairs and the scores of the negative pairs for each word. We can do that directly by optimizing tf.nn.nce_loss. Please read its documentation: it takes a positive pair and the weight vectors, generates negative pairs based on sampled_values, and returns the loss.
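Before wiring it into a model, here is a toy sketch of a direct tf.nn.nce_loss call with hypothetical shapes (my addition, not from the article's notebook).

batch, dim, vocab = 4, 100, 5000
weights = tf.random.normal((vocab, dim))                     ##one weight vector per word
biases = tf.zeros((vocab,))                                  ##one bias per word
labels = tf.constant([[1], [5], [9], [20]], dtype=tf.int64)  ##context word ids, shape (batch, 1)
inputs = tf.random.normal((batch, dim))                      ##center word embeddings
loss = tf.nn.nce_loss(weights=weights, biases=biases, labels=labels,
                      inputs=inputs, num_sampled=10, num_classes=vocab)
##loss has shape (batch,); reduce it (e.g. tf.reduce_sum) before backpropagating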
Preparing the Data
We have to generate positive skip-gram pairs; we can do that in a similar way as above. I created a pipeline to generate batch-wise data as below.
##getting sentence wise data
list_sents = [nltk.word_tokenize(sent) for sent_tok in data_imdb.review for sent in nltk.sent_tokenize(sent_tok)]
##to use tf.keras.preprocessing.sequence.skipgrams, we have to encode our sentences to numbers, so we use the Tokenizer class
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(list_sents)
seq_texts = tokenizer.texts_to_sequences(list_sents) ##list of lists

def generate_sgns():
    for seq in seq_texts:
        generated_samples, labels = tf.keras.preprocessing.sequence.skipgrams(sequence=seq,
                                                                              vocabulary_size=len(tokenizer.word_index)+1,
                                                                              window_size=2, negative_samples=0)
        length_samples = len(generated_samples)
        for i in range(length_samples):
            yield [generated_samples[i][0]], [generated_samples[i][1]]

##creating the tf dataset
tfdataset_gen = tf.data.Dataset.from_generator(generate_sgns, output_types=(tf.int64, tf.int64))
tfdataset_gen = tfdataset_gen.repeat().batch(1024).prefetch(tf.data.experimental.AUTOTUNE)
Creating Model
I created a model, word2vecNCS, which takes a center word and a context word and gives the NCE loss. You can check that below.
class word2vecNCS(Model):
    def __init__(self, vocab_size, embed_size, num_sampled, **kwargs):
        '''NCS Word2Vec
        vocab_size: size of the vocabulary you have
        embed_size: embedding size needed
        num_sampled: number of negative samples to generate'''
        super(word2vecNCS, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.num_sampled = num_sampled
        ##embedding layer
        self.embed_layer = Embedding(input_dim=vocab_size, output_dim=embed_size,
                                     embeddings_initializer=tf.keras.initializers.RandomUniform(seed=32))
        ##reshaping layer
        self.reshape_layer = Reshape((self.embed_size,))
    def build(self, input_shape):
        ##weights needed for nce loss
        self.nce_weight = self.add_weight(shape=(self.vocab_size, self.embed_size),
                                          initializer=tf.keras.initializers.TruncatedNormal(mean=0, stddev=(1/self.embed_size**0.5)),
                                          trainable=True, name="nce_weight")
        ##biases needed for nce loss
        self.nce_bias = self.add_weight(shape=(self.vocab_size,), initializer="zeros", trainable=True, name="nce_bias")
    def call(self, input_center_word, input_context_word):
        '''
        input_center_word: center word
        input_context_word: context word'''
        ##giving the center word and getting its embedding
        embedd_out = self.embed_layer(input_center_word)
        ##reshaping
        embedd_out = self.reshape_layer(embedd_out)
        ##calculating nce loss
        nce_loss = tf.reduce_sum(tf.nn.nce_loss(weights=self.nce_weight,
                                                biases=self.nce_bias,
                                                labels=input_context_word,
                                                inputs=embedd_out,
                                                num_sampled=self.num_sampled,
                                                num_classes=self.vocab_size))
        return nce_loss
Training
##training
##optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)

sgncs_w2v = word2vecNCS(len(tokenizer.word_index)+1, 100, 32, name="w2vNCE")

##train step function
@tf.function
def train_step(input_center, input_context):
    with tf.GradientTape() as tape:
        #forward propagation (the model directly returns the NCE loss)
        loss = sgncs_w2v(input_center, input_context)
    #getting gradients
    gradients = tape.gradient(loss, sgncs_w2v.trainable_variables)
    #applying gradients
    optimizer.apply_gradients(zip(gradients, sgncs_w2v.trainable_variables))
    return loss, gradients

##number of iterations
no_iterations = 10000

##metrics # Even if you use the .fit method, it also calculates batchwise loss/metric and aggregates those.
train_loss = tf.keras.metrics.Mean(name='train_loss')

#tensorboard file writer
wtrain = tf.summary.create_file_writer(logdir='/content/drive/My Drive/word2vec/logs/w2vncs/train')

##checkpoint to save the model
checkpoint_path = "/content/drive/My Drive/word2vec/checkpoints/w2vNCS/train"
ckpt = tf.train.Checkpoint(optimizer=optimizer, model=sgncs_w2v)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

counter = 0
#training loop
for in_center, in_context in tfdataset_gen:
    #train step
    loss_, gradients = train_step(in_center, in_context)
    #adding loss to train loss
    train_loss(loss_)
    counter = counter + 1

    ##tensorboard logging
    with tf.name_scope('per_step_training'):
        with wtrain.as_default():
            tf.summary.scalar("batch_loss", loss_, step=counter)
    with tf.name_scope("per_batch_gradients"):
        with wtrain.as_default():
            for i in range(len(sgncs_w2v.trainable_variables)):
                name_temp = sgncs_w2v.trainable_variables[i].name
                tf.summary.histogram(name_temp, gradients[i], step=counter)

    if counter%100 == 0:
        #printing the running loss
        template = '''Done {} iterations, Loss: {:0.6f}'''
        print(template.format(counter, train_loss.result()))
    if counter%200 == 0:
        ckpt_save_path = ckpt_manager.save()
        print('Saving checkpoint for iteration {} at {}'.format(counter+1, ckpt_save_path))
        train_loss.reset_states()
    if counter > no_iterations:
        break
You can check the complete code and results at my GitHub link below.
github: https://github.com/UdiBhaskar/Natural-Language-Processing/blob/master/Feature%20Extraction%20Methods/Advanced%20feature%20extraction%20-%20W2V/W2V_Tensorflow_NCE.ipynb
Fast-text Embedding (Sub-Word Embedding)
Instead of feeding individual words into the neural network, FastText breaks words into several character n-grams (sub-words). For instance, the tri-grams for the word where are <wh, whe, her, ere, re> and the special sequence <where>. Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her taken from the word where. Because of these subwords, we can get an embedding for any word, even a misspelled one. Try to read this paper.
We can train these vectors using either gensim or the official fastText implementation. I trained fastText word embeddings with gensim; you can check that below. It's a single line of code, similar to Word2Vec.
##FastText module
from gensim.models import FastText

gensim_fasttext = FastText(sentences=list_sents,
                           sg=1,         ##skipgram
                           hs=0,         ##negative sampling
                           min_count=4,  ##min count of any vocab word
                           negative=10,  ##no of negative samples
                           iter=15,      ##no of iterations
                           size=100,     ##dimensions of the word vectors
                           window=3,     ##window size to get the skipgrams
                           seed=34)
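Because of the sub-word n-grams, the trained model can return a vector even for a word it never saw; this small check is my addition, with a hypothetical misspelling.

print(gensim_fasttext.wv['awesomee'])          ##vector built from the character n-grams of the misspelled word
print('awesomee' in gensim_fasttext.wv.vocab)  ##False: the word itself is not in the vocabulary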
You can get the complete code at the GitHub link below.
github: https://github.com/UdiBhaskar/Natural-Language-Processing/blob/master/Feature%20Extraction%20Methods/Advanced%20feature%20extraction%20-%20W2V/fasttext_Training.ipynb
Pre-Trained Word Embedding
We can use pre-trained word embeddings trained on huge datasets by Google, Stanford NLP, and Facebook.
Google Word2Vec
You can download Google's pretrained word vectors, trained on Google News data, from this link. You can load the vectors as a gensim model as below.
import gensim
googlew2v_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
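A quick sanity check on the loaded vectors:

##most similar words to 'king' in the GoogleNews vectors
googlew2v_model.most_similar('king', topn=5)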
GloVe Pretrained Embeddings
You can download the GloVe embeddings from this link. There are some differences between the Google Word2Vec save format and the GloVe save format. We can convert the GloVe format to the Google format and then load it using gensim as below.
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec(glove_input_file="glove.42B.300d.txt", word2vec_output_file="w2vstyle_glove_vectors.txt")

glove_model = gensim.models.KeyedVectors.load_word2vec_format("w2vstyle_glove_vectors.txt", binary=False)
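If you do not want to download and convert the file manually, gensim also ships a downloader API; this alternative (with a smaller GloVe model and a hypothetical variable name) is my addition.

import gensim.downloader as api
##downloads the model on first use and returns a KeyedVectors instance
glove_model_small = api.load("glove-wiki-gigaword-100")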
FastText Pretrained Embeddings
You can get the fastText word embeddings from this link. You can use the fastText python API or gensim to load the model. I am using gensim.
from gensim.models import FastText

fasttext_model = FastText.load_fasttext_format("/content/cc.en.300.bin")
References:
- gensim documentation
- https://fasttext.cc/
- CS7015 - IIT Madras
- https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
- https://arxiv.org/abs/1410.8251
- https://ruder.io/word-embeddings-softmax/