Incremental Word2Vec¶
Word2Vec¶
Word2Vec is a popular and widely-used word embedding model that is trained on large amounts of text data to produce dense and continuous vector representations for words. These word vectors capture semantic and syntactic relationships between words and can be used for various natural language processing tasks, such as document classification, sentiment analysis, and machine translation.
Word2Vec operates by defining a neural network architecture that takes as input a target word and the context words that appear near it. The model trains on large amounts of text data to learn the relationships between words and their contexts, and uses this knowledge to generate continuous vector representations for each word in the vocabulary. These vectors are typically lower-dimensional than the high-dimensional one-hot encodings used to represent words in traditional NLP models, making them more efficient to work with and easier to interpret.
There are two popular variants of Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram. The choice of which model to use depends on the specific task at hand and the data being used.
SkipGram¶
SkipGram is a popular word embedding model introduced by Mikolov et al. in 2013. The model predicts the context words given a target word in a text corpus. SkipGram aims to learn the representation of words that can capture the semantic and syntactic relationships between words.
The model defines a context window around each target word and uses the target word to predict the words that fall inside that window. A single neural network takes the target word as input and produces scores for the surrounding context words as output. During training, the model is optimized to maximize the probability of the observed context words given the target word.
The learned word representations are used as the input to other NLP tasks, such as document classification, sentiment analysis, and machine translation. SkipGram has become a popular method for learning word representations because of its simplicity and ability to capture the semantic and syntactic relationships between words.
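As a rough, library-independent illustration, the following sketch shows how Skip-gram derives its (target, context) training pairs from a tokenized sentence using a symmetric window (the function name and window size are arbitrary):
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs the way Skip-gram consumes them."""
    pairs = []
    for i, target in enumerate(tokens):
        # every word within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]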
CBOW¶
The Continuous Bag of Words (CBOW) is a popular word embedding model in natural language processing. It is used to predict the target word based on its surrounding context words in a given sentence.
The model is trained on a large corpus of text data and uses a sliding window approach to construct the context around a target word. The context words are represented as one-hot vectors and are used as input to a neural network. The objective of the model is to predict the target word from these context words; the weights learned in the process yield the word embeddings.
CBOW is computationally efficient compared to other models such as Skip-Gram, as it predicts the target word from its context, rather than predicting the context from the target word. This makes it faster to train, especially when working with large corpora. The quality of the embeddings produced by CBOW can be further improved by considering more context words and adjusting the model hyperparameters.
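For contrast, here is a minimal sketch (again, not part of the library) of how CBOW forms its training examples, pairing the bag of surrounding words with the single target word to be predicted:
def cbow_examples(tokens, window=2):
    """Generate (context_words, target) examples the way CBOW consumes them."""
    examples = []
    for i, target in enumerate(tokens):
        # the input is the bag of words around position i, the output is tokens[i]
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, target))
    return examples

cbow_examples(["the", "cat", "sat", "on", "the", "mat"])
# [(['cat', 'sat'], 'the'), (['the', 'sat', 'on'], 'cat'), ...]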
Negative Sampling¶
The Negative Sampling technique is a method for training word embedding models proposed by Mikolov et al. in their seminal work on word2vec. Negative sampling aims to make the model's training process more efficient by only updating the parameters for a small set of negative samples instead of computing the full softmax over all words in the vocabulary.
In the negative sampling technique, each training instance comprises a target word and a set of context words, and the goal is to predict the probability of the target word given the context words. Rather than computing the softmax over the entire vocabulary, negative sampling randomly selects a small set of negative samples, i.e. words that are not the target word. The model is then trained to maximize the probability of the target word given the context words and to minimize the probability of the negative samples.
By focusing on a smaller set of negative samples, the negative sampling technique can significantly reduce the computational cost of training the model. This, in turn, allows for larger training datasets and more complex models, leading to more accurate word embeddings. However, despite its efficiency, the negative sampling technique does have some limitations, such as a potential loss of information about words that are not selected as negative samples.
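The sketch below illustrates the objective optimized for a single (target, context) pair under negative sampling; the toy embedding matrix, unigram table, and function name are made up for the example and are not the library's API:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, neg_vecs):
    """Loss for one observed pair plus k sampled negatives."""
    # push the score of the observed (target, context) pair up ...
    loss = -np.log(sigmoid(np.dot(target_vec, context_vec)))
    # ... and push the scores of the k negative samples down
    for neg in neg_vecs:
        loss -= np.log(sigmoid(-np.dot(target_vec, neg)))
    return loss

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 5))                  # toy embeddings: 10 words, 5 dimensions
unigram_table = rng.integers(0, 10, size=1000)  # frequent words would occupy more slots
negatives = emb[rng.choice(unigram_table, size=3)]  # draw k = 3 negative samples
negative_sampling_loss(emb[0], emb[1], negatives)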
Incremental Word2Vec¶
Our implementation is based on the Skip-gram model with Negative Sampling, as proposed by Kaji and Kobayashi. It replaces the traditional unigram table, typically built as a static word array, with an incremental one: instead of performing multiple passes over the entire dataset to build the unigram table, the model updates the table incrementally, making the process more efficient and scalable.
In Algorithm 3, we present the adaptive unigram table proposed by Kaji and Kobayashi. The algorithm receives a fixed-size unigram table $T$ with a capacity of $size\_T$, an array $Freqs$ containing the frequencies of the words in the vocabulary, a tuple of word indexes representing a tweet, and a smoothing parameter $\alpha$. For each word index in the tweet, it proceeds as follows (a rough code sketch is given after the list):
- If the number of elements in $T$, denoted $|T|$, is less than $size\_T$, then $F$ copies of the word index are added to $T$, where the word index is the integer that the vocabulary maps to the current word.
- Otherwise, the number of copies of the word index added to $T$ is calculated as $\frac{size\_T \cdot F}{z}$, and the new additions to $T$ may overwrite current entries with a probability proportional to $\frac{F}{z}$.
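A rough, non-official sketch of this update for a single incoming word index is shown below. Here $F$ stands for the word's (already $\alpha$-smoothed) frequency contribution and $z$ for the running sum of all contributions; both are assumed to be maintained by the caller, and all names are illustrative:
import random

def update_unigram_table(T, size_T, w, F, z):
    """Add copies of word index `w` to the adaptive unigram table `T`."""
    if len(T) < size_T:
        # table not yet full: append F copies of the word index,
        # never exceeding the capacity size_T
        n = min(int(round(F)), size_T - len(T))
        T.extend([w] * n)
    else:
        # table full: add about size_T * F / z copies, each overwriting a
        # uniformly chosen slot, so existing entries are replaced with
        # probability proportional to F / z
        n = int(round(size_T * F / z))
        for _ in range(n):
            T[random.randrange(size_T)] = w
Sampling a negative word then amounts to reading a uniformly random position of the table, which approximates drawing from the smoothed unigram distribution without re-scanning the data.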
Import Libraries¶
from rivertext.utils import TweetStream
from torch.utils.data import DataLoader
from rivertext.models import IWord2Vec
Incremental SkipGram¶
Load the Stream¶
ts1 = TweetStream("tweets.txt")
dataloader1 = DataLoader(ts1, batch_size=32)
Define the model¶
isg = IWord2Vec(
    vocab_size=1_000_000,
    unigram_table_size=100_000_000,
    window_size=3,
    sg=1,  # 1 selects the Skip-gram architecture
    neg_samples_sum=3,
    emb_size=100,
    device="cuda:0"
)
Training loop¶
for tweets in dataloader1:
    isg.learn_many(tweets)
Getting the embeddings¶
embs1 = isg.vocab2dict()  # dict mapping each word in the current vocabulary to its embedding vector
embs1['is']
Incremental CBOW¶
Load the Stream¶
ts2 = TweetStream("tweets.txt")
dataloader2 = DataLoader(ts2, batch_size=32)
Define the model¶
icbow = IWord2Vec(
    vocab_size=1_000_000,
    unigram_table_size=100_000_000,
    window_size=3,
    neg_samples_sum=3,
    emb_size=100,
    device="cuda:1"
)
Training loop¶
for tweets in dataloader2:
    icbow.learn_many(tweets)
Getting the embeddings¶
embs2 = icbow.vocab2dict()
embs2['is']