word2vec

Implementation of the Incremental SkipGram and CBOW algorithms.

`IWord2Vec`

Bases: IWVBase

The incremental Word2Vec architectures adapt the popular word2vec models proposed by Mikolov et al. to the streaming scenario. To do so, we rely on the Incremental SkipGram with Negative Sampling model proposed by Kaji et al. The main assumptions we consider are:

1. The vocabulary is dynamic and not known in advance, so the model's structures are updated as training proceeds.
2. The unigram table is built incrementally using the algorithm proposed by Kaji et al.
3. The internal structure of the architecture is implemented in PyTorch.

In this package, both CBOW and SG models were adapted using the incremental negative sampling technique to accelerate their training speed.
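
To make the idea of an incrementally built unigram table concrete, here is a minimal sketch loosely modelled on the adaptive table described by Kaji and Kobayashi. It is not rivertext's implementation: the class name `IncrementalUnigramTable`, its methods, and the stochastic rounding of fractional copy counts are assumptions made for illustration.

```python
import random
from collections import Counter


class IncrementalUnigramTable:
    """Bounded table whose entries roughly follow count(w) ** alpha (hypothetical helper)."""

    def __init__(self, max_size, alpha=0.75, seed=0):
        self.max_size = max_size
        self.alpha = alpha
        self.counts = Counter()
        self.table = []
        self.z = 0.0  # running sum of count(w) ** alpha over all words seen
        self.rng = random.Random(seed)

    def _stochastic_round(self, value):
        # Assumption: round up with probability equal to the fractional part.
        base = int(value)
        return base + (self.rng.random() < value - base)

    def update(self, word):
        self.counts[word] += 1
        c = self.counts[word]
        # Increase in the word's smoothed mass caused by this occurrence.
        gain = c ** self.alpha - (c - 1) ** self.alpha
        self.z += gain
        if len(self.table) < self.max_size:
            # Table not full yet: append (about) `gain` copies of the word.
            copies = self._stochastic_round(gain)
            room = self.max_size - len(self.table)
            self.table.extend([word] * min(copies, room))
        else:
            # Table full: overwrite each slot with probability gain / z.
            p = gain / self.z
            for i in range(self.max_size):
                if self.rng.random() < p:
                    self.table[i] = word

    def sample(self, k):
        # Draw k negative samples uniformly from the table.
        return [self.rng.choice(self.table) for _ in range(k)]


table = IncrementalUnigramTable(max_size=50)
for w in "hello are you this is an example hello you".split():
    table.update(w)
negatives = table.sample(3)
```

Because the table never grows beyond `max_size`, memory stays bounded however long the stream runs, while frequent words still occupy more slots than rare ones.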

References

Kaji, N., & Kobayashi, H. (2017). Incremental skip-gram model with negative sampling. arXiv preprint arXiv:1704.03956.
Montiel, J., Halford, M., Mastelini, S. M., Bolmier, G., Sourty, R., Vaysse, R., ... & Bifet, A. (2021). River: machine learning for streaming data in Python.

Examples:

>>> from torch.utils.data import DataLoader
>>> from rivertext.models.iw2v import IWord2Vec
>>> from rivertext.utils import TweetStream
>>> ts = TweetStream("/path/to/tweets.txt")
>>> dataloader = DataLoader(ts, batch_size=32)
>>> iw2v = IWord2Vec(
...    window_size=3,
...    vocab_size=3,
...    emb_size=3,
...    sg=0,
...    neg_samples_sum=1,
...    device="cuda:0"
... )
>>> for batch in dataloader:
...    iw2v.learn_many(batch)
>>> iw2v.vocab2dict()
{'hello': [0.77816248, 0.99913448, 0.14790398],
'are': [0.86127345, 0.24901696, 0.28613529],
'you': [0.64463917, 0.9003653 , 0.26000987],
'this': [0.97007572, 0.08310498, 0.61532574],
'example':  [0.74144294, 0.77877194, 0.67438642]
}
>>> iw2v.transform_one('hello')
[0.77816248, 0.99913448, 0.14790398]

Source code in rivertext/models/iw2v.py

class IWord2Vec(IWVBase):
    """The incremental Word2Vec architectures adapt the popular word2vec models
    proposed by Mikolov et al. to the streaming scenario. To do so, we rely on the
    Incremental SkipGram with Negative Sampling model proposed by Kaji et al. The
    main assumptions we consider are:

    1. The vocabulary is dynamic and not known in advance, so the model's
        structures are updated as training proceeds.
    2. The unigram table is built incrementally using the algorithm proposed by
        Kaji et al.
    3. The internal structure of the architecture is implemented in PyTorch.

    In this package, both CBOW and SG models were adapted using the incremental
    negative sampling technique to accelerate their training speed.

    References:
        1. Kaji, N., & Kobayashi, H. (2017). Incremental skip-gram model with negative
            sampling. arXiv preprint arXiv:1704.03956.
        2. Montiel, J., Halford, M., Mastelini, S. M., Bolmier, G., Sourty, R., Vaysse,
            R., ... & Bifet, A. (2021). River: machine learning for streaming data in
            Python.
    Examples:
        >>> from torch.utils.data import DataLoader
        >>> from rivertext.models.iw2v import IWord2Vec
        >>> from rivertext.utils import TweetStream
        >>> ts = TweetStream("/path/to/tweets.txt")
        >>> dataloader = DataLoader(ts, batch_size=32)
        >>> iw2v = IWord2Vec(
        ...    window_size=3,
        ...    vocab_size=3,
        ...    emb_size=3,
        ...    sg=0,
        ...    neg_samples_sum=1,
        ...    device="cuda:0"
        ... )
        >>> for batch in dataloader:
        ...    iw2v.learn_many(batch)
        >>> iw2v.vocab2dict()
        {'hello': [0.77816248, 0.99913448, 0.14790398],
        'are': [0.86127345, 0.24901696, 0.28613529],
        'you': [0.64463917, 0.9003653 , 0.26000987],
        'this': [0.97007572, 0.08310498, 0.61532574],
        'example':  [0.74144294, 0.77877194, 0.67438642]
        }
        >>> iw2v.transform_one('hello')
        [0.77816248, 0.99913448, 0.14790398]

    """

    def __init__(
        self,
        vocab_size: int = 1_000_000,
        emb_size: int = 100,
        unigram_table_size: int = 100_000_000,
        window_size: int = 5,
        alpha: float = 0.75,
        subsampling_threshold: float = 1e-3,
        neg_samples_sum: int = 10,
        sg: int = 1,
        lr: float = 0.025,
        device: str = None,
        optimizer: Optimizer = SparseAdam,
        on: str = None,
        strip_accents: bool = True,
        lowercase: bool = True,
        preprocessor=None,
        tokenizer: Callable[[str], List[str]] = None,
        ngram_range: Tuple[int, int] = (1, 1),
    ):
        """An instance of IWord2Vec class.

        Args:
            vocab_size: Vocabulary size, by default 1_000_000.
            emb_size: Embedding size, by default 100.
            unigram_table_size: Unigram table size, by default 100_000_000.
            window_size: Window size, by default 5.
            alpha: Smoothing exponent applied to word counts in the unigram table,
                by default 0.75.
            subsampling_threshold: Subsampling threshold, by default 1e-3.
            neg_samples_sum: Number of negative samples to use, by default 10.
            sg: Training algorithm: 1 for SG; otherwise CBOW. By default 1.
            lr: Learning rate of the optimizer, by default 0.025. In the code below
                it is only used when `sg` is truthy; the CBOW branch passes a fixed
                `lr=0.05`.
            device: Device to run the wrapped model on, e.g. "cpu" or "cuda", by
                default None.
            optimizer: Optimizer class used to train the model, by default
                SparseAdam.
            on: The name of the feature that contains the text to vectorize. If `None`,
                then each `learn_one` and `transform_one` should treat `x` as a `str`
                and not as a `dict`. By default None.
            strip_accents: Whether or not to strip accent characters, by default True.
            lowercase: Whether or not to convert all characters to lowercase, by
                default True.
            preprocessor: An optional preprocessing function which overrides the
                `strip_accents` and `lowercase` steps, while preserving the tokenizing
                and n-grams generation steps. By default None.
            tokenizer: A function used to convert preprocessed text into a list of
                tokens. A default tokenizer is used if `None` is passed. Set to `False`
                to disable tokenization. By default None.
            ngram_range: The lower and upper boundary of the range of n-grams to be
                extracted. All values of n such that `min_n <= n <= max_n` will be used.
                For example, an `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)`
                means unigrams and bigrams, and `(2, 2)` means only bigrams. By default
                (1, 1).

        """

        super().__init__(
            vocab_size,
            emb_size,
            window_size,
            on=on,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            ngram_range=ngram_range,
        )

        self.neg_sample_num = neg_samples_sum
        self.sg = sg

        if sg:
            self.model_name = "ISG"
            self.model = SG(self.vocab_size, emb_size)
            self.prep = PrepSG(
                vocab_size=vocab_size,
                unigram_table_size=unigram_table_size,
                window_size=window_size,
                alpha=alpha,
                subsampling_threshold=subsampling_threshold,
                neg_samples_sum=neg_samples_sum,
                tokenizer=tokenizer,
            )
            self.optimizer = optimizer(self.model.parameters(), lr=lr)

        else:
            self.model_name = "ICBOW"
            self.model = CBOW(vocab_size, emb_size)
            self.prep = PrepCbow(
                vocab_size=vocab_size,
                unigram_table_size=unigram_table_size,
                window_size=window_size,
                alpha=alpha,
                subsampling_threshold=subsampling_threshold,
                neg_samples_sum=neg_samples_sum,
                tokenizer=tokenizer,
            )
            self.optimizer = optimizer(self.model.parameters(), lr=0.05)
        self.device = device
        self.model.to(self.device)

    def vocab2dict(self) -> Dict[str, np.ndarray]:
        """Converts the vocabulary into a dictionary of embeddings.

        Returns:
            A dict whose keys are the words and whose values are their
                embedding vectors.
        """
        embeddings = {}
        for word in tqdm(self.prep.vocab.word2idx.keys()):
            embeddings[word] = self.transform_one(word)
        return embeddings

    def transform_one(self, x: str) -> np.ndarray:
        """Obtain the vector embedding of a word.

        Args:
            x: Word whose embedding is requested.

        Returns:
            The vector embedding of the word.
        """
        word_idx = self.prep.vocab[x]
        return self.model.get_embedding(word_idx)

    def learn_one(self, x: str, **kwargs) -> None:
        """Train on a single text instance.

        Args:
            x: One line of text.

        Examples:
            >>> from torch.utils.data import DataLoader
            >>> from rivertext.models.iw2v import IWord2Vec
            >>> from rivertext.utils import TweetStream
            >>> ts = TweetStream("/path/to/tweets.txt")
            >>> dataloader = DataLoader(ts)
            >>> iw2v = IWord2Vec(
            ...    window_size=3,
            ...    vocab_size=3,
            ...    emb_size=3,
            ...    sg=0,
            ...    neg_samples_sum=1,
            ...    device="cuda:0"
            ... )
            >>> for tweet in dataloader:
            ...    iw2v.learn_one(tweet)
            >>> iw2v.vocab2dict()
            {'hello': [0.77816248, 0.99913448, 0.14790398],
            'are': [0.86127345, 0.24901696, 0.28613529],
            'you': [0.64463917, 0.9003653 , 0.26000987],
            'this': [0.97007572, 0.08310498, 0.61532574],
            'example':  [0.74144294, 0.77877194, 0.67438642]
            }
            >>> iw2v.transform_one('hello')
            [0.77816248, 0.99913448, 0.14790398]

        """
        tokens = self.process_text(x[0])
        batch = self.prep(tokens)
        targets = batch[0].to(self.device)
        contexts = batch[1].to(self.device)
        neg_samples = batch[2].to(self.device)

        self.optimizer.zero_grad()
        loss = self.model(targets, contexts, neg_samples)
        loss.backward()
        self.optimizer.step()

    def learn_many(self, X: List[str], y=None, **kwargs) -> None:
        """Train on a mini-batch of text instances.

        Args:
            X: A list of sentences.
            y: A series of target values, by default None.

        Examples:
            >>> from torch.utils.data import DataLoader
            >>> from rivertext.models.iw2v import IWord2Vec
            >>> from rivertext.utils import TweetStream
            >>> ts = TweetStream("/path/to/tweets.txt")
            >>> dataloader = DataLoader(ts, batch_size=32)
            >>> iw2v = IWord2Vec(
            ...    window_size=3,
            ...    vocab_size=3,
            ...    emb_size=3,
            ...    sg=0,
            ...    neg_samples_sum=1,
            ...    device="cuda:0"
            ... )
            >>> for batch in dataloader:
            ...    iw2v.learn_many(batch)
            >>> iw2v.vocab2dict()
            {'hello': [0.77816248, 0.99913448, 0.14790398],
            'are': [0.86127345, 0.24901696, 0.28613529],
            'you': [0.64463917, 0.9003653 , 0.26000987],
            'this': [0.97007572, 0.08310498, 0.61532574],
            'example':  [0.74144294, 0.77877194, 0.67438642]
            }
            >>> iw2v.transform_one('hello')
            [0.77816248, 0.99913448, 0.14790398]
        """

        tokens = list(map(self.process_text, X))
        batch = self.prep(tokens)
        targets = batch[0].to(self.device)
        contexts = batch[1].to(self.device)
        neg_samples = batch[2].to(self.device)

        self.optimizer.zero_grad()
        loss = self.model(targets, contexts, neg_samples)
        loss.backward()
        self.optimizer.step()

`__init__(vocab_size=1000000, emb_size=100, unigram_table_size=100000000, window_size=5, alpha=0.75, subsampling_threshold=0.001, neg_samples_sum=10, sg=1, lr=0.025, device=None, optimizer=SparseAdam, on=None, strip_accents=True, lowercase=True, preprocessor=None, tokenizer=None, ngram_range=(1, 1))`

An instance of the IWord2Vec class.

Parameters:

Name	Type	Description	Default
`vocab_size`	`int`	Vocabulary size.	`1000000`
`emb_size`	`int`	Embedding size.	`100`
`unigram_table_size`	`int`	Unigram table size.	`100000000`
`window_size`	`int`	Window size.	`5`
`alpha`	`float`	Smoothing exponent applied to word counts in the unigram table.	`0.75`
`subsampling_threshold`	`float`	Subsampling threshold.	`0.001`
`neg_samples_sum`	`int`	Number of negative samples to use.	`10`
`sg`	`int`	Training algorithm: 1 for SG; otherwise CBOW.	`1`
`lr`	`float`	Learning rate of the optimizer. In the source shown below it is only used when `sg` is truthy; the CBOW branch passes a fixed `lr=0.05`.	`0.025`
`device`	`str`	Device to run the wrapped model on, e.g. "cpu" or "cuda".	`None`
`optimizer`	`Optimizer`	Optimizer class used to train the model.	`SparseAdam`
`on`	`str`	The name of the feature that contains the text to vectorize. If `None`, then each `learn_one` and `transform_one` should treat `x` as a `str` and not as a `dict`.	`None`
`strip_accents`	`bool`	Whether or not to strip accent characters.	`True`
`lowercase`	`bool`	Whether or not to convert all characters to lowercase.	`True`
`preprocessor`		An optional preprocessing function which overrides the `strip_accents` and `lowercase` steps, while preserving the tokenizing and n-grams generation steps.	`None`
`tokenizer`	`Callable[[str], List[str]]`	A function used to convert preprocessed text into a list of tokens. A default tokenizer is used if `None` is passed. Set to `False` to disable tokenization.	`None`
`ngram_range`	`Tuple[int, int]`	The lower and upper boundary of the range of n-grams to be extracted. All values of n such that `min_n <= n <= max_n` will be used. For example, an `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)` means unigrams and bigrams, and `(2, 2)` means only bigrams.	`(1, 1)`
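
The roles of `alpha` and `subsampling_threshold` can be illustrated with the formulas published for the original word2vec. The sketch below is an assumption about what these parameters control: the `PrepSG` and `PrepCbow` classes are not shown on this page, so rivertext's exact formulas may differ. The function names are hypothetical.

```python
import math


def smoothed_unigram(counts, alpha=0.75):
    """Negative-sampling distribution: p(w) proportional to count(w) ** alpha."""
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def keep_probability(count, total_tokens, threshold=1e-3):
    """Probability of keeping a token under Mikolov et al.'s subsampling rule.

    The paper discards a word with probability 1 - sqrt(threshold / f),
    where f is the word's relative frequency in the corpus.
    """
    freq = count / total_tokens
    return min(1.0, math.sqrt(threshold / freq))


# Toy counts, invented for illustration.
counts = {"the": 900, "stream": 90, "embedding": 10}
total = sum(counts.values())

probs = smoothed_unigram(counts)
keep = {w: keep_probability(c, total) for w, c in counts.items()}
```

With `alpha` below 1, the sampling distribution is flatter than the raw frequencies, so rare words are drawn as negatives more often than their counts alone would suggest. A smaller `subsampling_threshold` discards frequent words more aggressively.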

Source code in rivertext/models/iw2v.py

def __init__(
    self,
    vocab_size: int = 1_000_000,
    emb_size: int = 100,
    unigram_table_size: int = 100_000_000,
    window_size: int = 5,
    alpha: float = 0.75,
    subsampling_threshold: float = 1e-3,
    neg_samples_sum: int = 10,
    sg: int = 1,
    lr: float = 0.025,
    device: str = None,
    optimizer: Optimizer = SparseAdam,
    on: str = None,
    strip_accents: bool = True,
    lowercase: bool = True,
    preprocessor=None,
    tokenizer: Callable[[str], List[str]] = None,
    ngram_range: Tuple[int, int] = (1, 1),
):
    """An instance of IWord2Vec class.

    Args:
        vocab_size: Vocabulary size, by default 1_000_000.
        emb_size: Embedding size, by default 100.
        unigram_table_size: Unigram table size, by default 100_000_000.
        window_size: Window size, by default 5.
        alpha: Smoothing exponent applied to word counts in the unigram table,
            by default 0.75.
        subsampling_threshold: Subsampling threshold, by default 1e-3.
        neg_samples_sum: Number of negative samples to use, by default 10.
        sg: Training algorithm: 1 for SG; otherwise CBOW. By default 1.
        lr: Learning rate of the optimizer, by default 0.025. In the code below
            it is only used when `sg` is truthy; the CBOW branch passes a fixed
            `lr=0.05`.
        device: Device to run the wrapped model on, e.g. "cpu" or "cuda", by
            default None.
        optimizer: Optimizer class used to train the model, by default
            SparseAdam.
        on: The name of the feature that contains the text to vectorize. If `None`,
            then each `learn_one` and `transform_one` should treat `x` as a `str`
            and not as a `dict`. By default None.
        strip_accents: Whether or not to strip accent characters, by default True.
        lowercase: Whether or not to convert all characters to lowercase, by
            default True.
        preprocessor: An optional preprocessing function which overrides the
            `strip_accents` and `lowercase` steps, while preserving the tokenizing
            and n-grams generation steps. By default None.
        tokenizer: A function used to convert preprocessed text into a list of
            tokens. A default tokenizer is used if `None` is passed. Set to `False`
            to disable tokenization. By default None.
        ngram_range: The lower and upper boundary of the range of n-grams to be
            extracted. All values of n such that `min_n <= n <= max_n` will be used.
            For example, an `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)`
            means unigrams and bigrams, and `(2, 2)` means only bigrams. By default
            (1, 1).

    """

    super().__init__(
        vocab_size,
        emb_size,
        window_size,
        on=on,
        strip_accents=strip_accents,
        lowercase=lowercase,
        preprocessor=preprocessor,
        tokenizer=tokenizer,
        ngram_range=ngram_range,
    )

    self.neg_sample_num = neg_samples_sum
    self.sg = sg

    if sg:
        self.model_name = "ISG"
        self.model = SG(self.vocab_size, emb_size)
        self.prep = PrepSG(
            vocab_size=vocab_size,
            unigram_table_size=unigram_table_size,
            window_size=window_size,
            alpha=alpha,
            subsampling_threshold=subsampling_threshold,
            neg_samples_sum=neg_samples_sum,
            tokenizer=tokenizer,
        )
        self.optimizer = optimizer(self.model.parameters(), lr=lr)

    else:
        self.model_name = "ICBOW"
        self.model = CBOW(vocab_size, emb_size)
        self.prep = PrepCbow(
            vocab_size=vocab_size,
            unigram_table_size=unigram_table_size,
            window_size=window_size,
            alpha=alpha,
            subsampling_threshold=subsampling_threshold,
            neg_samples_sum=neg_samples_sum,
            tokenizer=tokenizer,
        )
        self.optimizer = optimizer(self.model.parameters(), lr=0.05)
    self.device = device
    self.model.to(self.device)

`learn_many(X, y=None, **kwargs)`

Train on a mini-batch of text instances.

Parameters:

Name	Type	Description	Default
`X`	`List[str]`	A list of sentences.	required
`y`		A series of target values, by default None.	`None`

Examples:

>>> from torch.utils.data import DataLoader
>>> from rivertext.models.iw2v import IWord2Vec
>>> from rivertext.utils import TweetStream
>>> ts = TweetStream("/path/to/tweets.txt")
>>> dataloader = DataLoader(ts, batch_size=32)
>>> iw2v = IWord2Vec(
...    window_size=3,
...    vocab_size=3,
...    emb_size=3,
...    sg=0,
...    neg_samples_sum=1,
...    device="cuda:0"
... )
>>> for batch in dataloader:
...    iw2v.learn_many(batch)
>>> iw2v.vocab2dict()
{'hello': [0.77816248, 0.99913448, 0.14790398],
'are': [0.86127345, 0.24901696, 0.28613529],
'you': [0.64463917, 0.9003653 , 0.26000987],
'this': [0.97007572, 0.08310498, 0.61532574],
'example':  [0.74144294, 0.77877194, 0.67438642]
}
>>> iw2v.transform_one('hello')
[0.77816248, 0.99913448, 0.14790398]

Source code in rivertext/models/iw2v.py

def learn_many(self, X: List[str], y=None, **kwargs) -> None:
    """Train on a mini-batch of text instances.

    Args:
        X: A list of sentences.
        y: A series of target values, by default None.

    Examples:
        >>> from torch.utils.data import DataLoader
        >>> from rivertext.models.iw2v import IWord2Vec
        >>> from rivertext.utils import TweetStream
        >>> ts = TweetStream("/path/to/tweets.txt")
        >>> dataloader = DataLoader(ts, batch_size=32)
        >>> iw2v = IWord2Vec(
        ...    window_size=3,
        ...    vocab_size=3,
        ...    emb_size=3,
        ...    sg=0,
        ...    neg_samples_sum=1,
        ...    device="cuda:0"
        ... )
        >>> for batch in dataloader:
        ...    iw2v.learn_many(batch)
        >>> iw2v.vocab2dict()
        {'hello': [0.77816248, 0.99913448, 0.14790398],
        'are': [0.86127345, 0.24901696, 0.28613529],
        'you': [0.64463917, 0.9003653 , 0.26000987],
        'this': [0.97007572, 0.08310498, 0.61532574],
        'example':  [0.74144294, 0.77877194, 0.67438642]
        }
        >>> iw2v.transform_one('hello')
        [0.77816248, 0.99913448, 0.14790398]
    """

    tokens = list(map(self.process_text, X))
    batch = self.prep(tokens)
    targets = batch[0].to(self.device)
    contexts = batch[1].to(self.device)
    neg_samples = batch[2].to(self.device)

    self.optimizer.zero_grad()
    loss = self.model(targets, contexts, neg_samples)
    loss.backward()
    self.optimizer.step()
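
The training step above calls `self.model(targets, contexts, neg_samples)` and backpropagates the returned loss, but the `SG` and `CBOW` modules themselves are not listed on this page. As a rough guide to what such a loss typically computes, the sketch below evaluates the standard skip-gram negative-sampling objective for a single (target, context) pair in plain Python. The function names are hypothetical, and the real modules operate on batched PyTorch tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target, context, negatives):
    """Loss for one (target, context) pair and its sampled negative vectors."""
    # Pull the true context vector towards the target vector...
    loss = -log_sigmoid(dot(target, context))
    # ...and push each sampled negative vector away from it.
    for neg in negatives:
        loss -= log_sigmoid(-dot(target, neg))
    return loss


# Toy vectors, invented for illustration.
target = [0.5, -0.2, 0.1]
context = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.6, 0.0], [0.1, 0.1, -0.5]]
loss = negative_sampling_loss(target, context, negatives)
```

Each update only touches the target, the context, and `neg_samples_sum` negative vectors instead of the whole vocabulary, which is where the speed-up from negative sampling comes from.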

`learn_one(x, **kwargs)`

Train on a single text instance.

Parameters:

Name	Type	Description	Default
`x`	`str`	One line of text.	required

Examples:

>>> from torch.utils.data import DataLoader
>>> from rivertext.models.iw2v import IWord2Vec
>>> from rivertext.utils import TweetStream
>>> ts = TweetStream("/path/to/tweets.txt")
>>> dataloader = DataLoader(ts)
>>> iw2v = IWord2Vec(
...    window_size=3,
...    vocab_size=3,
...    emb_size=3,
...    sg=0,
...    neg_samples_sum=1,
...    device="cuda:0"
... )
>>> for tweet in dataloader:
...    iw2v.learn_one(tweet)
>>> iw2v.vocab2dict()
{'hello': [0.77816248, 0.99913448, 0.14790398],
'are': [0.86127345, 0.24901696, 0.28613529],
'you': [0.64463917, 0.9003653 , 0.26000987],
'this': [0.97007572, 0.08310498, 0.61532574],
'example':  [0.74144294, 0.77877194, 0.67438642]
}
>>> iw2v.transform_one('hello')
[0.77816248, 0.99913448, 0.14790398]

Source code in rivertext/models/iw2v.py

def learn_one(self, x: str, **kwargs) -> None:
    """Train on a single text instance.

    Args:
        x: One line of text.

    Examples:
        >>> from torch.utils.data import DataLoader
        >>> from rivertext.models.iw2v import IWord2Vec
        >>> from rivertext.utils import TweetStream
        >>> ts = TweetStream("/path/to/tweets.txt")
        >>> dataloader = DataLoader(ts)
        >>> iw2v = IWord2Vec(
        ...    window_size=3,
        ...    vocab_size=3,
        ...    emb_size=3,
        ...    sg=0,
        ...    neg_samples_sum=1,
        ...    device="cuda:0"
        ... )
        >>> for tweet in dataloader:
        ...    iw2v.learn_one(tweet)
        >>> iw2v.vocab2dict()
        {'hello': [0.77816248, 0.99913448, 0.14790398],
        'are': [0.86127345, 0.24901696, 0.28613529],
        'you': [0.64463917, 0.9003653 , 0.26000987],
        'this': [0.97007572, 0.08310498, 0.61532574],
        'example':  [0.74144294, 0.77877194, 0.67438642]
        }
        >>> iw2v.transform_one('hello')
        [0.77816248, 0.99913448, 0.14790398]

    """
    tokens = self.process_text(x[0])
    batch = self.prep(tokens)
    targets = batch[0].to(self.device)
    contexts = batch[1].to(self.device)
    neg_samples = batch[2].to(self.device)

    self.optimizer.zero_grad()
    loss = self.model(targets, contexts, neg_samples)
    loss.backward()
    self.optimizer.step()

`transform_one(x)`

Obtain the vector embedding of a word.

Parameters:

Name	Type	Description	Default
`x`	`str`	Word whose embedding is requested.	required

Returns:

Type	Description
`np.ndarray`	The vector embedding of the word.

Source code in rivertext/models/iw2v.py

def transform_one(self, x: str) -> np.ndarray:
    """Obtain the vector embedding of a word.

    Args:
        x: Word whose embedding is requested.

    Returns:
        The vector embedding of the word.
    """
    word_idx = self.prep.vocab[x]
    return self.model.get_embedding(word_idx)

`vocab2dict()`

Converts the vocabulary into a dictionary of embeddings.

Returns:

Type	Description
`Dict[str, np.ndarray]`	A dict whose keys are the words and whose values are their embedding vectors.
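
A common use of the returned dictionary is nearest-neighbour lookup. The sketch below ranks words by cosine similarity using only the standard library. The helper names `cosine_similarity` and `most_similar` are hypothetical and not part of rivertext, and the vectors are copied from the illustrative output in the class example rather than from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, top_k=3):
    """Rank the other words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Stand-in for the result of iw2v.vocab2dict().
embeddings = {
    "hello": [0.77816248, 0.99913448, 0.14790398],
    "are": [0.86127345, 0.24901696, 0.28613529],
    "you": [0.64463917, 0.9003653, 0.26000987],
}
neighbours = most_similar("hello", embeddings, top_k=2)
```

The same functions accept NumPy arrays as values, since they only iterate over the vector components.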

Source code in rivertext/models/iw2v.py

def vocab2dict(self) -> Dict[str, np.ndarray]:
    """Converts the vocabulary into a dictionary of embeddings.

    Returns:
        A dict whose keys are the words and whose values are their
            embedding vectors.
    """
    embeddings = {}
    for word in tqdm(self.prep.vocab.word2idx.keys()):
        embeddings[word] = self.transform_one(word)
    return embeddings
