TweetStream and DataLoader
Stream of Tweets
This tutorial uses a dataset of unlabeled tweets to simulate a text stream. Twitter is a good source of text streams because it is widely used and its users post updates in real time. We draw a set of ten million tweets in English from the Edinburgh corpus, a multilingual collection of tweets gathered for academic purposes through the Twitter API between November 2009 and February 2010. The dataset consists of 1,000,000 tweets in a text file, with one tweet per line.
The file can be downloaded from here.
TweetStream Class
To read large text files that may not fit into memory, we implemented the TweetStream class as an extension of the IterableDataset class from the PyTorch API. We then pass the iterable dataset to the DataLoader provided by PyTorch. Because the file is read lazily as the stream is consumed, the data can be accessed without holding large amounts of it in memory.
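The TweetStream source is not reproduced in this tutorial, so the following is only a minimal sketch of the general pattern, assuming the class does little more than yield one line per tweet. The name `LineStream`, its constructor arguments, and the sample file contents are hypothetical and are not part of rivertext.

```python
import os
import tempfile

from torch.utils.data import DataLoader, IterableDataset


class LineStream(IterableDataset):
    """Hypothetical stand-in for TweetStream: yields one line at a time."""

    def __init__(self, path, encoding="utf-8"):
        super().__init__()
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # A file object is a lazy iterator, so only the current line
        # is held in memory while the stream is consumed.
        with open(self.path, encoding=self.encoding) as f:
            for line in f:
                text = line.rstrip("\n")
                if text:  # skip blank lines
                    yield text


def demo():
    # Write a tiny sample file so the sketch runs without the real corpus.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("first tweet\nsecond tweet\nthird tweet\n")
        path = tmp.name
    try:
        loader = DataLoader(LineStream(path), batch_size=2)
        return [batch for batch in loader]
    finally:
        os.remove(path)


batches = demo()
print(batches)
```

With the default settings used here (no worker processes), the DataLoader pulls items from `__iter__` one by one and groups them into lists of `batch_size` strings; the last batch may be shorter.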
Import Libraries
from rivertext.utils import TweetStream
from torch.utils.data import DataLoader
Load the Text Stream
ts = TweetStream("tweets.txt")
dataloader = DataLoader(ts, batch_size=1)
for tweet in dataloader:
    print(tweet)
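What each loop iteration yields depends on `batch_size`. The sketch below illustrates this with a hypothetical in-memory dataset, `TinyStream`, used in place of TweetStream so that it runs without the tweet file. It assumes the stream yields plain strings, as a line-based reader would.

```python
from torch.utils.data import DataLoader, IterableDataset


class TinyStream(IterableDataset):
    """Hypothetical in-memory stream of strings, for illustration only."""

    def __init__(self, texts):
        super().__init__()
        self.texts = list(texts)

    def __iter__(self):
        return iter(self.texts)


stream = TinyStream(["hello world", "good morning"])

# batch_size=1: every item is wrapped in a list of length one.
wrapped = list(DataLoader(stream, batch_size=1))

# batch_size=None disables automatic batching: items arrive as raw strings.
raw = list(DataLoader(stream, batch_size=None))

print(wrapped)
print(raw)
```

With `batch_size=1`, as in the cell above, each `tweet` is therefore a one-element list, and the text itself is `tweet[0]`.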
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-3-e5e42fc80fc1> in <module> ----> 1 for tweet in dataloader: 2 print(tweet) NameError: name 'dataloader' is not defined