Hoeffding Trees are a variant of decision trees that learn from a data stream, which makes it possible to analyze huge datasets without storing large amounts of data.
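Wassily Hoeffding's inequality bounds the probability that the mean of a sample of independent, bounded random variables deviates from its true mean by more than a given amount. A Hoeffding Tree uses this bound to decide how many examples from the stream are enough to make a decision with high confidence, so each decision rests on a sample of the data and a coarser, approximate model rather than on the full dataset.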
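To make the role of the inequality concrete, here is a minimal sketch of the bound in its usual one-sided form: after n independent observations of a variable with range R, the true mean is at least the observed mean minus epsilon with probability 1 - delta, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). The function names and the example values for delta are illustrative assumptions, not taken from the article.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability at least 1 - delta, the true mean
    is no more than epsilon below the mean observed over n examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound shrinks to the requested epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# The bound depends only on n, the range, and delta, never on the stream size.
for n in (100, 1_000, 10_000):
    print(n, hoeffding_bound(value_range=1.0, delta=1e-7, n=n))
```

Because epsilon shrinks with the square root of n, quadrupling the number of examples halves the bound, which is why a modest sample can stand in for an arbitrarily long stream.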
Hoeffding Trees can train an accurate model even on a stream as large as every tweet posted on Twitter, while using very little local memory, because each decision is made from a bounded sample of the stream rather than from all of it.
Hoeffding Trees offer an easy and effective way to reduce memory complexity: they keep summary statistics about the training data seen so far, so the sampled examples themselves never have to be stored.
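One common way to keep track of training data without storing it is to hold only counts at each leaf. The sketch below assumes categorical attributes and uses a hypothetical `LeafStats` helper; the attribute names in the usage lines are made up for illustration.

```python
from collections import defaultdict


class LeafStats:
    """Counts kept at one leaf in place of the raw examples (hypothetical helper)."""

    def __init__(self):
        # counts[attribute][value][label] -> number of examples seen so far
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.class_counts = defaultdict(int)
        self.n_seen = 0

    def update(self, x: dict, label) -> None:
        """Fold one example into the counts; the example can then be discarded."""
        self.n_seen += 1
        self.class_counts[label] += 1
        for attribute, value in x.items():
            self.counts[attribute][value][label] += 1

    def n_counters(self) -> int:
        """Number of stored counters, which is what memory use depends on."""
        return sum(
            len(by_label)
            for by_value in self.counts.values()
            for by_label in by_value.values()
        )


# Usage: a synthetic stream of 10,000 examples with two made-up attributes.
stats = LeafStats()
for i in range(10_000):
    example = {"has_link": i % 2 == 0, "lang": ("en", "es", "fr")[i % 3]}
    stats.update(example, label=(i % 5 == 0))
print(stats.n_seen, stats.n_counters())
```

The number of counters is bounded by attributes × values × classes, so it stays the same whether the leaf has seen ten thousand examples or ten million.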
These trees are used not only for classification but also for regression, and they apply to other kinds of streaming data as well, including new data points arriving from social media.
Hoeffding Trees offer a feasible way around a central constraint of training decision tree models on large datasets: the data never stops arriving, as with the endless stream of new data points coming in through social media.
Unlike a standard decision tree learner, which needs the entire training set, a Hoeffding Tree samples only as much training data as it needs, and so builds the tree from incomplete but sufficiently accurate information.
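The idea of "the correct amount of training data" can be sketched as a split test: a leaf splits once the gap between its two best attributes exceeds the Hoeffding bound for the number of examples seen. This is a simplified sketch, assuming information gain as the split measure, categorical attributes, and counts stored as `{attribute: {value: {label: count}}}`. The function names and the default `delta` are illustrative, and refinements such as tie-breaking are left out.

```python
import math


def entropy(label_counts: dict) -> float:
    """Shannon entropy (in bits) of a {label: count} mapping."""
    total = sum(label_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in label_counts.values() if c > 0
    )


def information_gain(counts_for_attribute: dict, class_counts: dict) -> float:
    """Gain of one attribute, given {value: {label: count}} and {label: count}."""
    total = sum(class_counts.values())
    remainder = sum(
        (sum(by_label.values()) / total) * entropy(by_label)
        for by_label in counts_for_attribute.values()
    )
    return entropy(class_counts) - remainder


def attribute_to_split_on(counts: dict, class_counts: dict, delta: float = 1e-7):
    """Return the best attribute if the evidence is sufficient, else None."""
    n = sum(class_counts.values())
    if n == 0 or len(counts) < 2:
        return None
    ranked = sorted(
        ((information_gain(c, class_counts), name) for name, c in counts.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    (best_gain, best_name), (second_gain, _) = ranked[0], ranked[1]
    # Information gain lies in [0, log2(number of classes)].
    value_range = math.log2(max(len(class_counts), 2))
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    return best_name if best_gain - second_gain > epsilon else None
```

When the function returns None, the leaf simply keeps reading from the stream; once it returns an attribute, the same choice would be made with high probability even if every remaining example were seen.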
This article provides a high-level overview of techniques that can be applied to live data streams, based on a workflow that was developed after studying Hoeffding Trees.
Hoeffding Trees are fast, cheap, and accurate because each internal node of the tree is chosen from a limited sample of examples, regardless of the size of the dataset.
Overall, we can safely train our models on a huge volume of streamed data without worrying about RAM or processing power.