In recent years, deep learning (DL) models have shown remarkable potential in the field of forecasting. Models such as DeepAR, NeuralProphet, PatchTST, Temporal Fusion Transformer, NHiTS, and TiDE have achieved state-of-the-art performance on traditional academic benchmarks. However, despite their promise, these models often fall short when applied to the complexities of real-world forecasting.
In this post, I’ll discuss these challenges and share how we at Wix successfully tackled them, resulting in a 40% improvement in our collections income forecast. This significant improvement translates into better decision-making capabilities for our company stakeholders. For a quick overview of the challenges we faced and the solutions we implemented, refer to the summary table.
In the following sections, I will outline the specific challenges we encountered and describe the solutions we implemented to overcome them.
Challenges
1. Forecasting Tasks Require Frequent Model Retraining
Forecasting models in production must be constantly re-trained to maintain their performance over time. Unlike other deep learning tasks, such as computer vision or NLP, forecasting tasks are highly sensitive to changes in the underlying data patterns. In NLP, the structure of language remains relatively stable.
However, in forecasting, the patterns in time-series data can evolve rapidly due to seasonal variations, market trends, and unexpected events. Without regular updates, the model’s predictions become less accurate, making frequent retraining essential for maintaining reliable forecasts.
Moreover, many advanced deep learning models, especially transformer-based ones like Temporal Fusion Transformer (TFT) and PatchTST, require significant computational resources and time to train. This high cost and time requirement can be a major barrier to their practical implementation, particularly in environments where quick turnaround is essential. As a result, the continuous updating process needs to be fast, automatic, and resource-efficient to be practical in a real-world setting.
2. High Forecast Revisions (Variance)
After retraining, while a model’s performance might appear stable in terms of accuracy, its forecasts for the distant future can change significantly. From our experience, deep learning models tend to have much higher revisions (i.e. variance in the forecasted collections) between different training points compared to traditional machine learning models. These frequent and substantial changes are problematic as stakeholders need to trust the model’s output. If the forecasts change significantly too often, it becomes challenging for stakeholders to rely on the model for decision-making.
Illustrating the practical implications of revisions using a simulation. Here, we simulated a forecast with a mean of 1000 and a variance of 5, changing weekly. The relatively low variance has a tremendous effect on future forecasts: even small week-on-week percentage changes carry a high monetary value and can reach a $15M week-on-week change. Stakeholders have a hard time making decisions based on this behavior.
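This behavior is easy to reproduce. Below is a minimal sketch of such a simulation in Python, assuming (as a simplification) that each week's forecast level is drawn independently from a normal distribution around 1000 with a spread of 5; all names here are illustrative, not from our production code.

```python
import random

def simulate_weekly_revisions(weeks=52, mean=1000.0, std=5.0, seed=0):
    """Draw a fresh forecast level each week and record week-on-week revisions."""
    rng = random.Random(seed)
    forecasts = [rng.gauss(mean, std) for _ in range(weeks)]
    revisions = [
        (curr - prev) / prev  # relative week-on-week change
        for prev, curr in zip(forecasts, forecasts[1:])
    ]
    return forecasts, revisions

forecasts, revisions = simulate_weekly_revisions()
# Each relative revision is tiny (well under 1% on most weeks), yet applied
# to a large monetary base these small swings are exactly what stakeholders
# see as unstable forecasts.
```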
3. Dependency on External Factors
The accuracy of a time-series forecast often depends on other related business behaviors, such as sales events or product pricing in the case of collections forecasting. Not all deep learning models (e.g., N-BEATS and some of the newer LLM-based models) support the inclusion of these external features, and models that cannot incorporate them may not be suitable for certain forecasting tasks, such as our use case.
Simulation of the 2023 collections, illustrating the impact of promotional sale days on the collections. Without considering the sale days, any model would be much less accurate.
4. Data Splitting and Recent Trends
A fundamental principle in deep learning is the division of data into training and validation sets, with training typically halting when the validation loss stops decreasing in order to avoid overfitting. In time-series forecasting, the validation set often includes the most recent data points.
This approach, however, can be limiting for long-term forecasts. The model might not incorporate the most recent trends effectively, as this data is reserved for validation. Consequently, the model may miss out on the latest patterns and behaviors, which are crucial for making accurate long-term predictions.
5. Hyper-Parameter Tuning
Choosing the optimal hyper-parameters for a model is crucial for its performance. In traditional machine learning, this is often achieved through rolling cross-validation. This technique involves training the model on various training periods and evaluating its performance multiple times across these different periods.
However, deep learning models usually require significantly more time to train compared to traditional machine learning models. This makes the process of hyper-parameter tuning with rolling cross validation much more time-consuming and computationally expensive. Conducting a thorough hyper-parameter search for deep learning models using the same approach as traditional machine learning would be impractical due to the enormous amount of time and resources required.
Inefficient hyper-parameter tuning can lead to suboptimal model performance, impacting forecasting accuracy and overall business outcomes.
Our Solution(s) — Leveraging TiDE for Collections Income Forecasting
Last year, Google researchers introduced TiDE (Time-series Dense Encoder), a cutting-edge long-term forecasting model featuring an encoder-decoder architecture constructed with Multilayer Perceptrons (MLPs).
In their paper, “Long-term Forecasting with TiDE: Time-series Dense Encoder”, the authors showcase TiDE’s state-of-the-art performance across various datasets, surpassing all other transformer-based models on long-horizon tasks.
On top of the great results on academic datasets, we tested TiDE among a few other models and got great results in terms of accuracy (e.g. low MAPEs) and applicability to real-world applications.
For a deeper dive, check the article or this great blog post.
In the next part, we explain how we leveraged TiDE features together with some smart procedures in order to get to a viable and consistent model in production:
1. Retraining the model becomes easy with TiDE
TiDE, unlike recent trends, is based on MLPs. This seemingly technical detail translates into a practical benefit — significantly faster training times. For example, the authors of TiDE showed it can be trained a staggering 10 to 40 times faster than PatchTST. This dramatic reduction in training time makes it a perfect fit for regular retraining, enabling us to leverage the power of deep learning while keeping retraining costs and timelines under control.
Comparing TiDE train time with PatchTST, with respect to model look-back window, taken from the Long-term Forecasting with TiDE article. TiDE training is more than an order of magnitude faster.
2. Tackling the revisions problem with ensemble forecasting
The following solution can be applied to any model that is retrained regularly. Instead of relying solely on the most recent retrained model, we adopted an ensemble forecasting strategy. Here’s how it works: assuming weekly retraining, we take the average of the forecasts generated by the last X trained models (in our case, X=4).
This approach effectively smooths out drastic revisions and injects stability into the overall forecast. As an added benefit, this technique also introduces an ensemble effect, essentially combining the strengths of multiple models to potentially enhance the forecast’s accuracy (as we observed in our use case).
While simple averaging serves as a solid baseline, it’s essential to acknowledge its limitations. More sophisticated ensemble techniques, such as weighted averaging or model selection, could potentially yield even better results.
Exploring these advanced methods represents an exciting avenue for future research in our forecasting pipeline. Despite the simplicity of our current approach, we found it to be remarkably effective in reducing forecast revisions (variance) and improving overall stability, making it a valuable component of our forecasting strategy.
Simulation of the differences in the 2023 collections forecast, across train dates, between the weekly retrained TiDE model and its ensemble version. The ensemble model is much more stable, and although it still learns from new trends and increases the forecast, the process is gradual and stable.
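A minimal sketch of this averaging scheme, assuming each weekly retrained model emits a forecast list over the same horizon (the class and variable names are illustrative, not our production code):

```python
from collections import deque

class RollingEnsemble:
    """Average the forecasts of the last `window` retrained models.

    Each week we append the newest model's forecast (a list of per-period
    values over the horizon); the published forecast is the element-wise
    mean of the retained forecasts.
    """

    def __init__(self, window=4):
        self.forecasts = deque(maxlen=window)

    def add(self, forecast):
        self.forecasts.append(list(forecast))

    def combined(self):
        horizon = min(len(f) for f in self.forecasts)
        return [
            sum(f[t] for f in self.forecasts) / len(self.forecasts)
            for t in range(horizon)
        ]

ensemble = RollingEnsemble(window=4)
for weekly_forecast in ([100, 110], [104, 118], [96, 106], [100, 110]):
    ensemble.add(weekly_forecast)
print(ensemble.combined())  # → [100.0, 111.0]
```

When a fifth forecast arrives, the deque's `maxlen` drops the oldest one automatically, so the published forecast always blends the four most recent models.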
3. Unlocking the Power of Exogenous Covariates
TiDE offers unparalleled flexibility in incorporating exogenous variables, or external factors, into the forecasting process. At Wix, our collections on a given day can be influenced by various factors we want to consider when we forecast the future, from special sales discounts to the trend of the number of new users signing up.
TiDE supports modeling the impact of both future known covariates, such as planned sale days (impact is modeled based on past data), and the impact of uncertain factors in the future like incoming user traffic. This capability is significant as it enhances forecast accuracy and reduces error by allowing us to account for a wide range of influences on our collections.
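To make the distinction concrete, here is a small illustrative sketch in plain Python (hypothetical names, not TiDE’s actual API) of how the two covariate types differ when assembling model inputs: sale days are known for every future date, while traffic is only observed up to the present, so future slots here fall back to the last known value.

```python
from datetime import date, timedelta

def build_covariates(start, total_days, sale_days, observed_traffic):
    """Build one covariate row per day over the look-back window plus horizon.

    `sale_days` acts as a future-known covariate: the flag is available for
    every date. `observed_traffic` acts as a past covariate: beyond the
    observed days we fall back to the last known value (a naive choice made
    only for this sketch).
    """
    last_traffic = observed_traffic[-1]
    rows = []
    for i in range(total_days):
        day = start + timedelta(days=i)
        rows.append({
            "date": day.isoformat(),
            "is_sale_day": day in sale_days,  # known even for future dates
            "traffic": observed_traffic[i] if i < len(observed_traffic) else last_traffic,
        })
    return rows

covs = build_covariates(
    start=date(2023, 11, 20),
    total_days=5,
    sale_days={date(2023, 11, 24)},    # e.g. a planned Black Friday sale
    observed_traffic=[120, 130, 125],  # only the first 3 days are observed
)
```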
4. Splitting the train and validation sets in a smart way
To overcome the tradeoff between learning from recent trends and avoiding overfitting during training, we implemented a novel validation strategy. Instead of holding out a separate validation set, we designated the final portion of the training set itself as the validation set.
While this approach technically violates the separation between training and validation data, in our use case (collections forecasting), the benefits outweigh the drawbacks. By allowing the model to “overfit” to the most recent data within the training set, we empirically observed a significant improvement in the model’s accuracy for long-horizon forecasts on the untouched test sets.
Showcasing the new train-validation split and its difference from the classic split.
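A minimal sketch of the two splitting strategies (illustrative helper names; the series is assumed to be an ordered list of observations):

```python
def classic_split(series, val_fraction=0.2):
    """Classic split: the most recent points are held out and never trained on."""
    cut = int(len(series) * (1 - val_fraction))
    return series[:cut], series[cut:]

def overlapping_split(series, val_fraction=0.2):
    """Our variant: train on the full series, while the final portion doubles
    as the validation set used for early stopping (a deliberate overlap)."""
    cut = int(len(series) * (1 - val_fraction))
    return series, series[cut:]

series = list(range(100))
train, val = overlapping_split(series)
# Unlike the classic split, `train` still includes the latest points
# (train[-1] == 99), so the model sees the most recent trends.
```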
5. Smart hyperparameter tuning with rolling cross-validation using Optuna
Here, once again, the exceptional training speed of the TiDE model came to our rescue. Coupled with Optuna, a powerful hyperparameter optimization library, it allowed us to streamline the tuning process. Optuna utilizes a Bayesian approach to efficiently explore different hyperparameter configurations while minimizing the number of training runs required.
Rolling cross-validation (also known as forward chaining) is a heavy process. Since we want to evaluate model performance across the year, we backtest and evaluate the model at the end of each month. Therefore, completing a single evaluation of a hyperparameter configuration requires at least 12 full training cycles, averaging the performance across the different train dates.
The combination of TiDE’s speed and Optuna’s intelligent search enabled us to rapidly identify the best possible model configuration within a time-efficient framework. For this stage only, we leveraged the processing power of a single GPU to expedite the evaluation of hundreds of hyperparameter configurations using rolling cross-validation, all within a matter of hours.
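The forward-chaining evaluation can be sketched as follows. This is a simplified stand-in: the toy seasonal-naive "model" and all names are ours for illustration, whereas in practice each fold would retrain TiDE and an Optuna objective function would supply trial-suggested hyperparameters to the backtest.

```python
def mape(actual, predicted):
    """Mean absolute percentage error between two equal-length sequences."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def naive_forecast(history, horizon, season=7):
    """Toy stand-in for model training + prediction: repeat the last season."""
    last = history[-season:]
    return [last[i % season] for i in range(horizon)]

def rolling_cv(series, n_folds=12, horizon=30, season=7):
    """Forward chaining: 'retrain' at the end of each month-sized block,
    score the forecast against the next `horizon` points, and average the
    errors across all folds."""
    scores = []
    for fold in range(n_folds):
        cut = len(series) - (n_folds - fold) * horizon
        train, test = series[:cut], series[cut:cut + horizon]
        scores.append(mape(test, naive_forecast(train, horizon, season)))
    return sum(scores) / len(scores)

# An Optuna objective would call rolling_cv with trial-suggested
# hyperparameters (here only `season` plays that role) and return the
# mean MAPE for the study to minimize.
```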
Conclusion
Navigating the intricacies of real-world forecasting with deep learning is no small feat. At Wix, we’ve faced and addressed numerous challenges head-on, from the need for constant model retraining and managing high computational costs to handling forecast revisions (variance) and incorporating external factors. Our journey led us to adopt TiDE, a model that not only offers state-of-the-art accuracy but also provides practical benefits like faster training times and enhanced flexibility.
By leveraging innovative solutions such as ensemble forecasting, smart validation strategies, and efficient hyperparameter tuning with tools like Optuna, we’ve achieved a 40% improvement in our collections income forecast. This success underscores the importance of tailoring advanced models to meet real-world demands, ensuring they deliver reliable and actionable forecasts.
As the field of deep learning continues to evolve, the lessons we’ve learned and the strategies we’ve implemented can serve as a valuable blueprint for other organizations looking to harness the power of deep learning for forecasting. Our experience at Wix is a testament to the potential of combining cutting-edge research with practical, real-world applications to drive meaningful business outcomes.
In a follow-up post, we will delve deeper into the specifics of TiDE and provide a detailed walkthrough notebook of our entire process. Stay tuned for an in-depth guide that will help you apply these techniques to your own forecasting challenges.
This post was written by Ariel Berger