Web Traffic Time Series Forecasting

Introduction and Overview of the Problem

Web Traffic Forecasting is a problem of forecasting the future page views that we can receive for given Wikipedia articles. The prediction to be carried out based on the available time series data. The data for this problem can be downloaded from here.

Forecasting Page Views of articles

In this problem, we are provided with Page Views of approx. 145k Wikipedia articles. Our goal is to forecast the Page Views for these articles for future timestamps. Let’s understand about the data from Exploratory Data Analysis (EDA).

Exploratory Data Analysis (EDA) :

With Train_*.csv being the base file in this problem, we are given information about Page Name and it’s Page View for given dates.

Train data with Page Name and it’s respective page views for mentioned date

From above, we can see that the Page Name can become important feature in understanding the data distribution, as page name contains ;

We can create new features for that page by extracting above mentioned information from page name.

Extra Features for each page, driven from its Page Name

Web Traffic : Site Name Feature

data_counts  = pageData['Site_Name'].value_counts()

More traffic was observed from Wikipedia website. Lower traffic was observed from WikiMedia and MediaWiki pages.

Web Traffic : Site Language Feature

data_counts  = pageData['Language'].value_counts()

More Traffic was observed with pages having “English” language which is commonly spoken language. “www” and “commons” recorded least amount of traffic for these pages.

Web Traffic : Page Access Feature

data_counts  = pageData['Access'].value_counts()

There are three major source from which these web pages are accessed from, viz.

From above Bar plots, we can see that the traffic is more for pages with all-access as access category. The remaining two categories shows comparatively less traffic. Also, Mobile-web traffic is slightly more compared to that of Desktop based traffic.

Web Traffic : Agent Feature

data_counts = pageData['Agent'].value_counts()

For all pages, we are having two agents through which the pages are being accessed. All-agents being the agent through which most of the traffic was observed from compared to Spider agent.

Time Series : Mean Page Views

mean = timeseries.mean(axis=0)

From this timely distribution of the data, we can see that the data is not stationary over time and also there is no combining effect was observed in the data i.e. strict increase in the page views over the increase in time. There can be sharp spikes seen for the page views for certain days, considering them to be holidays, weekends. Let us understand if the time series nature of data has some correlation with it’s own lag value.

Time Series : Correlation with Lagged values with ACF and PACF

mean = timeseries.mean(axis=0)
acf_plot = acf((mean.values), nlags=len(mean))

Figure shows, ACF (Auto Correlation) for the complete time series. We can clearly see that the page views are strongly correlated with their previous values.

Plotting ACF plot with different lags, to understand the correlation in depth.

With four different plots of correlation of Mean page views with its lagged values, we can see that the correlation is high in Weekly, Monthly and Quarterly lags.

We can see that the high correlation among the mean pageview drops below threshold value of 90% interval in case of Yearly correlation plot, which continues to increase in negative scale for certain period.

From above Quarterly and Yearly ACF plot, we can say that the data is highly correlated with the lagged values of the same signals for each page views. Let us see how the page views are related with the residual values of actual page views.

mean = timeseries.mean(axis=0)
acf_plot = pacf((mean.values), nlags=len(mean))

Partial ACF on the complete timestamps as seen from above figure shows there is a strong correlation of partial residuals of the lagged values in the Yearly manner. The correlation is not periodic as the peaks do not occur in timely fashion.

Understanding the PACF with Weekly, Monthly, Quarterly and Yearly lagged values.

With Weekly, Monthly, Quarterly and Yearly correlation for Mean PageViews of the data, no significant correlaiton was observed for shorted lag values, i.e. for Weekly, Monthly and Quarterly PACF.

Sharp spike of significant PACF can be observed after 2016–04–07, i.e. after 270 days. With this we can assume that a lag of year can become significant in determining the forecasting.

Time Series : Trend and Seasonality

result = seasonal_decompose(mean, model='multiplicative', freq=365)
trend_comp = result.trend
seasonal_comp = result.seasonal
residual_comp = result.resid
observed_val = result.observed

From the plot of PACF, we could see negative spike in the data, which can be observed from trend waveform.

With seasonality decomposition of the mean time series for all pages, we can see

With this understanding of the time series, we can prepare the data for modeling.

Data Preparation :

With total of 803 time stamps for page views of each pages, from Jul 27, 2015 till Sept 10, 2017, we can divide the initial 2 years data as Train data (from July 1, 2015 upto July 1, 2017) and remaining data can be used as Cross Validation data, i.e. for validating the parameters so as to perform well on future unseen points.

#getting the index of date
date_cols = timeseries.columns.tolist()
train_end = date_cols.index('2017-07-01')-1
#splitting the main data as train and cv
train_data = timeseries.iloc[:,:]
cv_data = timeseries.iloc[:,train_end:]

Later, we will One Hot Encode, each of the features derived from Page Name, which can be suitable for modeling.

label = LabelEncoder()train_data['Site_OHE']=label.fit_transform(train_data['Site_Name']).tolist()
train_data['Lang_OHE']=label.fit_transform(train_data['Language']).tolist()
train_data['Access_OHE']=label.fit_transform(train_data['Access']).tolist()
train_data['Agent_OHE']=label.fit_transform(train_data['Agent']).tolist()
#changing the last columns as Observed
train_data=train_data.rename(columns={timeseries.iloc[:,:train_end].columns[-1]:'Observed'})

With LabelEncoder, we encoded each Page Name features into numerical format. This features can be used with rest of the time stamps as combined features so as to train the model. We can make the same modifications on Test data too.

Now that our data is ready, we can proceed with the modeling.

Metric for Model Evaluation

Before we could talk about modeling, first let’s have a look at the metric which can be used to evaluate the performance of the ML/DL models. For this problem, we will be using Metric provided by Kaggle and that is SMAPE (Symmetric Mean Absolute Percentage Error) which is formulated as,

SMAPE formulation

Defining function for returning the SMAPE for forecast values,

#metric for evaluation which accepts Y_True and Y_Pred
def smape(ytr, ypr):
numerator = np.abs(ypr - ytr)
#print(numerator)
denominator = (np.abs(ytr) + np.abs(ypr))
#print(denominator)
return (200/len(ytr) * np.sum( numerator / denominator ))

Modeling : ML Modeling

With the prepared data, where we already extracted Page features, we will perform both Machine Learning(ML) Modeling and Deep Learning(DL) Modeling as well.

Before proceeding to our modeling, we will perform Train and Cross Validation analysis few of the previous values. This is because the time series is more likely will depend on its previous values and hence, we can select few of the lagged observations to fit the model which can forecast the values for future time stamps.

lagged_window = [ 7, 30, 90, 365 ]

Since, page views are real number, and can have any non-zero value, hence, using these values directly in our modeling can create stability issue in modeling. Hence, we will standardize the values by taking log of the original values. Also, in this case, we cannot use natural log as page views can be 0, which can create problem when we will calculate SMAPE, and hence, we will be using log(1+x) for this problem.

ML Modeling : Window Mean

In this modeling, we consider mean of previous values for forecasting. The best lag window can be found out with the performance of model in CV data where we try to predict the page views for each day and later calculate SMAPE for the forecast values.

X = np.log1p(timeseries_X.iloc[:,-lag:])
mean_X = X.mean(axis=1)
#calculating SMAPE and MAPE
train_smape = smoothed_smape(Y, mean_X)
train_mape = mape(Y, mean_X)

Above plot is the mean page views of all page for Cross Validation dataset. We can see the model is under-fit to the given data. Kaggle scorer for this model on test data is found to be 44.92666.

ML Modeling : Linear Regression

Available time series data can be fit to Linear Regressor model, with loss function of Squared Loss and Huber loss. We will fit the model on different lagged windows, so as to find on how many previous time stamps the target page views are dependent.

To the lagged windows of the available time series, we will also add extra features that we extracted from Page Name of the given articles.

Therefore, we will be training the model on mentioned losses with data containing Lagged values as well as Page Features.

X = np.log1p(history.iloc[:, -best_window:])
#adding page features to the lagged values
X = np.hstack((X_pagename, X))
log_reg = SGDRegressor(alpha=best_alpha, loss='squared_loss', \
penalty='l2', fit_intercept=True, random_state=45)
log_reg.fit(X, Y)
#saving the best fit mode
joblib.dump(log_reg, './ML_MODEL-linear_regression_squareloss.sav')
X = np.log1p(history.iloc[:, -best_window:])
#adding page features to the lagged values
X = np.hstack((X_pagename, X))
log_reg = SGDRegressor(alpha=best_alpha, loss='huber', \
penalty='l2', fit_intercept=True, random_state=45)
log_reg.fit(X, Y)
#saving the best fit mode
joblib.dump(log_reg, './ML_MODEL-linear_regression_huberloss.sav')

Compared to square loss, model tried to fit to the existing train data, value started to increase for the dates for which we can see increase in mean page views.

Data with limited features, can be best fit using Tree based classifiers and hence, we will be fitting tree based classifiers to check the model performance.

ML Modeling : Tree Based Methods

With Decision Tree being the simplest tree based method, we will fit the model on our data considering selected lag windows.

#preparing data for test run
X = np.log1p(history.iloc[:, -best_window:])
#staking it with page features
X = np.hstack((X_pagename, X))
#fitting a RF Regressor model with best parameters
dt_regressor = DecisionTreeRegressor(max_depth = best_depth,
min_samples_split = best_split, random_state=45)
dt_regressor.fit(X, Y)
#saving the model
joblib.dump(dt_regressor, './ML_MODEL-decisiontree_regressor.sav')

Clearly, we can see that the model performance is far better in Decision Tree compared to that of previous models that we saw.

With the best decision tree that we obtain from above technique, we will fit the AdaBoost Regressor which used the decision tree for n number of times so as to reduce variance. Here we minimize square loss.

#preparing data for test run
X = np.log1p(history.iloc[:, -best_window:])
#staking it with page features
X = np.hstack((X_pagename, X))
#fitting a RF Regressor model with best parameters
adaboost = AdaBoostRegressor(base_estimator = best_decision_tree,
n_estimators = best_estimators, learning_rate = best_lr,
loss='square', random_state=45)
adaboost.fit(X, Y)
#saving the model
joblib.dump(adaboost, './ML_MODEL-adaboost_regressor.sav')

Performance of AdaBoost is average considering the mean values of the Cross Validation dataset. Compared to Decision Tree, it is comparatively less effective.

With Tree based methods are working fine with the data, we can use Ensembles to even further boost the performance.

ML Modeling : Ensembles

Modeling of Random Forest will be carried out with multiple high variance models so as to reduce overall variance of the model.

We did Randomized search for this modeling, which found the best parameters for the model

#preparing data for test run
X = np.log1p(history.iloc[:, -best_window:])
#staking it with page features
X = np.hstack((X_pagename, X))
randomforest = RandomForestRegressor(n_estimators = best_estimator,
max_depth = best_depth,
min_samples_split = best_split,
n_jobs=-1, random_state=45)
randomforest.fit(X, Y)
#saving the best fit mode
import joblib
joblib.dump(randomforest, './ML_MODEL-randomforest_regression.sav')

RF model tries to get close to the actual values in the initial run but eventually predicts the page views linearly as seen above.

Now, we will model with High Bias model with XGBoost Technique.

#preparing data for test run
X = np.log1p(history.iloc[:, -best_window:])
#staking it with page features
X = np.hstack((X_pagename, X))
xgboost = xgb.XGBRegressor(objective = 'reg:squarederror',
n_estimators = best_estimators,
max_depth = best_depth,
learning_rate = best_learning_rate,
colsample_bytree = best_colsample_bytree,
n_jobs=-1, random_state=45)
xgboost.fit(X, Y)
#saving the best fit mode
joblib.dump(xgboost, './ML_MODEL-xgboost_regressor.sav')

Over forecasting can be easily seen on the CV data with XGBoost Modeling. Also, the mean values of page views are also predicted to be very high but comparing these values with the mean page view distribution, the nature of the actual observations are maintained in this.

Tabulating the performance of each of ML model on Cross Validation data as well on actual Test data. Kaggle score for the Test data is also added.

Modeling : DL Modeling

With the trending Deep Learning modeling, we can try to find the best solution possible for this time series problem. As time series is a data comprising of series of dependent values, we can make use concept of RNNs to build the modeling. Also, treating this as 1-Dimensional array, we can also build the CNN modeling for the data.

#simple DNN model
def dense_model(shape, print_summary=False):
model = Sequential()
model.add(Dense(512, activation='tanh', input_shape=(shape, )))
model.add(Dense(128, activation='tanh'))
model.add(Dense(64, activation='tanh'))
model.add(Dense(8, activation='tanh'))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
optimizer='adam', metrics=[tf_smape])
return model
#making the train data
X = np.log1p(X_timeseries.iloc[:,-best_window-1:-1])
Y = np.log1p(X_timeseries.iloc[:,-1])

#fitting model with best parameters
model = dense_model(best_window, print_summary=True)
model.fit(X, Y, batch_size=batch, epochs=epoch, verbose=1)

#save the model weights
my_model = model.to_json()
with open("dnn_model.json", "w") as json_file:
json_file.write(my_model)
model.save_weights("dnn_model_weights.h5")

With the best lagged window, we can train the simple DNN model.

Fitting DNN model with train data

Model seems to be over-fitting to the training data. This over-fitting can be minimized with the help of hyper-parameter tuning of each of the layers of the model.

Long Short Term Memory (LSTMs) are the Deep Neural Network blocks where the sequence information can be preserved for that instance. Here, page views for article being the time dependent values, we can make use of LSTMs for our modeling.

def lstm(shape, print_summary=False):
model = Sequential()
model.add(LSTM(64, activation='tanh',
return_sequences=True, input_shape=(shape, 1)))
model.add(LSTM(64, activation='tanh'))
model.add(Dropout(0.5))
#model.add(Flatten())
model.add(Dense(1))
model.compile(loss='mean_squared_error',
optimizer='adam', metrics=[tf_smape])
return model
#fiting model with best parameters
X = np.log1p(X_timeseries.iloc[:,-best_window-1:-1])
r, c = X.shape
X = np.array(X).reshape((r, c, 1))
Y = np.log1p(X_timeseries.iloc[:,-1])

#fitting model wth best parameters
clear_session()
model = lstm(best_window, print_summary=True)
model.fit(X, Y, batch_size=batch, epochs=epoch, verbose=1)

#save the model weights
my_model = model.to_json()
with open("lstm_model.json", "w") as json_file:
json_file.write(my_model)
model.save_weights("lstm_model_weights.h5")

Fitting the model with best lagged data, to test the performance on CV data.

The data was under-fitting for the initial time stamps, but later was able to match up with the page view distribution. Let us understand how the same LSTMs work when we increase the LSTM layer in the model,

we will be using early stopping for these models by increasing the epochs for which the model is going to repeat itself.

from tensorflow.keras.callbacks import EarlyStopping
dl_stopping = EarlyStopping(monitor='tf_smape', patience=3)
def complex_lstm(shape, print_summary=False):
model = Sequential()
model.add(LSTM(64, activation='tanh',
return_sequences=True, input_shape=(shape, 1)))
model.add(LSTM(64, activation='tanh', return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(8, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
optimizer='adam', metrics=[tf_smape])
return model
#fiting model with best parameters
X = np.log1p(X_timeseries.iloc[:,-best_window-1:-1])
r, c = X.shape
X = np.array(X).reshape((r, c, 1))
Y = np.log1p(X_timeseries.iloc[:,-1])

#fitting model wth best parameters
model = complex_lstm(best_window, print_summary=True)
model.fit(X, Y, batch_size=batch, epochs=deep_epoch, \
verbose=1, callbacks=[dl_stopping])

#save the model weights
my_model = model.to_json()
with open("complex_lstm_model.json", "w") as json_file:
json_file.write(my_model)
model.save_weights("complex_lstm_model_weights.h5")

Under-fit model performance can be seen from forecasting for CV data.

Considering time series data as a 1 dimensional block of array, we can apply Convolution filter so as to extract the features to pass it to the dense layer of CNN model.

def complex_cnn_model(shape, print_summary=False):
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=2, padding='same', \
activation='relu',
kernel_initializer = 'he_uniform',
input_shape=(shape, 1)))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', \
activation='relu',
kernel_initializer = 'he_uniform'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', \
activation='relu',
kernel_initializer = 'he_uniform'))
model.add(Conv1D(filters=32, kernel_size=2, \
activation='relu',
kernel_initializer = 'he_uniform'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
optimizer='adam', metrics=[tf_smape])
return model

Let us merge the two modeling techniques, which is CNN and LSTM to make a new model which can preserve the previous values and also extract the features with convolution filter.

def cnn_lstm_model(shape, print_summary=False):
model = Sequential()
model.add(Conv1D(filters=128, kernel_size=2, activation='tanh',
kernel_initializer = 'he_uniform',
input_shape=(shape, 1)))
model.add(Conv1D(filters=128, kernel_size=2, activation='tanh',
kernel_initializer = 'he_uniform'))
model.add(MaxPooling1D(pool_size=3))
model.add(LSTM(128, activation='tanh', return_sequences=True))
model.add(LSTM(128, activation='tanh'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(8, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
optimizer='adam', metrics=[tf_smape])
return model
X = np.log1p(X_timeseries.iloc[:,-best_window-1:-1])
r, c = X.shape
X = np.array(X).reshape((r, c, 1))
Y = np.log1p(X_timeseries.iloc[:,-1])

#fitting model wth best parameters
model = cnn_lstm_model(best_window, print_summary=True)
model.fit(X, Y, batch_size=batch, epochs=epoch, verbose=1)

#save the model weights
my_model = model.to_json()
with open("cnn_lstm_model.json", "w") as json_file:
json_file.write(my_model)
model.save_weights("cnn_lstm_model_weights.h5")

Model performance deteriorates as the test time stamps increases.

Tabulating overall results of each of the DL models, where each of them was tested on cross validation dataset to find the best parameters so as to run the same on actual test data. Test score from Kaggle for few of the models also included in this.

Conclusion and Areas of Improvement :

Concluding the complete analysis done on this problem, we can say that, both ML and DL modeling with basic feature engineering give us considerably better results. Adding features extracted from Page Name, can certainly gave a good boost to ML modeling.

All above models, can be improved with intensive hyper-parameter search keeping in mind the impact of Over-fitting and Under-fitting. For DL Models, we can also do hyper-parameter search for each of the layer parameters. Also, regularization techniques can be implemented so as to to keep the weights optimal for better performance of the model.

For further improvement, we can also add new features like, day of the week, seasonal holidays etc.

References :

ML/AI Enthusiast