Predicting sales of items for the next 28 days.
Table of contents:
- Business Problem.
- Source of data.
- Use of Machine learning.
- Existing approaches to the problem.
- Exploratory Data Analysis.
- My first cut approach to the problem.
- Models explanation.
- Comparison of the models.
- Future work.
1. Business Problem
Big retail stores like Walmart, which operate chains of hypermarkets, discount department stores, and grocery stores, have to keep track of the sales and pricing of goods. They have to keep a balance between the demand for products and the stock of products in their stores. Therefore, estimating product sales is very beneficial for the revenue of the business.
The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, and this is their fifth iteration (M5 Forecasting), which is hosted on Kaggle.
Our main objective is to predict the sales of each item in a given store for the next 28 days.
2. Source of data
The dataset is directly available on Kaggle.
The dataset consists of sales of 3049 items across 10 stores in 3 US states over the previous 1941 days. Apart from the historical sales data, we also have the price of each item at the corresponding store, and calendar information such as events on each date.
The data is available in CSV file format as follows.
calendar.csv - Contains information about the dates on which the products are sold.
sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913].
sample_submission.csv - The correct format for submissions.
sell_prices.csv - Contains information about the price of the products sold per store and date.
sales_train_evaluation.csv - Includes sales [d_1 - d_1941].
3. Use of Machine learning
The time-series problem can be converted into a supervised machine learning problem with some feature engineering techniques. It can be framed as a regression problem with generated inputs as features and the sales of items as the target variable.
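As a minimal sketch of this framing (toy data, NumPy only), a sliding window turns the series into feature rows and targets:

```python
import numpy as np

def series_to_supervised(series, n_lags):
    """Turn a 1-D series into (X, y) pairs: each row of X holds the
    previous n_lags values, and y holds the value to predict."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return np.array(X), np.array(y)

sales = np.array([3, 1, 4, 1, 5, 9, 2, 6])
X, y = series_to_supervised(sales, n_lags=3)
# X[0] is [3, 1, 4] and y[0] is 1: any regressor can now be fitted on (X, y).
```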
The accuracy of the point forecasts will be evaluated using the Root Mean Squared Scaled Error (RMSSE), which is a variant of the well-known Mean Absolute Scaled Error (MASE). The measure is calculated for each series as follows:
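As defined in the M5 competitors' guide, for a series with $n$ historical observations and a forecasting horizon $h$:

$$\mathrm{RMSSE}=\sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t-\hat{Y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(Y_t-Y_{t-1}\right)^2}}$$

where $Y_t$ is the actual sales value on day $t$ and $\hat{Y}_t$ the forecast.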
- The measure is scale independent, meaning that it can be effectively used to compare forecasts across series with different scales.
- The measure penalizes positive and negative forecast errors, as well as large and small forecasts, equally, thus being symmetric.
For more info about metric, go through the competition guidelines.
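A per-series implementation of this metric might look like the following sketch (the function name and toy numbers are illustrative):

```python
import numpy as np

def rmsse(y_train, y_true, y_pred):
    """Root Mean Squared Scaled Error for one series.
    y_train: historical values the model was fitted on (length n).
    y_true, y_pred: actuals and forecasts over the horizon h."""
    y_train = np.asarray(y_train, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Denominator: mean squared one-step naive error on the training data.
    scale = np.mean(np.diff(y_train) ** 2)
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / scale)

score = rmsse([2, 4, 6, 8], [10, 12], [9, 13])  # 0.5 on this toy series
```

Because the numerator and denominator are both squared errors on the same scale, the ratio is scale-independent, which is what allows averaging across series.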
4. Existing approaches to the problem.
Following are some naïve and statistical approaches that were most commonly observed in Kaggle solutions:
- Taking the average of all historical data as the prediction for next-day sales.
- Taking the mean of the previous 10, 20, 30, or 50 days of sales as the prediction for next-day sales.
- Using time-series methods like naïve seasonal forecasting, moving averages, exponential smoothing, and ARIMA.
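The first three baselines can be sketched in a few lines on a toy history (the numbers are made up):

```python
import numpy as np

history = np.array([5, 7, 6, 8, 9, 7, 10, 8, 9, 11])  # daily unit sales

# Overall-average baseline: predict the mean of all history.
pred_overall = history.mean()

# Last-k-days baseline: predict the mean of the most recent k days.
def last_k_mean(series, k):
    return series[-k:].mean()

pred_last3 = last_k_mean(history, 3)  # mean of [8, 9, 11]

# Naive seasonal forecast with weekly seasonality: repeat the value
# observed 7 days earlier.
pred_seasonal = history[-7]
```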
The other type of method uses machine learning algorithms: first preprocessing the data and converting it into a supervised learning format, then applying whichever regression algorithm gives the best result.
The machine learning algorithms mostly outperformed the statistical methods, and are therefore preferred.
Improvements that can be added:
- For the machine learning methods, not much hyperparameter tuning was done. Appropriate parameters can be found using cross-validation techniques.
- The interpretability of the models was also not showcased. For a store manager, it is necessary to have reasons behind the predicted sales of a particular item.
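One way to tune hyperparameters without leaking future data is scikit-learn's TimeSeriesSplit, which keeps every validation fold strictly after its training fold. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 days of a toy feature
y = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    # Each validation fold starts after its training fold ends,
    # so hyperparameters are never chosen using future information.
    assert train_idx.max() < val_idx.min()
```

Each of these chronological folds can then score a candidate parameter set, exactly as ordinary k-fold cross-validation would.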
5. Exploratory Data Analysis.
- After some decline in sales in 2011, an upward trend is observed.
- Zero sales are observed on one day each year: Christmas (December 25).
- Compared to the other years, slightly faster growth is observed in 2016.
- Compared to the other states, California (CA) sales performed better, while the sales of Texas (TX) and Wisconsin (WI) intersect each other, ending with a rise in Wisconsin (WI).
- Within a year, the rise in sales was mostly observed in the pre-autumn months (Aug-Oct).
- In 2014-15 a dip in sales was observed, which was severe in CA and slightly milder in the other two states.
Aggregating sales at the state-store level
- An almost similar trend is observed across all stores of California, except the CA_2 store. From mid-2014 the sales of CA_2 started declining, met CA_4 in mid-2015, then increased suddenly and followed the CA_1 trend.
- The sales of TX_1 and TX_2 follow a similar trend, while the sales of TX_3 often intersect with TX_1 in the early years (2011-13) and with TX_2 in the later years (2015-16).
- No similar trend is observed among the sales of the Wisconsin (WI) stores.
- Store WI_3 had the highest sales in the early years (2011-13) but declined in the middle years (2013-14) and started increasing again in 2015.
- Stores WI_1 and WI_2 had almost similar sales at the beginning. The sales of WI_2 suddenly increased in mid-2012 and followed an increasing trend, whereas WI_1 suddenly increased (though not up to the level of WI_2) at the end of 2012 and likewise kept increasing.
- FOODS is the most sold category, followed by HOUSEHOLD and HOBBIES. People naturally need essential items more than other things.
- FOODS_3 drives the sales of the FOODS category, while FOODS_2 picks up a little at the end.
- HOUSEHOLD_1 tends to follow an increasing trend, while the other three (HOBBIES_1, HOBBIES_2, HOUSEHOLD_2) seem to have settled in the same range of sales.
Comparing sales on normal days vs event and snap days
- Average sales on non-event days are slightly greater than average sales on event days for all categories.
- On SNAP days the average sales are slightly higher than on normal days, as item prices are lower.
- Among all categories, FOODS_3 shows the largest difference between sales on SNAP days and normal days.
- An upward trend in overall sales was observed.
- Among the three states, California (CA) has the highest sales, as it has more stores and a larger population than the other states.
- The FOODS category is the most sold, as it is more essential than household and hobbies items.
- Sales did not vary much on event days but increased slightly on SNAP days because of lower prices.
- There were 4 types of events, of which religious events occurred most often.
- The CA_3 store has the highest sales among all stores, whereas the lowest sales are observed in CA_4. Location and population may be the reason: CA_3 may be in an urban area and CA_4 in a more remote one.
- People mostly like to shop on weekends, as most people have a break then.
- Sales mostly peak at the end of summer (August)/start of autumn, with a sudden dip in May, and are stable in winter.
6. My first cut approach to the problem
- To apply machine learning techniques to time-series forecasting data, the dataset needs to be converted into a single supervised-learning dataset. The data is spread across three files (sales, calendar, prices), and the sales data is in wide form: each day is a column for each product, and the sales are the values. It is therefore melted into long form, keeping all product features and the date as common features and the sales as the target variable, and then merged with the calendar and price data on date and id.
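A toy sketch of the wide-to-long melt and the calendar merge (the column names mimic the dataset, but the values are made up):

```python
import pandas as pd

# Toy wide-form sales: one row per item, one column per day.
sales = pd.DataFrame({
    "id": ["ITEM_1", "ITEM_2"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
})

calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
    "event_name": [None, "SomeEvent"],
})

# Wide -> long: each (item, day) pair becomes one row, sales is the target.
long = sales.melt(id_vars=["id", "store_id"], var_name="d", value_name="sales")

# Attach calendar info (dates, events) on the day key.
long = long.merge(calendar, on="d", how="left")
```

The price data would be merged the same way, joining on store, item, and week.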
- Downcasting means type-casting the data to reduce the amount of storage it uses. As pandas automatically creates int32/int64 or float32/float64 columns for numeric data, we can convert them to int8/int16 or float16/float32 where the value ranges allow. Also, pandas stores string columns as the object dtype, which takes more storage than the category dtype, so object columns are converted to category.
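A minimal pandas sketch of this downcasting (toy frame):

```python
import pandas as pd

df = pd.DataFrame({
    "sales": [0, 3, 12, 7],                     # defaults to int64
    "price": [1.5, 2.25, 0.5, 3.0],             # defaults to float64
    "store": ["CA_1", "CA_1", "TX_2", "TX_2"],  # object dtype
})

# Shrink numeric columns to the smallest dtype that fits the values.
df["sales"] = pd.to_numeric(df["sales"], downcast="integer")  # -> int8
df["price"] = pd.to_numeric(df["price"], downcast="float")    # -> float32
# Repeated strings compress well as a categorical.
df["store"] = df["store"].astype("category")
```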
- After the EDA, preprocessing the data: dealing with NaN values depending on the feature, and handling missing values with imputation techniques.
- Encode categorical features for models that do not have built-in handling of categorical features. Feature engineering is done by introducing new features like lag and rolling features. Lag features in a time series are the previous k values [(t-1), (t-2), ..., (t-k)]. Rolling features are similar, but apply a function like mean or median over the last k values to create a new feature.
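A sketch of per-item lag and rolling features with pandas (toy frame; the shift before rolling keeps the current day's target out of its own window):

```python
import pandas as pd

df = pd.DataFrame({
    "id":    ["A"] * 5 + ["B"] * 5,
    "day":   list(range(1, 6)) * 2,
    "sales": [3, 1, 4, 1, 5, 2, 7, 1, 8, 2],
})

g = df.groupby("id")["sales"]
# Lag feature: the same item's sales on the previous day.
df["lag_1"] = g.shift(1)
# Rolling feature: mean of the previous 3 days, computed per item so
# values never leak across item boundaries.
df["roll_mean_3"] = g.transform(lambda s: s.shift(1).rolling(3).mean())
```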
- Time-based splitting of the data: days 1914-1941 are used for validation and days 1942-1969 for testing.
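The split itself is a simple mask on the day index (toy frame with days 1-1969):

```python
import pandas as pd

df = pd.DataFrame({"day": range(1, 1970), "sales": 0})

train = df[df["day"] <= 1913]            # historical data
valid = df[df["day"].between(1914, 1941)]  # 28-day validation window
test  = df[df["day"].between(1942, 1969)]  # 28-day test window
```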
- After completing all of the above processing, the following models were tried.
7. Models explanation
Four models were tried: one simple baseline model and three boosting models.
- Simple Exponential Smoothing: Exponential smoothing is a rule-of-thumb technique for smoothing time-series data using the exponential window function. Whereas in a simple moving average the past observations are weighted equally, exponential functions assign exponentially decreasing weights over time.
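The smoothing recursion can be written directly (a sketch; alpha and the toy series are illustrative, and the flat forecast for all 28 future days is the final smoothed level):

```python
import numpy as np

def simple_exp_smoothing(series, alpha):
    """Return the smoothed level after the whole series:
    level_t = alpha * y_t + (1 - alpha) * level_{t-1}."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

sales = np.array([10.0, 12.0, 11.0, 13.0])
forecast = simple_exp_smoothing(sales, alpha=0.5)  # used for every future day
```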
- XGBoost, short for "Extreme Gradient Boosting," is a library that provides an efficient implementation of the gradient boosting algorithm. The main benefits of the XGBoost implementation are computational efficiency and often better model performance.
- After tuning parameters like the learning rate and max_depth, an XGBRegressor model was fitted and used for prediction.
- CatBoost is a third-party library developed at Yandex that provides an efficient implementation of the gradient boosting algorithm. The primary benefit of the CatBoost (in addition to computational speed improvements) is support for categorical input variables. This gives the library its name CatBoost for “Category Gradient Boosting.”
- After tuning parameters like the learning rate and depth, a CatBoostRegressor model was fitted.
- LightGBM, short for Light Gradient Boosted Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm. The primary benefit of the LightGBM is the changes to the training algorithm that make the process dramatically faster, and in many cases, result in a more effective model.
- After tuning parameters like the learning rate and max_depth, an LGBMRegressor model was fitted.
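All three boosting libraries share the same fit/predict pattern. The sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in (XGBRegressor, CatBoostRegressor, and LGBMRegressor are near drop-in analogues with their own learning-rate and depth parameters), on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # stand-in for lag/rolling features
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)   # stand-in sales target

# The tuned knobs named above map to the same concepts in every library:
# learning_rate, tree depth, and the number of boosting rounds.
model = GradientBoostingRegressor(learning_rate=0.1, max_depth=3,
                                  n_estimators=100)
model.fit(X[:150], y[:150])   # chronological split: fit on the past
preds = model.predict(X[150:])  # predict the held-out "future" rows
```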
8. Comparison of the models.
Here are the Kaggle leaderboard private and public scores of all models.
- From the above table, we can conclude that CatBoostRegressor is the best model.
- Out of 5558 participants, the ranks for a score of 0.685 were in the 490-500 range, i.e. the score falls within roughly the top 10% of the leaderboard.
9. Future work.
- For calculating lag and rolling features on the test data, try using the prediction of the previous day's sales instead of reusing the validation data's feature values.
- Adding more statistical features.
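The recursive idea in the first point can be sketched with a toy one-step model (here just a naive last-value forecast standing in for a trained regressor):

```python
def predict_next(history):
    # Toy stand-in for a trained model: naive last-value forecast.
    return history[-1]

history = [4.0, 5.0, 6.0]
forecasts = []
for _ in range(28):
    yhat = predict_next(history)
    forecasts.append(yhat)
    # Feed the prediction back so the next day's lag/rolling features
    # are built from predicted sales rather than known future values.
    history.append(yhat)
```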