Abstract:
This study uses a quantitative comparative design to forecast fine particulate
matter (PM2.5) concentrations in Colombo, Sri Lanka, using particulate data from
the U.S. Embassy and meteorological data (i.e., air temperature at 2 meters, relative
humidity at 2 meters, and wind speed at 2 meters) from the NASA Open Data Portal.
The dataset spans January 2018 to July 2024, with 80% of the data (January 2018 to
September 2022) used for training and the remaining 20% for testing. Data
preprocessing involved interpolation of missing values, normalization, and
engineering of lagged variables such as 24-hour PM2.5 lags. Two predictive models
were compared: Seasonal Autoregressive Integrated Moving Average (SARIMA),
which captures seasonal patterns, and Extreme Gradient Boosting (XGBoost), which
models nonlinear relationships and complex feature interactions. Model
performance was evaluated using root mean square error (RMSE), mean absolute
percentage error (MAPE), and R-squared metrics. The SARIMA model,
implemented with the “forecast” package in R, achieved an RMSE of 16.99, a MAPE
of 20.35%, and an R-squared of 0.68, and demonstrated superior seasonality
modeling with a lower Bayesian Information Criterion (BIC). The XGBoost model,
trained with the “xgboost” package in R, leveraged advanced regularization and
parallel processing to reduce prediction errors by 18-22%, excelled in forecasting
extreme pollution events, and identified lagged PM2.5 values as the most influential
predictors. Diagnostic analyses showed that SARIMA modeled seasonality and residual
behavior effectively, while XGBoost was better at capturing key predictors and
nonlinear effects. The findings underscore the potential of advanced machine
learning, especially XGBoost and ensemble methods, for accurate and timely air
quality forecasting tailored to Sri Lanka’s climatic and emission context. The study
recommends integrating such models into national air quality systems to enable real-
time forecasts and health alerts, expanding the air sensor network in high-risk urban
areas, and enforcing targeted pollution regulations. Future research should explore
hybrid modeling using satellite and real-time traffic data, alongside explainable AI
techniques, to enhance forecast accuracy and support data-driven environmental
policy.
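
The evaluation workflow summarized above (lagged PM2.5 features, a chronological 80/20 split, and scoring with RMSE, MAPE, and R-squared) can be illustrated with a minimal sketch. The study itself was implemented in R; the Python below is not the authors' code. The function names, the toy PM2.5 values, the 2-step lag, and the persistence forecast used as a stand-in predictor are all assumptions made for illustration, and the numbers are not results from the study.

```python
import math


def add_lag(series, lag):
    """Pair each observation with the value seen `lag` steps earlier.

    Returns (lagged_value, target) tuples; the first `lag` points are
    dropped because they have no lagged predecessor.
    """
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows without shuffling, so the test set is the most recent data."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical PM2.5 readings (made-up values, not from the study's dataset).
pm25 = [20.0, 22.0, 25.0, 30.0, 28.0, 26.0, 24.0, 27.0, 31.0, 29.0, 23.0, 21.0]

# The abstract mentions 24-hour lags; a 2-step lag keeps the toy example short.
rows = add_lag(pm25, lag=2)
train, test = chronological_split(rows, train_fraction=0.8)

# Persistence forecast as a placeholder model: predict the lagged value itself.
actual = [target for _, target in test]
predicted = [lagged for lagged, _ in test]

score_rmse = rmse(actual, predicted)
score_mape = mape(actual, predicted)
score_r2 = r_squared(actual, predicted)
print(score_rmse, score_mape, score_r2)
```

A fitted SARIMA or XGBoost model would replace the persistence forecast; the split and the three metric functions would stay the same. Splitting by position rather than at random keeps future observations out of the training set, which matters for any model that uses lagged values as predictors.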