COVID-19: Pushing the Limits of Time Series Big Data

In Malaysia, coronavirus disease 2019 (COVID-19) was first detected on 25 January 2020 and has since spread widely, with more than 20,000 new cases per day reported from July to August 2021. COVID-19 data describing the pandemic trend around the globe is voluminous. How does Big Data help decision-makers understand the behaviour of the pandemic, which is crucial in responding to the situation? How can analytics on the COVID-19 spreading pattern, which is time-series in nature, provide insight that leads to a better response through forecasting future trends? This paper explains the concept of Big Data and presents applications that demonstrate its potential for responding to the pandemic. COVID-19 data analytics using a sliding window time-series forecasting method is proposed and demonstrated using data from 25 January until 10 October 2020 obtained from the websites of the Malaysian Ministry of Health and the Department of Statistics Malaysia. The analysis demonstrates the value gained from the data in the form of useful insights.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by a newly discovered coronavirus, first detected in Wuhan on 31 December 2019. It has the symptoms of a severe acute respiratory syndrome, such as coughing, persistent chest pain and difficulty in breathing (WHO, 2012). The disease was first reported in Malaysia on 25 January 2020 in 3 travellers from China, and this gradually led to the first sporadic infected person being reported on 11 March 2020. A sporadic case implies that the person did not contract the disease through overseas travel or close contact. Since then, the number of people infected has risen sharply. In Malaysia, a country of 31 million people, 15,657 people had been infected as of 11 October 2020, of whom 69.70 per cent had recovered and 157 had died, implying a fatality rate of 1.0%. These statistics suggest that Malaysia's attempts to break the chain of infection through several Movement Control Order (MCO) strategies since 18 March 2020 have shown a remarkable result (MOH, 2020; DOSM, 2020). Analysis of the current situation, its trends and its patterns is crucial in supporting and facilitating government decision-making, especially at a critical turning point of an event.

Problem Statement
In Malaysia, the first cases, imported by travellers from China, were reported on 25 January 2020, and a local sporadic infected person was detected on 11 March 2020. The Movement Control Order (MCO) was declared on 18 March upon the first spike in the epidemic trend. The third wave began at the end of September with far more positive cases than the maximum of the second wave, especially in Sabah and Kedah. With voluminous data generated daily across the country and the globe, how can Big Data be used to forecast the spreading pattern of the pandemic in Malaysia and thus assist the decision-making of the Health Ministry and the National Security Division? What is the duration from the onset of the infection until recovery or fatality? How would time series forecasting be useful in determining the duration of an infection?

The Role of Big Data in Managing the Covid19 Pandemic
With the advent of the Internet and Web 2.0 technology, voluminous data are being created every millisecond. However, these massive data remain useless in their raw form unless they are transformed into a form that assists problem solving and decision-making and provides useful insights that can eventually become an economic commodity.
Key industry players already apply big data. Amazon.com, for example, built its business strategy on personalized marketing and a recommendation system that targets segments of customers with similar interests, identified through the books they search for and buy. Digital marketers likewise gain a competitive edge by pushing ads to targeted customers on social media such as Facebook, Twitter, Instagram and YouTube after studying their online behaviour through sentiment analysis. Big data is characterized by five elements, namely Volume, Velocity, Variety, Veracity and Value, known as the 5V (O'Reilly Media, 2012; Hamdan et al., 2018), described as follows:

Volume
A huge amount of data is generated every second, for instance emails, websites, Internet of Things sensor readings, WhatsApp messages and status postings on social media such as Facebook, Twitter and Instagram. Due to the volume of data collected, alternative processing and analysis techniques are required to handle its size.

Velocity
Data is created and distributed almost instantaneously, for example when messages, photos, social media statuses, stories and live streams are sent over the Internet, cloud and mobile platforms. Current big data technology can process data in real time, such as radar tracking of flights and Global Positioning System (GPS) tracking of transport through apps such as Google Maps on mobile devices and laptops.

Variety
Data in the current Algorithm Age, or Algorithmic Economy, comes in various formats such as text, images, video and audio, both structured and unstructured, in contrast with the Information Age, which analysed structured data.

Veracity
Since 80% of the data is unstructured, extracting hidden patterns is very challenging because the data is dynamic, fast-changing and volatile and is subject to loss and noise, all of which influence accuracy, trustworthiness and data quality in general.

Value
The patterns extracted from the data give value for decision-making and problem-solving, which is the key importance of Big Data. For example, during the Obama election campaign in 2012, many data scientists were hired to mine data and conduct sentiment analysis to gauge the inclination of voters towards certain issues, which gave him a huge advantage (Hamdan, 2018).
Due to the 5V characteristics of Big Data, analysing voluminous pandemic data using conventional techniques such as databases, statistics and data engineering is no longer efficient or effective. A special database with higher capacity is required, one suitable for millions of records being collected and analysed within minutes to give useful information to users.

Big Data Applications for COVID 19
Some examples of Big Data applications are shortest-route search using Google Maps or Waze, recommendation systems such as Amazon.com for books and Trivago for hotels, and weather forecasting that can give early warning of expected storms, earthquakes and so on. Dr Kamran Khan, an epidemiologist and practising physician trained in advanced data analytics, launched BlueDot in 2015, a Toronto-based startup that developed proprietary software as a service capable of locating, tracking and predicting the spread of infectious disease. The BlueDot engine searches data every 15 minutes, every day, gathering information on over 150 diseases and syndromes around the world. He built the world's first Artificial Intelligence (AI)-based infectious disease surveillance system equipped with Big Data functionality, with a global early warning system capable of tracking and contextualizing infectious disease risks. BlueDot was the first to detect the epidemic, flagging a cluster of "unusual pneumonia" on 31 December 2019 from articles in Chinese that reported 27 pneumonia cases around a market selling seafood and live animals in Wuhan, China (Bragazzi et al., 2020; McCall, 2020).
BlueDot was even able to anticipate the spread based on the highest-volume flight movements from Wuhan to Bangkok, Hong Kong, Tokyo, Taipei, Phuket, Seoul and Singapore, which were in fact also the first places to record COVID-19 cases. BlueDot demonstrated that AI and Big Data can work alongside decision-makers to give useful insights and predictive analytics that speed up decisions, facilitate strategic problem solving and lead to innovative solutions.
Another prominent example of using Big Data to monitor the pandemic is Johns Hopkins University in the United States of America, which has dedicated a website, its Coronavirus Resource Center, to important sources of information on COVID-19, with real-time global data visualization of the spread of the virus using computational techniques, as shown in Figure 1. In terms of research, Qin et al. (2020) exploited Big Data to detect new suspected COVID-19 cases in 6-9 days and confirmed cases in 10 days, whereas Yang et al. (2020) predicted the COVID-19 epidemic peaks and sizes. In Iran, Ahmadi et al. (2020) studied the correlation between climatology parameters and the COVID-19 outbreak using sensitivity analysis. They found that low values of wind speed, humidity and solar radiation exposure are associated with a high rate of infection and support the survival of the virus. They also concluded that locations with high population density, intra-provincial movement and humidity are more at risk.
Besides that, the Ministry of Health, the National Security Division and the Department of Statistics Malaysia also have dedicated web pages for coronavirus updates, as shown in Figure 2. ESRI (https://coronavirus-nsesrimy.hub.arcgis.com/) is another website that displays the current COVID-19 situation using Geographical Information System (GIS) data, as in Figure 3. The chart shows the second and third waves of the coronavirus outbreak, where the number of active cases in the third wave is higher than the peak of the second wave. The MCO in the third wave focused on localities declared as red zones, with strict movement restrictions for those communities and enforcement of standard operating procedures (SOP) such as wearing a face mask, keeping physical distance and home quarantine for close contacts of an infected person while awaiting swab test results, as recommended by the Ministry of Health and the National Security Division.
Nevertheless, to gain value from Big Data, data analytics is the integral component that gives insights and forecasts trends based on the outbreak pattern. The next section presents the analysis of COVID-19 outbreak data in Malaysia using time series analysis and forecasts the possible future trend with predictive analytics.

Methodology
Time-series data analysis is used to forecast the spreading pattern of the pandemic. Since the infection lasts a maximum of 14 days according to WHO, a sliding window approach is adopted in data preparation to capture the range of days of infection. The cleansed data in sliding window format is then analysed using the Multiple Linear Regression (MLR) technique, whose performance is compared with that of an Artificial Neural Network (ANN). This study also builds on the importance of historical data in predicting future trends: Norita et al. (2005) and Wan Hussain et al. (2011) applied the sliding window method to time-series data of daily rainfall and reservoir water level to support early decision-making in flood emergency management of a water reservoir.

Sliding Window Time Series Data Analytics and Experimentation
Time series data are usually analysed as a single variable (univariate) varying over time, where the event at time t is defined by previous events on the time scale. The next output on the timeline is predicted from historical data such that $y_t = f(y_{t-n})$, where n is the size of the window and indicates the time frame being investigated, as illustrated in Figure 6 for window size = 1. The time frame may represent a temporal delay or lag that carries useful information that must be captured.
Figure 6: Sliding window conceptual view for window width = 1

The outcome at the current time t is based on the output at the prior time, t-1, as in Equation 1.

$$y_t = f(y_{t-1}) \tag{1}$$

If the window width is 2, then the equation is defined as

$$y_t = f(y_{t-1}, y_{t-2}) \tag{2}$$

Multistep forecasting can also be carried out to predict more than one future step, as shown in Equation 3.

$$\langle y_t, y_{t+1} \rangle = f(y_{t-1}, y_{t-2}) \tag{3}$$

Table 2 shows two-step forecasting with the data restructured for a sliding window of width = 1. Multivariate time series data can also benefit from the sliding window method.
Assume variables a and b at time t, where $y_t = f(a_t, b_t)$. The value of b at time t can be predicted as in Equation 4, with an example shown in Table 3. More than one output variable may also be predicted, as with an ANN, for example predicting both variables a and b at time t as shown in Equation 5.

$$\langle a_t, b_t \rangle = f(a_{t-1}, b_{t-1}) \tag{5}$$

The choice of window width and forecasting steps depends on the nature of the problem or the questions asked of the analysis. The accuracy of the analysis determines which window size gives the best performance.
In this study, two experiments on time series data analytics were conducted on the dataset restructured with the sliding window approach, using (a) the multiple regression technique and (b) a multi-layer perceptron, a feedforward artificial neural network.
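The restructuring described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used MS Excel and Weka); the function name `sliding_window`, its parameters and the toy values are assumptions made for illustration only.

```python
def sliding_window(series, width=1, steps=1):
    """Restructure a univariate series into (inputs, targets) rows.

    Each row pairs `width` consecutive past values (oldest first) with the
    `steps` values that immediately follow them.
    """
    if width < 1 or steps < 1:
        raise ValueError("width and steps must be at least 1")
    rows = []
    for start in range(len(series) - width - steps + 1):
        inputs = list(series[start:start + width])
        targets = list(series[start + width:start + width + steps])
        rows.append((inputs, targets))
    return rows


# Placeholder values for illustration only, not the study's data.
toy_series = [10, 12, 15, 19, 24, 30]

width_two = sliding_window(toy_series, width=2)          # window width = 2
two_step = sliding_window(toy_series, width=1, steps=2)  # two-step forecast

for inputs, targets in width_two:
    print(inputs, "->", targets)
```

Each row produced this way can be treated as one training example, so an ordinary regression or neural network learner can be applied to what was originally a single column of time-ordered values.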

Time Series Data Analytics and Forecasting On Covid 19 Outbreak In Malaysia
The data analytics begins with the primary question of interest about the outbreak:

How different is the trend of the third wave outbreak compared to the second wave?
COVID-19 outbreak data in Malaysia from 25 January 2020 until 11 October 2020 (261 days) is used for the data analytics and to forecast future trends with (a) polynomial estimation and (b) sliding window time series forecasting with multiple regression and a multi-layer perceptron (MLP) feedforward artificial neural network (ANN) architecture.
The sliding window approach is simple and intuitive: temporal sequences are recorded based on a predefined time frame and transformed into a classification problem (Norita, 2004; Al-Turaiki, 2016). Analysis and experimentation were done using MS Excel Data Analytics and Weka 3.8.3 Forecasting to generate the predictive models.

Polynomial Estimation on The General Trend on COVID-19 Cases in Malaysia
During the second wave, the MCO was declared by the government at the national level and enforced by the police and the army. Movement was restricted to one member of each family, preferably a male, to obtain daily necessities while observing the SOP. During the third wave, however, the MCO is confined to particular communities within identified red zone areas. Currently, the states of Sabah and Kedah are under an enhanced conditional MCO.
Based on the data collected, the projection of cumulative cases was estimated from a polynomial trendline generated through 5th-order curve-fitting of the actual data. The estimated trendline is given by Equation 6, with R² = 0.991:

$$y = -8\times10^{-9}x^5 + 5\times10^{-5}x^4 - 0.0226x^3 + 3.488x^2 - 115.36x + 695.5 \tag{6}$$

where y and x represent total cases and days, respectively. The total number of cases on 12 October 2020 (day 262), which falls in the third wave of the outbreak, is estimated by Equation 6 at 29,170 persons infected. The projection shows an escalating spreading rate for total positive cases. The polynomial predictive model seems unrealistic, with an 80% increase in cumulative cases. Nevertheless, it indicates a high-risk situation in which the community must strictly observe physical distancing, wear face masks, self-quarantine where necessary and heighten hygiene practices. Figure 8 illustrates the pattern of new cases reported and shows that the spike of new cases detected in the current wave is much higher than the maximum during MCO1 and MCO2. The community must discipline itself in following the SOP to break the chain of infection as soon as possible.

The next section presents the sliding window approach used for data preprocessing before the predictive model is trained with the restructured dataset.
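As an arithmetic check on the polynomial projection, the short Python sketch below evaluates Equation 6 at day 262 using the rounded coefficients as printed. The helper name `evaluate_polynomial` is an assumption for illustration; the original trendline was produced in MS Excel, whose unrounded coefficients are not reported here.

```python
# Coefficients of Equation 6 as printed (rounded), highest power first.
EQ6_COEFFS = [-8e-09, 5e-05, -0.0226, 3.488, -115.36, 695.5]


def evaluate_polynomial(coeffs, x):
    """Evaluate a polynomial with Horner's rule; coeffs run from highest power down."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


day = 262  # 12 October 2020, the day index used in the text
projected_total = evaluate_polynomial(EQ6_COEFFS, day)
print(round(projected_total))
```

With these rounded coefficients the result rounds to 29,170, matching the figure quoted for day 262.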

A. Data Preparation with Sliding Window Representation
In this experiment, the analysis was conducted on first and second wave data, specifically recovery cases of COVID-19 patients from 25 January until 30 April 2020, indicated by the cumulative number of infected persons discharged from hospital.
The purpose of the experiment is to answer the following questions:
a. Is there a specific time frame from previous events that may influence the future outcome?
b. What is the optimum sliding window width that gives good predictive analytics performance?
c. Is the result consistent with other techniques?

A. Sliding Window Time Series Forecasting with Multiple Regression
Sliding window sizes ranging from 2 to 16 were used because of the limit on the number of predictor variables for multiple regression in MS Excel. Table 7 presents the performance of the multiple regression by window size n at a 95% confidence level, where window size n is statistically significant when its significance F value is below 0.05. The regression models using the sliding window perform far better than the polynomial regression y = f(t), which corresponds to window width = 0, presented in Equation 6. The experiment shows that historical time series or temporal data do influence the current outcome because of their continuity on the timeline. Thus, we may conclude that the number of discharged patients at time t may be explained by an event at t-n, where n is the nth previous time frame. The results also show that windows of width 5 and 14 gave the best performance, with a p-value below 0.05 at each of these window widths, which is statistically significant. This result is also consistent with the current practice for the number of days of quarantine following infection or close contact.
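To make the idea of regressing the current value on its n previous values concrete, the following is a minimal pure-Python sketch of ordinary least squares on lagged inputs. It is an illustration under stated assumptions, not the Excel procedure used in the study: the function names are hypothetical, the fit uses the normal equations, and the toy series is a Fibonacci sequence chosen only because its true relation (each value is the sum of the previous two) is known.

```python
def lagged_rows(series, width):
    """Return (X, y): each X row holds the `width` values preceding its target."""
    count = len(series) - width
    X = [list(series[i:i + width]) for i in range(count)]
    y = [series[i + width] for i in range(count)]
    return X, y


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: lagged columns are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_multiple_regression(X, y):
    """Least squares with an intercept via the normal equations (X'X) beta = X'y."""
    rows = [[1.0] + [float(v) for v in row] for row in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(coeffs, window):
    """Apply the fitted intercept and lag coefficients to one window (oldest first)."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], window))


# Toy series (Fibonacci numbers), not the study's discharge data.
toy = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
X, y = lagged_rows(toy, width=2)
coeffs = fit_multiple_regression(X, y)
next_value = predict(coeffs, toy[-2:])
```

On this toy series the fitted intercept is approximately 0 and both lag coefficients are approximately 1, so the forecast for the next value is approximately 89. For real data, comparing fit quality across several window widths would follow the same pattern as the comparison reported in Table 7.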

B. Sliding Window Time Series Forecasting with Multilayer Perceptron Using Lag
Based on the performance of the regression model, the Multilayer Perceptron (MLP) analysis was conducted with a window of width = 14 using lag in Weka 3.8.3. The data from Excel was converted to CSV format and saved with the .arff extension. The results of the four experiments conducted are tabulated in Table 9. According to Table 9, the best result, with the lowest Mean Square Error, uses the original discharge data with lag set to a minimum of 5 and a maximum of 14, following the result from the regression model.

[Figure: MLP neural network architecture]

The predictive model developed is data dependent. Table 10 compares the predicted values with the actual data.
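The conversion of tabular data to Weka's ARFF format mentioned above can be sketched as follows. This is a minimal illustration, assuming all columns are numeric; the function name `to_arff`, the relation and attribute names, and the sample rows are hypothetical and are not taken from the study's files.

```python
def to_arff(relation, attribute_names, rows):
    """Render numeric rows as ARFF text: a header section followed by a data section."""
    lines = ["@relation " + relation, ""]
    for name in attribute_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("")
    lines.append("@data")
    for row in rows:
        if len(row) != len(attribute_names):
            raise ValueError("row length does not match the number of attributes")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names and placeholder values, not the study's data.
arff_text = to_arff(
    "covid19_discharged",
    ["day", "cumulative_discharged"],
    [(1, 0), (2, 0), (3, 1)],
)
print(arff_text)
```

The resulting text can be written to a file with the .arff extension and opened in Weka, where the lag range is then configured in the forecasting tool rather than in the file itself.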

MR: Multiple Regression; MLP: Multilayer Perceptron
The window width = 14 implies that the current outcome may be predicted using a two-week time frame. This agrees with the current official estimate of the incubation period of COVID-19, which ranges from 2 to 14 days (Worldometer, 12 March). Further evaluation is needed to confirm whether the window width or lag size does represent the incubation period of the coronavirus. This result shows how data analytics helps decision-makers, health practitioners and environmentalists understand the situation through the patterns emerging from the analysis.

Conclusion
The COVID-19 outbreak data has created interest in studying the trend and tracking the spread, which has demonstrated the capabilities of big data through visualization of the situation and through data analytics with statistical and mathematical models and machine learning algorithms that investigate the behaviour of the epidemic. Voluminous data gains no value without the analytical component that gives insight into the situation, helping decision-makers, health practitioners and environmentalists understand it through the patterns emerging from the analysis.
This study also demonstrates the use of sliding window time series forecasting, a convenient method for presenting the data as a classification problem for further analysis. Thus, in data preparation, the actual data was restructured so that it represents a temporal classification or regression problem. The Multilayer Perceptron neural network shows superior performance compared with multiple regression in forecasting future trends of the coronavirus outbreak.