# Probabilistic Reasoning for Sequential Data

In the world of machine learning, we encounter many types of data, such as images, text, video, sensor readings, and so on.

Different types of data require different types of modelling techniques.

Sequential data refers to data where the ordering is important.

Dammnn, May 10

Picture from: https://en.wikipedia.org/wiki/Supervised_learning

Time-series data is a particular manifestation of sequential data.

It consists of time-stamped values obtained from a data source such as sensors, microphones, or stock markets.

Time-series data has a lot of important characteristics that need to be modelled in order to effectively analyze the data.

The measurements that we encounter in time-series data are taken at regular time intervals and correspond to predetermined parameters.

These measurements are arranged on a timeline for storage, and the order of their appearance is very important.

We use this order to extract patterns from the data.

In this chapter, we will see how to build models that describe the given time-series data or any sequence in general.

These models are used to understand the behaviour of the time series variable.

We then use these models to predict the future based on past behaviour.

Time-series data analysis is used extensively in financial analysis, sensor data analysis, speech recognition, economics, weather forecasting, manufacturing, and many other fields.

We will explore a variety of scenarios where we encounter time-series data and see how we can build a solution.

We will be using a library called Pandas to handle all the time-series related operations.

We will also use a couple of other useful packages like hmmlearn and pystruct during this chapter.

Make sure you install them before you proceed.

You can install them by running the following commands on your Terminal:

    $ pip3 install pandas
    $ pip3 install hmmlearn
    $ pip3 install pystruct
    $ pip3 install cvxopt

If you get an error when installing cvxopt, you will find further instructions at http://cvxopt.org/install.

Now that you have successfully installed the packages, let's go ahead to the next section.

## Handling time-series data with Pandas

Let’s get started by learning how to handle time-series data in Pandas.

In this section, we will convert a sequence of numbers into time series data and visualize it.

Pandas provides options to add timestamps, organize data, and then efficiently operate on it.
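As a minimal sketch of that idea, the snippet below attaches monthly timestamps to a plain sequence of numbers. The values, the start date `2000-01`, and the month-start alias `'MS'` are illustrative assumptions and are not taken from the article's data file.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements with no time information attached
values = np.array([12.0, 15.5, 11.2, 18.9, 20.1, 17.3])

# Build one timestamp per value, spaced one month apart ('MS' = month start)
timestamps = pd.date_range(start='2000-01', periods=len(values), freq='MS')

# Use the timestamps as the index to obtain a time series
series = pd.Series(values, index=timestamps)
print(series)
```

Once the index holds timestamps, plotting, slicing, and rolling computations can all refer to dates rather than integer positions.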

Create a new Python file and import the following packages:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

Define a function to read the data from the input file. The parameter index indicates the column number that contains the relevant data:

    def read_data(input_file, index):
        # Read the data from the input file
        input_data = np.loadtxt(input_file, delimiter=',')

Define a lambda function that converts a year and a month into a date string that Pandas can parse:

        # Lambda function to convert year and month values to a date string
        to_date = lambda x, y: str(int(x)) + '-' + str(int(y))

Use this lambda function to get the start date from the first line in the input file:

        # Extract the start date
        start = to_date(input_data[0, 0], input_data[0, 1])

The date range we create next uses month-end timestamps, so the last month is only included if the end date falls after it. We therefore increase the date field in the last line by one month:

        # Extract the end date
        if input_data[-1, 1] == 12:
            year = input_data[-1, 0] + 1
            month = 1
        else:
            year = input_data[-1, 0]
            month = input_data[-1, 1] + 1

        end = to_date(year, month)

Create a list of indices with dates using the start and end dates with a monthly frequency:

        # Create a date list with a monthly frequency
        # ('M' is the month-end alias; pandas 2.2 and later prefer 'ME')
        date_indices = pd.date_range(start, end, freq='M')

Create a Pandas data series using the timestamps:

        # Add timestamps to the input data to create time-series data
        output = pd.Series(input_data[:, index], index=date_indices)

        return output

Define the main function and specify the input file:

    if __name__=='__main__':
        # Input filename
        input_file = 'data_2D.txt'

Specify the columns that contain the data:

        # Specify the columns that need to be converted
        # into time-series data
        indices = [2, 3]

Iterate through the columns, read the data in each column, and plot it:

        # Iterate through the columns and plot the data
        for index in indices:
            # Convert the column to timeseries format
            timeseries = read_data(input_file, index)

            # Plot the data
            plt.figure()
            timeseries.plot()
            plt.title('Dimension ' + str(index - 1))

        plt.show()

If you run the code, you will see two figures.

The first figure shows the data in the first dimension, and the second figure shows the data in the second dimension.

## Slicing time-series data

Now that we know how to handle time-series data, let’s see how we can slice it.

The process of slicing refers to dividing the data into various sub-intervals and extracting relevant information.

This is very useful when you are working with time-series datasets.

Instead of using integer indices, we will use timestamps to slice our data.
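To make timestamp-based slicing concrete, here is a small sketch on a synthetic monthly series. The series, its date range, and the use of `.loc` are assumptions made for illustration. Both endpoints of a date-string slice are inclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering January 2000 to December 2009
index = pd.date_range(start='2000-01', periods=120, freq='MS')
series = pd.Series(np.arange(120, dtype=float), index=index)

# Year-level slice: keeps every month of 2003 and 2004
by_year = series.loc['2003':'2004']

# Month-level slice: February 2003 through July 2003
by_month = series.loc['2003-02':'2003-07']

print(len(by_year), len(by_month))
```

Because the slice is expressed in dates, the same code keeps working if rows are added to or removed from the start of the series.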

Create a new Python file and import the following packages:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    from timeseries import read_data

Load the third column (zero-based index 2) from the input data file:

    # Load input data
    index = 2
    data = read_data('data_2D.txt', index)

Define the start and end years, and then plot the data with year-level granularity:

    # Plot data with year-level granularity
    start = '2003'
    end = '2011'
    plt.figure()
    data[start:end].plot()
    plt.title('Input data from ' + start + ' to ' + end)

Define the start and end months, and then plot the data with month-level granularity:

    # Plot data with month-level granularity
    start = '1998-2'
    end = '2006-7'
    plt.figure()
    data[start:end].plot()
    plt.title('Input data from ' + start + ' to ' + end)

    plt.show()

The full code is given in the file slicer.py.

If you run the code, you will see two figures.

The first figure shows the data from 2003 to 2011, and the second figure shows the data from February 1998 to July 2006.

## Operating on time-series data

Pandas allows us to operate on time-series data efficiently and perform various operations like filtering and addition.

You can simply set some conditions and Pandas will filter the dataset and return the right subset.

You can add two time-series variables as well.

This allows us to build various applications quickly without having to reinvent the wheel.
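The following sketch shows both operations on a tiny hand-made dataframe. The column names mirror the ones used later in this section, but the numbers and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical two-dimensional monthly data
index = pd.date_range(start='2000-01', periods=5, freq='MS')
data = pd.DataFrame({'dim1': [10, 50, 40, 60, 20],
                     'dim2': [35, 45, 25, 50, 40]}, index=index)

# Filtering: keep the rows where both conditions hold
filtered = data[(data['dim1'] < 45) & (data['dim2'] > 30)]

# Addition: values are added element-wise, aligned on their timestamps
total = data['dim1'] + data['dim2']

print(filtered)
print(total)
```

Note the parentheses around each condition: `&` binds more tightly than the comparison operators, so omitting them raises an error.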

Create a new Python file and import the following packages:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from timeseries import read_data

Define the input filename:

    # Input filename
    input_file = 'data_2D.txt'

Load the third and fourth columns into separate variables:

    # Load data
    x1 = read_data(input_file, 2)
    x2 = read_data(input_file, 3)

Create a Pandas dataframe object by naming the two dimensions:

    # Create pandas dataframe for slicing
    data = pd.DataFrame({'dim1': x1, 'dim2': x2})

Plot the data by specifying the start and end years:

    # Plot data
    start = '1968'
    end = '1975'
    data[start:end].plot()
    plt.title('Data overlapped on top of each other')

Filter the data using conditions and then display it. In this case, we will take all the rows where the value in dim1 is less than 45 and the value in dim2 is greater than 30:

    # Filtering using conditions
    # - 'dim1' is smaller than a certain threshold
    # - 'dim2' is greater than a certain threshold
    data[(data['dim1'] < 45) & (data['dim2'] > 30)].plot()
    plt.title('dim1 < 45 and dim2 > 30')

We can also add two series in Pandas. Let's add dim1 and dim2 between the given start and end dates:

    # Adding two series
    plt.figure()
    total = data[start:end]['dim1'] + data[start:end]['dim2']
    total.plot()
    plt.title('Summation (dim1 + dim2)')

    plt.show()

The full code is given in the file operator.py.

If you run the code, you will see three figures.

The first figure shows the data from 1968 to 1975, the second shows the filtered data, and the third shows the summation result.

## Extracting statistics from time-series data

In order to extract meaningful insights from time-series data, we have to extract statistics from it.

These statistics can be things like the mean, variance, correlation, maximum value, and so on.

They are computed on a rolling basis: we slide a window of predetermined size along the series and recompute the statistics at each position.

When we visualize the stats over time, we will see interesting patterns.

Let’s see how to extract these stats from time-series data.
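Before working with the real data, a toy example may help show what a rolling computation does. The six values and the window size of 3 below are arbitrary choices for illustration.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Statistics over a sliding window of 3 consecutive values. The first two
# positions have fewer than 3 values available, so pandas reports NaN there.
rolling_mean = series.rolling(window=3).mean()
rolling_max = series.rolling(window=3).max()

print(pd.DataFrame({'value': series,
                    'rolling_mean': rolling_mean,
                    'rolling_max': rolling_max}))
```

A larger window smooths the curve more but also leaves more undefined positions at the start of the series.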

Create a new Python file and import the following packages:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    from timeseries import read_data

Define the input filename:

    # Input filename
    input_file = 'data_2D.txt'

Load the third and fourth columns into separate variables:

    # Load input data in time series format
    x1 = read_data(input_file, 2)
    x2 = read_data(input_file, 3)

Create a Pandas dataframe by naming the two dimensions:

    # Create pandas dataframe for slicing
    data = pd.DataFrame({'dim1': x1, 'dim2': x2})

Extract maximum and minimum values along each dimension:

    # Extract max and min values
    print('\nMaximum values for each dimension:')
    print(data.max())
    print('\nMinimum values for each dimension:')
    print(data.min())

Extract the overall mean and the row-wise mean for the first 12 rows:

    # Extract overall mean and row-wise mean values
    print('\nOverall mean:')
    print(data.mean())
    print('\nRow-wise mean:')
    print(data.mean(axis=1)[:12])

Plot the rolling mean using a window size of 24:

    # Plot the rolling mean using a window size of 24
    data.rolling(center=False, window=24).mean().plot()
    plt.title('Rolling mean')

Print the correlation coefficients:

    # Extract correlation coefficients
    print('\nCorrelation coefficients:\n', data.corr())

Plot the rolling correlation using a window size of 60:

    # Plot rolling correlation using a window size of 60
    plt.figure()
    plt.title('Rolling correlation')
    data['dim1'].rolling(window=60).corr(other=data['dim2']).plot()

    plt.show()

The full code is given in the file stats_extractor.py.

If you run the code, you will see two figures.

The first figure shows the rolling mean, and the second figure shows the rolling correlation. The remaining output is printed on your Terminal. If you scroll down, you will see the row-wise mean values and the correlation coefficients.

The correlation coefficients indicate the level of correlation of each dimension with all the other dimensions.

A correlation of 1.0 indicates perfect correlation, whereas a correlation of 0.0 indicates that the variables are not linearly related to each other.
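The sketch below illustrates these values on hand-made data. The column names and numbers are invented for the example. It also shows that a coefficient of 0.0 only rules out a linear relationship: the `squared` column is fully determined by `x`, yet its correlation with `x` is zero.

```python
import numpy as np
import pandas as pd

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
data = pd.DataFrame({
    'x': x,
    'linear': 2 * x + 1,    # rises in step with x
    'inverse': -3 * x,      # falls as x rises
    'squared': x ** 2,      # depends on x, but not linearly
})

# Pairwise Pearson correlation coefficients between all columns
corr = data.corr()
print(corr.round(2))
```

Reading along the `x` row, the coefficients are 1.0 for `linear`, -1.0 for `inverse`, and 0.0 for `squared`.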

## Generating data using Hidden Markov Models

A Hidden Markov Model (HMM) is a powerful technique for analyzing sequential data.

It assumes that the system being modeled is a Markov process with hidden states.

This means that, at any given time, the underlying system is in one of a set of possible states.

It goes through a sequence of state transitions, thereby producing a sequence of outputs.

We can only observe the outputs but not the states.

Hence these states are hidden from us.

Our goal is to model the data so that we can infer the state transitions of unknown data.
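To make the generative process concrete, here is a toy two-state model written in plain NumPy. Every number in it (start probabilities, transition matrix, emission means and standard deviations) is a made-up assumption, and `sample_hmm` is a hypothetical helper rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state model; all parameters are illustrative
start_prob = np.array([0.6, 0.4])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 0.5])

def sample_hmm(num_steps):
    """Return (hidden_states, observations) drawn from the toy model."""
    states = np.empty(num_steps, dtype=int)
    observations = np.empty(num_steps)
    state = rng.choice(2, p=start_prob)
    for t in range(num_steps):
        states[t] = state
        # The output depends only on the current hidden state
        observations[t] = rng.normal(means[state], stds[state])
        # The next state depends only on the current state (Markov property)
        state = rng.choice(2, p=trans[state])
    return states, observations

states, observations = sample_hmm(10)
print('hidden  :', states)
print('observed:', np.round(observations, 2))
```

In a real problem only the `observed` row is available, and the model is fitted so that the `hidden` row can be inferred from it.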

In order to understand HMMs, let’s consider the example of a salesman who has to travel between the following three cities for his job: London, Barcelona, and New York.

His goal is to minimize the traveling time so that he can be more efficient.

Considering his work commitments and schedule, we have a set of probabilities that dictate the chances of going from city X to city Y.

In the information given below, P(X -> Y) indicates the probability of going from city X to city Y:

- P(London -> London) = 0.10
- P(London -> Barcelona) = 0.70
- P(London -> NY) = 0.20
- P(Barcelona -> Barcelona) = 0.15
- P(Barcelona -> London) = 0.75
- P(Barcelona -> NY) = 0.10
- P(NY -> NY) = 0.05
- P(NY -> London) = 0.60
- P(NY -> Barcelona) = 0.35

Let’s represent this information with a transition matrix, where each row is the current city and each column is the next city:

| | London | Barcelona | NY |
|---|---|---|---|
| London | 0.10 | 0.70 | 0.20 |
| Barcelona | 0.75 | 0.15 | 0.10 |
| NY | 0.60 | 0.35 | 0.05 |

Now that we have all the information, let’s go ahead and set the problem statement.

The salesman starts his journey on Tuesday from London and he has to plan something on Friday.

But that will depend on where he is.

What is the probability that he will be in Barcelona on Friday? This table will help us figure it out.

If we do not have a Markov Chain to model this problem, then we will not know what his travel schedule looks like.

Our goal is to say with a good amount of certainty that he will be in a particular city on a given day.

If we denote the transition matrix by T and the row vector of city probabilities on day i by X(i), then:

$$X(i+1) = X(i) \cdot T$$

In our case, Friday is 3 days away from Tuesday. This means we have to compute X(i+3). The computations look like this:

$$X(i+1) = X(i) \cdot T$$

$$X(i+2) = X(i+1) \cdot T$$

$$X(i+3) = X(i+2) \cdot T$$

So in essence:

$$X(i+3) = X(i) \cdot T^3$$

Because the salesman starts in London on Tuesday, X(i) = [1 0 0]. One step later, the probabilities for Wednesday are simply the London row of T:

$$X(i+1) = [0.10 \quad 0.70 \quad 0.20]$$

The next step is to compute the cube of the matrix. There are many tools available online to perform matrix operations, such as http://matrix.reshish.com/multiplication.php.

If you do all the matrix calculations, you will get the following probabilities for Friday:

- P(London) = 0.31
- P(Barcelona) = 0.53
- P(NY) = 0.16

We can see that there is a higher chance of him being in Barcelona than in any other city.

This makes geographical sense as well because Barcelona is closer to London compared to New York.
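The calculation can also be checked with a few lines of NumPy instead of an online tool. This sketch assumes the salesman is in London with certainty on Tuesday, so the starting vector is [1, 0, 0]. The vector [0.10, 0.70, 0.20] is then the distribution one day later. The variable names are illustrative.

```python
import numpy as np

cities = ['London', 'Barcelona', 'NY']

# Transition matrix: rows are the current city, columns are the next city
T = np.array([[0.10, 0.70, 0.20],
              [0.75, 0.15, 0.10],
              [0.60, 0.35, 0.05]])

# Tuesday: in London with probability 1 (assumption stated above)
x_tuesday = np.array([1.0, 0.0, 0.0])

# Three transitions: Tuesday -> Wednesday -> Thursday -> Friday
x_friday = x_tuesday @ np.linalg.matrix_power(T, 3)

for city, p in zip(cities, x_friday):
    print(f'P({city}) = {p:.2f}')
```

This reproduces the 0.31, 0.53, and 0.16 figures quoted above, with Barcelona as the most likely city.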

Let’s see how to model HMMs in Python.

Create a new Python file and import the following packages:

    import datetime

    import numpy as np
    import matplotlib.pyplot as plt
    from hmmlearn.hmm import GaussianHMM

    from timeseries import read_data

Load data from the input file:

    # Load input data
    data = np.loadtxt('[…].txt', delimiter=',')

Extract the third column for training:

    # Extract the data column (third column) for training
    X = np.column_stack([data[:, 2]])

Create a Gaussian HMM with 5 components and diagonal covariance:

    # Create a Gaussian HMM
    num_components = 5
    hmm = GaussianHMM(n_components=num_components,
            covariance_type='diag', n_iter=1000)

Train the HMM:

    # Train the HMM
    print('\nTraining the Hidden Markov Model...')
    hmm.fit(X)

Print the mean and variance values for each component of the HMM:

    # Print HMM stats
    # (round() needs a scalar, so take the single element of each array)
    print('\nMeans and variances:')
    for i in range(hmm.n_components):
        print('\nHidden state', i+1)
        print('Mean =', round(hmm.means_[i][0], 2))
        print('Variance =', round(np.diag(hmm.covars_[i])[0], 2))

Generate 1200 samples using the trained HMM model and plot them:

    # Generate data using the HMM model
    num_samples = 1200
    generated_data, _ = hmm.sample(num_samples)
    plt.plot(np.arange(num_samples), generated_data[:, 0], c='black')
    plt.title('Generated data')

    plt.show()

The full code is given in the file hmm.py.

If you run the code, you will see a figure that shows the 1200 generated samples, and the means and variances of each hidden state will be printed on your Terminal.