Simple Soybean Price Regression with Random Forests

Matthew Arthur · Jan 22

image credit: Pexels.com

Applying cutting-edge machine learning to commodity prices.

As a student in the fast.ai Machine Learning for Coders MOOC¹ with an interest in agriculture, the first application of the fast.ai random forest regression library that came to mind was predicting soybean prices from historical data. Soybeans are a global commodity, and their price per bushel has varied a great deal over the past decade. Individual commodity price histories are available online for free as simple structured tabular data, making this a straightforward beginning topic.

Here’s the code. Note we are using Python 3 and fast.ai 0.7, so follow the installation instructions.² First, we need to import our packages: fast.ai, quandl, pandas, and sklearn, the usual data science libraries.

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import quandl
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
```

Data came from the free Quandl TFGRAIN/SOYBEANS dataset.

Quandl has a simple Python SDK for their API; see their getting-started instructions.³ Pulling down the entire dataset is one (short) line:

```python
data = quandl.get("TFGRAIN/SOYBEANS", authtoken="<your token>")
```

This returns a Pandas dataframe (which we call data), and data.info() shows there are 4535 rows of data indexed by the event Datetime.

We can use data.info() to show the format:

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4535 entries, 2000-12-01 to 2019-01-14
Data columns (total 4 columns):
Cash Price    4535 non-null float64
Basis         4535 non-null float64
Fall Price    4516 non-null float64
Fall Basis    4516 non-null float64
dtypes: float64(4)
memory usage: 177.1 KB
```

data.head() shows the first few rows. As we have datetimes, this is a great chance to take advantage of the fast.ai library’s date-handling feature engineering capabilities.

I renamed the index ‘Date’ and made a new column with the same values so it could be processed by the add_datepart() function:

```python
data = data.rename_axis('Date')  # rename_axis returns a copy, so assign it back
data['Date'] = data.index        # duplicate the index into a regular column
add_datepart(data, 'Date')       # expand 'Date' into Year, Month, Week, etc.
data.info()
```

The new columns are below. Jeremy Howard explains in depth how these new columns are created, and why, in ML1 Lesson One.

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4535 entries, 2000-12-01 to 2019-01-14
Data columns (total 18 columns):
id                  4535 non-null int64
Cash Price          4535 non-null float64
Basis               4535 non-null float64
Fall Price          4516 non-null float64
Fall Basis          4516 non-null float64
Year                4535 non-null int64
Month               4535 non-null int64
Week                4535 non-null int64
Day                 4535 non-null int64
Dayofweek           4535 non-null int64
Dayofyear           4535 non-null int64
Is_month_end        4535 non-null bool
Is_month_start      4535 non-null bool
Is_quarter_end      4535 non-null bool
Is_quarter_start    4535 non-null bool
Is_year_end         4535 non-null bool
Is_year_start       4535 non-null bool
Elapsed             4535 non-null int64
dtypes: bool(6), float64(4), int64(8)
```

For my first cut at soybean price regression I am dropping a number of the columns for simplicity.

I’m keeping only the columns listed in the col_list variable:

```python
col_list = ['Cash Price', 'Basis', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear']
dfdata = data[col_list]  # keep only the selected columns
```

And to clean up the formatting I pull the values from the Pandas dataframe and reimport:

```python
df = dfdata.values        # drop down to a plain NumPy array
df = pd.DataFrame(df)     # rebuild a fresh dataframe
df.columns = col_list     # restore the column names
```

Now that we again have a Pandas dataframe, I use another handy fast.ai function which (as the documentation explains) changes columns of strings to columns of categorical values, and does so in place.

Why? To allow a random forest regressor to make sense of tabular data.

Again, this is covered in more depth in the course.
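train_cats() is fast.ai's implementation of this idea; in plain pandas the same string-to-categorical conversion looks roughly like the sketch below (the column name and values are invented purely for illustration):

```python
import pandas as pd

# toy frame with a string column, standing in for any text columns in df
toy = pd.DataFrame({'elevator': ['ames', 'boone', 'ames', 'carroll']})

# strings -> pandas Categorical; a forest can then split on the integer codes
toy['elevator'] = toy['elevator'].astype('category')
print(toy['elevator'].cat.codes.tolist())  # [0, 1, 0, 2]
```

The codes are assigned in sorted category order, which is why 'ames' maps to 0.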

```python
train_cats(df)
```

proc_df() is another fast.ai pre-processor which (as the documentation notes) takes a data frame, splits off the response variable, and changes the df into an entirely numeric dataframe. Our dependent (‘y’) variable is the daily cash price of soybeans at this grain elevator, and everything else is an independent variable.

```python
df, y, nas = proc_df(df, 'Cash Price')
```

And now we’re ready to fit our data! It’s as easy as:

```python
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)
```

Score: 0.9991621993655437

Pretty good! That score is sklearn’s R², roughly 99.92%, measured on the training data, so now we can throw a test dataframe (with the same format) at the regressor and get an actual prediction.
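One caveat worth keeping in mind (not covered in the post): a score computed on the rows the forest was fit on is optimistic, and because this is a time series, a fairer check holds out the most recent rows as a validation set. A minimal sketch with synthetic stand-in data, whose shapes (not values) mirror the df/y arrays above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# hypothetical stand-in data shaped like the post's df/y (4535 rows, 7 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(4535, 7))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=4535)

# time-ordered split: hold out the most recent 20% of rows, no shuffling
n_valid = len(X) // 5
X_train, X_valid = X[:-n_valid], X[-n_valid:]
y_train, y_valid = y[:-n_valid], y[-n_valid:]

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(X_train, y_train)
print(m.score(X_train, y_train))  # optimistic: measured on rows the forest saw
print(m.score(X_valid, y_valid))  # honest: measured on held-out recent rows
```

The gap between the two numbers is a quick read on how much the training-set score flatters the model.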

Here’s our single-row test dataframe:

```python
df1 = pd.DataFrame({'Basis': [-.85], 'Year': [2019], 'Month': [1], 'Week': [4],
                    'Day': [25], 'Dayofweek': [6], 'Dayofyear': [25]})
```

And to get a single prediction for the price on 25 January 2019 given a basis of 85 cents:

```python
train_cats(df1)
m.predict(df1)
```

Return: array([8.465])

A bushel of soybeans is estimated to sell for $8.465.
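Hand-typing the date parts invites mistakes; pandas can derive them from a single Timestamp. A sketch (not from the original post; the Basis value just mirrors the test frame above). Note that pandas numbers Monday as 0, so 25 January 2019, a Friday, comes out as Dayofweek 4:

```python
import pandas as pd

# derive the date parts for 25 Jan 2019 instead of typing them by hand
ts = pd.Timestamp('2019-01-25')
df1 = pd.DataFrame({'Basis': [-.85],
                    'Year': [ts.year],
                    'Month': [ts.month],
                    'Week': [ts.isocalendar()[1]],  # ISO week number
                    'Day': [ts.day],
                    'Dayofweek': [ts.dayofweek],    # Monday=0, so Friday is 4
                    'Dayofyear': [ts.dayofyear]})
print(df1.iloc[0].tolist())
```

Building the row this way keeps the features consistent with what add_datepart() produces on the training side.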


For fun, I made a plot of how changes to the basis impact the regression for a single day (25 Jan 2019). The code:

```python
df1.iloc[0:, 0] = -.95          # reset the Basis column
iterpreds = list()
a = list()
for x in range(0, 10):
    df1.iloc[0, 0] += .1        # step the basis up by a dime
    b = df1.at[0, 'Basis']
    iterpreds.append(m.predict(df1).item(0))
    a.append(b)
plt.plot(a, iterpreds)
```

(basis on x axis, price on y)

Much more can be done! We should expand our model to include all available columns, see if there is a second commodity we can add to the model to explore whether other crop prices move in sync, and plot the impact of changes to other variables over time.
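Before expanding to all available columns, the regressor's feature_importances_ attribute can show which inputs the forest actually leans on. A sketch with synthetic stand-in data and illustrative column names (not the post's real features or real importances):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# hypothetical stand-in features; in the article these would be the columns
# of df after proc_df, with the names from col_list
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
names = ['Basis', 'Year', 'Month', 'Week']  # illustrative names only

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(X, y)

# rank features by mean impurity reduction across the forest
ranked = sorted(zip(names, m.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f'{name:6s} {imp:.3f}')
```

Columns that contribute almost nothing can then be dropped before the next training run.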

References:

[1] https://course.fast.ai/ml.html

[2] https://forums.fast.ai/t/fastai-v0-7-install-issues-thread/24652

[3] https://docs.quandl.com/docs/getting-started

Additional code is on my github: www.github.com/matthewarthur and my LinkedIn is https://www.linkedin.com/….
