Building Token Recommender in Google Cloud Platform

Evgeny Medvedev · Jan 14

In this article I will guide you through the process of creating an ERC20 token recommendation system built with TensorFlow, Cloud Machine Learning Engine, Cloud Endpoints, and App Engine.

The solution is based on the tutorial article by Google.

The data used for training the recommendation system is taken from the public Ethereum dataset in BigQuery.

The goal of this article is just to go end-to-end, all the way to deployment.

Later we’ll share another post where we improve the quality of recommendations and demo it in a web app.

The article is broken down into the following parts:

- Intro to collaborative filtering for recommendation systems.
- Creating and training the model for the token recommendation system.
- Tuning hyperparameters in Cloud ML Engine.
- Deploying the recommendation system to Cloud Endpoints and App Engine.

Intro to collaborative filtering for recommendation systems

The collaborative filtering technique is a powerful method for generating user recommendations.

Collaborative filtering relies only on observed user behavior to make recommendations — no profile data or content access is necessary.

The technique is based on the following observations:

- Users who interact with items in a similar manner (for example, buying the same tokens or viewing the same articles) share one or more hidden preferences.
- Users with shared preferences are likely to respond in the same way to the same items.

The collaborative filtering problem can be solved using matrix factorization.

Suppose you have a matrix consisting of user IDs and their interactions with your products.

Each row corresponds to a unique user, and each column corresponds to an item.

The item could be a product in a catalog, an article, or a token.

Each entry in the matrix captures a user’s rating or preference for a single item.

The rating could be explicit, directly generated by user feedback, or it could be implicit, based on user purchases or the number of interactions with an article or a token.

[Figure: Ratings matrix]

The matrix factorization method assumes that there is a set of attributes common to all items, with items differing in the degree to which they express these attributes.

Furthermore, the matrix factorization method assumes that each user has their own expression for each of these attributes, independent of the items.

In this way, a user’s item rating can be approximated by summing the user’s strength for each attribute weighted by the degree to which the item expresses this attribute.

These attributes are sometimes called hidden or latent factors.

To translate the existence of latent factors into the matrix of ratings, you do this: for a set of users U of size u and items I of size i, you pick an arbitrary number k of latent factors and factorize the large matrix R into two much smaller matrices X (the “row factor”) and Y (the “column factor”).

Matrix X has dimension u × k, and Y has dimension k × i.

[Figure: Approximating the ratings matrix with row and column factors]

To calculate the rating of user u for item i, you take the dot product of the user's row in X and the item's column in Y.

The loss function can be defined as the root-mean-square error (RMSE) between the actual ratings and the ratings calculated from the latent factors.
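To make the factorization concrete, here is a small numpy sketch (illustrative only, not the project's code; shapes follow the definitions above):

```python
import numpy as np

u, i, k = 4, 6, 2               # users, items, latent factors
rng = np.random.default_rng(0)
X = rng.random((u, k))          # row factors: one k-vector per user
Y = rng.random((k, i))          # column factors: one k-vector per item

R_approx = X @ Y                # approximation of the full ratings matrix
r = X[1] @ Y[:, 3]              # rating of user 1 for item 3 is a dot product

R = rng.random((u, i))          # stand-in for the observed ratings
rmse = np.sqrt(np.mean((R - R_approx) ** 2))
```

In practice the RMSE is evaluated only over the observed entries of the ratings matrix, since most of the matrix is empty.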

For our token recommender we will use the percentage of a token's supply that a user holds as the implicit rating in the user ratings matrix.
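For example, an address holding 5,000 units of a token whose total supply is 1,000,000 would get a rating of 5,000 / 1,000,000 × 100 = 0.5.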

Creating and training the model for the token recommendation system

Check out the code and install the dependencies:

```bash
wget https://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh
bash Miniconda2-latest-MacOSX-x86_64.sh
git clone https://github.com/blockchain-etl/token-recommender
cd token-recommender
conda create -n token_recommender
conda install -n token_recommender --file conda.txt
source activate token_recommender
pip install -r requirements.txt
pip install tensorflow==1.4.1
```

Query token ratings from BigQuery: run the following query in BigQuery and export the results to a GCS bucket, e.g. gs://your_bucket/data/token_balances.csv:

```sql
#standardSQL
with top_tokens as (
    select token_address, count(1) as transfer_count
    from `bigquery-public-data.ethereum_blockchain.token_transfers` as token_transfers
    group by token_address
    order by transfer_count desc
    limit 1000
),
token_balances as (
    with double_entry_book as (
        select token_address, to_address as address,
            cast(value as float64) as value, block_timestamp
        from `bigquery-public-data.ethereum_blockchain.token_transfers`
        union all
        select token_address, from_address as address,
            -cast(value as float64) as value, block_timestamp
        from `bigquery-public-data.ethereum_blockchain.token_transfers`
    )
    select double_entry_book.token_address, address, sum(value) as balance
    from double_entry_book
    join top_tokens on top_tokens.token_address = double_entry_book.token_address
    where address != '0x0000000000000000000000000000000000000000'
    group by token_address, address
    having balance > 0
),
token_supplies as (
    select token_address, sum(balance) as supply
    from token_balances
    group by token_address
)
select token_balances.token_address,
    token_balances.address as user_address,
    balance/supply * 100 as rating
from token_balances
join token_supplies on token_supplies.token_address = token_balances.token_address
where balance/supply * 100 > 0.001
```

The above SQL queries the top 1000 tokens by transfer count, calculates each address's balance for each token, and outputs (token_address, user_address, rating) triples. The rating is calculated as the percentage of the token's supply held by the user. The filter — where balance/supply * 100 > 0.001 — prevents airdrops from appearing in the result.

Understand the code structure

The model code is contained in the wals_ml_engine directory. The code's high-level functionality is implemented by the following files:

- mltrain.sh — launches various types of Cloud Machine Learning Engine jobs.
- task.py — parses the arguments for the Cloud Machine Learning Engine job and executes training.
- model.py — loads the dataset; creates two sparse matrices from the data, one for training and one for testing; executes WALS on the training sparse matrix of ratings.
- wals.py — creates the WALS model; executes the WALS algorithm; calculates the root-mean-square error (RMSE) for a set of row/column factors and a ratings matrix.

The CSV file is loaded in the model.py file:

```python
headers = ['token_address', 'user_address', 'rating']
balances_df = pd.read_csv(input_file,
                          sep=',',
                          names=headers,
                          header=0,
                          dtype={
                              'token_address': np.str,
                              'user_address': np.str,
                              'rating': np.float32,
                          })
```

Then the following arrays are created:

- an array of user addresses,
- an array of token addresses,
- an array of triplets with 0-based user and token indexes and the corresponding ratings (sketched below).
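A minimal sketch of how these arrays and index mappings can be derived (assuming the balances_df loaded above; the actual model.py code may differ in details):

```python
import numpy as np

# sorted arrays of unique addresses; array positions act as 0-based indexes
user_map = np.unique(balances_df['user_address'].values)
item_map = np.unique(balances_df['token_address'].values)

# np.unique returns sorted arrays, so binary search recovers each row's index
user_idx = np.searchsorted(user_map, balances_df['user_address'].values)
item_idx = np.searchsorted(item_map, balances_df['token_address'].values)

# (user_idx, token_idx, rating) triplets for the train/test split
ratings = list(zip(user_idx, item_idx,
                   balances_df['rating'].values.astype(np.float32)))
```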

These triplets are then randomly split into test and train datasets and converted to sparse matrices:

```python
test_set_size = len(ratings) / TEST_SET_RATIO
test_set_idx = np.random.choice(xrange(len(ratings)),
                                size=test_set_size, replace=False)
test_set_idx = sorted(test_set_idx)

# sift ratings into train and test sets
ts_ratings = ratings[test_set_idx]
tr_ratings = np.delete(ratings, test_set_idx, axis=0)

# create training and test matrices as coo_matrix's
u_tr, i_tr, r_tr = zip(*tr_ratings)
tr_sparse = coo_matrix((r_tr, (u_tr, i_tr)), shape=(n_users, n_items))

u_ts, i_ts, r_ts = zip(*ts_ratings)
test_sparse = coo_matrix((r_ts, (u_ts, i_ts)), shape=(n_users, n_items))
```

The WALS model is created in the wals_model method in wals.py, and the factorization is done in the simple_train method in the same file. The results are the row and column factors in numpy format.
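For intuition, here is what one round of alternating least squares looks like in plain numpy. This is a toy, unweighted version: the actual algorithm in wals.py is the weighted variant (WALS) built on TensorFlow, which also applies per-entry and unobserved-entry weights.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Solve for one side's factors, holding the other side fixed.

    R: (n, m) dense ratings; fixed: (m, k) factors; returns (n, k) factors.
    """
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)       # (k, k) normal equations
    return np.linalg.solve(A, fixed.T @ R.T).T  # (n, k)

rng = np.random.default_rng(42)
R = rng.random((8, 5))       # toy dense ratings matrix
X = rng.random((8, 3))       # user (row) factors
Y = rng.random((5, 3))       # item (column) factors

for _ in range(10):                # num_iters alternations
    X = als_step(R, Y, reg=0.1)    # update users with items fixed
    Y = als_step(R.T, X, reg=0.1)  # update items with users fixed

rmse = np.sqrt(np.mean((R - X @ Y.T) ** 2))
```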

Train the model locally and in Google ML Engine

To train the model locally, run the following command, specifying the path to the CSV file exported in the previous step:

```bash
./mltrain.sh local gs://your_bucket/data/token_balances.csv
```

The output should look like the following:

```
INFO:tensorflow:Train Start: 2019-01-10 23:22:06
INFO:tensorflow:Train Finish: 2019-01-10 23:22:12
INFO:tensorflow:train RMSE = 0.76
INFO:tensorflow:test RMSE = 0.95
```

The RMSE corresponds to the average error in the predicted ratings compared to the test set. On average, each rating produced by the algorithm is within ±0.95 percentage points of the actual user rating in the test set.
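A sketch of how that test RMSE can be computed from the learned factors and the sparse test matrix, consistent with the snippets above (not verbatim wals.py):

```python
import numpy as np
from scipy.sparse import coo_matrix

def sparse_rmse(row_factor, col_factor, sparse_ratings):
    """RMSE over only the observed entries of a coo_matrix of ratings."""
    preds = np.sum(row_factor[sparse_ratings.row] *
                   col_factor[sparse_ratings.col], axis=1)
    return np.sqrt(np.mean((sparse_ratings.data - preds) ** 2))

# toy usage with random factors and a tiny 3-entry test matrix
rng = np.random.default_rng(1)
row_factor, col_factor = rng.random((10, 4)), rng.random((20, 4))
test_sparse = coo_matrix(([1.0, 0.5, 2.0], ([0, 3, 7], [2, 5, 19])),
                         shape=(10, 20))
print(sparse_rmse(row_factor, col_factor, test_sparse))
```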

The WALS algorithm performs better with tuned hyperparameters, as shown in the following section.

To run it in Cloud ML Engine:

```bash
./mltrain.sh train gs://your_bucket/data/token_balances.csv
```

You can monitor the status and output of the job on the Jobs page of the ML Engine section of the GCP Console.

Click Logs to view the job output.

After factorization, the factor matrices are saved in four separate files in numpy format so they can be used to perform recommendations:

- user.npy — an array of user addresses used for mapping user indexes to user addresses.
- item.npy — an array of token addresses used for mapping token indexes to token addresses.
- row.npy — user latent factors.
- col.npy — token latent factors.

When training locally, you can find these files under the wals_ml_engine/jobs directory.

To test out the recommendations, use the following code:

```python
import numpy as np
from model import generate_recommendations

user_address = '0x8c373ed467f3eabefd8633b52f4e1b2df00c9fe8'
already_rated = [
    '0x006bea43baa3f7a6f765f14f10a1a1b08334ef45',
    '0x5102791ca02fc3595398400bfe0e33d7b6c82267',
    '0x68d57c9a1c35f63e2c83ee8e49a64e9d70528d25',
    '0xc528c28fec0a90c083328bc45f587ee215760a0f',
]
k = 5
model_dir = './jobs/wals_ml_local_20190107_235006'

user_map = np.load(model_dir + "/model/user.npy")
item_map = np.load(model_dir + "/model/item.npy")
row_factor = np.load(model_dir + "/model/row.npy")
col_factor = np.load(model_dir + "/model/col.npy")

user_idx = np.searchsorted(user_map, user_address)
user_rated = [np.searchsorted(item_map, i) for i in already_rated]

recommendations = generate_recommendations(user_idx, user_rated,
                                           row_factor, col_factor, k)
tokens = [item_map[i] for i in recommendations]
print(tokens)
```

Tuning hyperparameters in Cloud ML Engine

You can find the configuration file for hyperparameter tuning here. The tunable parameters are:

- latent_factors — the number of latent factors (min: 5, max: 50).
- regularization — L2 regularization constant (min: 0.001, max: 10.0).
- unobs_weight — unobserved weight (min: 0.001, max: 5.0).
- feature_wt_exp — feature weight exponent (min: 0.0001, max: 10).
- num_iters — the number of alternating least squares iterations (min: 10, max: 20).

To tune the hyperparameters, first change the BUCKET variable in mltrain.sh to your bucket. Then run the following command:

```bash
./mltrain.sh tune gs://your_bucket/data/token_balances.csv
```

You can see the progress of tuning in the Cloud ML Engine console.

The results of hyperparameter tuning are stored in the Cloud ML Engine job data, which you can access on the Jobs page. The job results include the best value of the summary metric (the RMSE) across all trials.
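If you prefer to inspect the trials programmatically, the Cloud ML Engine v1 REST API exposes them. A hedged sketch using the Google API client (field names follow the v1 jobs.get schema as I understand it; 'your-project-id' and 'your_job_id' are placeholders):

```python
from googleapiclient import discovery

# fetch the tuning job, including per-trial metrics and hyperparameters
ml = discovery.build('ml', 'v1')
name = 'projects/{}/jobs/{}'.format('your-project-id', 'your_job_id')
job = ml.projects().jobs().get(name=name).execute()

for trial in job['trainingOutput']['trials']:
    print(trial['trialId'],
          trial.get('finalMetric', {}).get('objectiveValue'),
          trial.get('hyperparameters'))
```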

Below are the best parameters from my tuning, which you can also find in the repository:

- latent_factors — 22
- regularization — 0.12
- unobs_weight — 0.001
- feature_wt_exp — 9.43
- num_iters — 20

The error, though, is only slightly lower compared to the default parameters:

```
INFO:tensorflow:train RMSE = 0.97
INFO:tensorflow:test RMSE = 0.87
```

Deploying the recommendation system to Google App Engine

You can find the REST API definition in Swagger format for serving token recommendations in the repository: openapi.yaml.

The implementation of the API for App Engine is in the main.py file.
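For orientation, here is a minimal Flask sketch of what such an endpoint can look like (illustrative only; the repository's main.py is the authoritative implementation, and get_recommendations here is a hypothetical stand-in for the model-backed lookup):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_recommendations(user_address, num_recs):
    # stand-in: the real app maps the address to an index and scores tokens
    # with the saved row/column factors, as in the snippet shown earlier
    return ['0x0000000000000000000000000000000000000000'] * num_recs

@app.route('/recommendation')
def recommendation():
    user_address = request.args.get('user_address')
    num_recs = int(request.args.get('num_recs', 5))
    return jsonify({'token_addresses': get_recommendations(user_address, num_recs)})

if __name__ == '__main__':
    app.run()
```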

First, prepare to deploy the API endpoint service:

```bash
cd scripts
./prepare_deploy_api.sh
```

The output of this command should look like the following:

```
To deploy: gcloud endpoints services deploy /var/folders/t0/y38g0z2s6jqcnwp8452j026h0000gp/T/tmp.fIelYqSh8B.yaml
```

Run the provided command:

```bash
gcloud endpoints services deploy /var/folders/t0/y38g0z2s6jqcnwp8452j026h0000gp/T/tmp.fIelYqSh8B.yaml
```

Create the bucket where the app will read the model from:

```bash
export BUCKET=gs://recserve_$(gcloud config get-value project 2> /dev/null)
gsutil mb ${BUCKET}
```

Upload the token_balances.csv file to the bucket:

```bash
gsutil cp ./data/token_balances.csv ${BUCKET}/data/
```

Train the model and upload the model files to the bucket:

```bash
./mltrain.sh local ${BUCKET}/data/token_balances.csv --use-optimized --output-dir ${BUCKET}
```

Create an App Engine application:

```bash
gcloud app create --region=us-central1
gcloud app update --no-split-health-checks
```

Prepare to deploy the App Engine application:

```bash
./prepare_deploy_app.sh
```

The following output appears:

```
To deploy: gcloud -q app deploy ./app/app_template.yaml_deploy.yaml
```

Run the provided command:

```bash
gcloud -q app deploy ./app/app_template.yaml_deploy.yaml
```

After the app is deployed, you will be able to query the API at https://${project_id}.appspot.com/recommendation?user_address=0x8c373ed467f3eabefd8633b52f4e1b2df00c9fe8&num_recs=5 (replace ${project_id} with your value):

```json
{
  "token_addresses": [
    "0x8ae4bf2c33a8e667de34b54938b0ccd03eb8cc06",
    "0x226bb599a12c826476e3a771454697ea52e9e220",
    "0xcbcc0f036ed4788f63fc0fee32873d6a7487b908",
    "0xf7b098298f7c69fc14610bf71d5e02c60792894c",
    "0xc86d054809623432210c107af2e3f619dcfbf652"
  ]
}
```

Next steps

- Use USD value instead of fraction of supply for ratings.
- Split the dataset based on the time a token was purchased instead of splitting randomly.
- Visualize token latent factors in 3-dimensional space.
- Compare the recommendation system to a benchmark.
- Try other error measures.
- Deploy to Cloud Composer to update the model daily.
- Deploy a simple demo web app.
