How to Learn from Big Data Files on Low Memory — Incremental Learning

Some boosting libraries, like XGBoost and LightGBM, provide a way to learn incrementally to tackle big data. Here we will look into the incremental solutions provided by some of these boosting algorithms.

Data Exploration

First we will get a feel of what our data looks like by looking at its first few rows, using the command:

```python
import pandas as pd

part = pd.read_csv("train.csv.zip", nrows=10)
part.head()
```

By this you will have basic info on how the different columns are structured, how to process each column, etc. Based on that we will build a dictionary, with the columns we want to keep as keys and the methods to apply to them as values, which will hold all corresponding columns.

Now, to explore the data we will go column by column, like this:

```python
import json
import numpy as np
from pandas import json_normalize

# For dictionary (JSON) columns you can do:
# 'idx' is the index of the corresponding column in the DataFrame;
# you can find it by using np.where(col == part.columns).
for col in dictionary_columns:
    idx = np.where(col == part.columns)[0][0]
    df = pd.read_csv("train.csv.zip", usecols=[idx],
                     converters={col: json.loads})
    column_as_df = json_normalize(df[col])
    # …
```

Preprocessing

For preprocessing the data we will make use of the dictionary we made earlier, which has info on which columns we want to keep (as keys) and what methods to apply to each column (as values), to build a preprocess method. This method will be called for each batch of data during the incremental learning process. One thing to notice here is that we fitted transformers (like LabelEncoders, Scalers, etc.) on the whole of each column during exploration, and here we only use them to transform the data at every incremental step (a minimal sketch of such a method is shown near the end of this post).

Incremental Learning

To read a data file incrementally using pandas, you have to use the parameter chunksize, which specifies the number of rows to read/write at a time.

```python
# Number of lines to read per chunk.
incremental_dataframe = pd.read_csv("train.csv", chunksize=100000)

# This call returns a sequential file reader (TextFileReader)
# which reads 'chunksize' lines every time it is iterated.
# To read the file from the start again, you will have to
# call pd.read_csv again.
```

Then you can train on your data incrementally using XGBoost¹ or LightGBM. For LightGBM you have to pass the argument keep_training_booster=True to its .train method, and for XGBoost you have to set three parameters in the params dict passed to its .train method.

```python
lgb_params = {
    'objective': 'regression',
    'verbosity': 100,
}

# The first three parameters are for incremental learning:
xgb_params = {
    'updater': 'refresh',
    'process_type': 'update',
    'refresh_leaf': True,
    'silent': False,
}
```

On each step we will save our estimator and then pass it as an argument during the next step.

```python
import gc
import lightgbm as lgb
import xgboost as xgb

# For saving the regressors for the next iteration.
lgb_estimator = None
xgb_estimator = None

for df in incremental_dataframe:
    df = preprocess(df)
    xtrain, ytrain, xvalid, yvalid = ...  # Split the data as you like

    lgb_estimator = lgb.train(lgb_params,
                              # Pass the partially trained model:
                              init_model=lgb_estimator,
                              train_set=lgb.Dataset(xtrain, ytrain),
                              valid_sets=[lgb.Dataset(xvalid, yvalid)],
                              # Necessary for incremental learning:
                              keep_training_booster=True,
                              num_boost_round=10)

    xgb_estimator = xgb.train(xgb_params,
                              dtrain=xgb.DMatrix(xtrain, ytrain),
                              evals=[(xgb.DMatrix(xvalid, yvalid), "Valid")],
                              # Pass the partially trained model:
                              xgb_model=xgb_estimator)

    del df, xtrain, ytrain, xvalid, yvalid
    gc.collect()
```

CatBoost's incremental learning method is still a work in progress.²

To speed things up a bit more, if your chunks are still sufficiently big, you can parallelize your preprocessing method using Python's multiprocessing library, like this:

```python
from multiprocessing import Pool

n_jobs = 4

for df in incremental_dataframe:
    p = Pool(n_jobs)
    f_ = p.map(preprocess, np.array_split(df, n_jobs))
    f_ = pd.concat(f_, axis=0, ignore_index=True)
    p.close()
    p.join()
    # And then your model training …
```

For an introduction to parallel programming in Python, read my post here.
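The exact preprocess method depends on your dataset, so here is only a minimal sketch of the idea: the column names ('country', 'price') and the columns_to_keep dictionary are made-up placeholders, but the pattern is the one described in the Preprocessing section — transformers are fitted once on whole columns (read one at a time to keep memory low), and each incoming chunk is only transformed, never used for fitting.

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Sketch only: 'country', 'price' and columns_to_keep are placeholders,
# not columns from the actual dataset.

# Fit transformers on whole columns during exploration,
# reading one column at a time to keep memory usage low.
country_col = pd.read_csv("train.csv.zip", usecols=["country"])["country"]
country_encoder = LabelEncoder().fit(country_col.fillna("missing"))

price_col = pd.read_csv("train.csv.zip", usecols=["price"])[["price"]]
price_scaler = StandardScaler().fit(price_col.fillna(0))

# Dictionary of columns to keep (keys) and the method to apply to each (values).
columns_to_keep = {
    "country": lambda s: country_encoder.transform(s.fillna("missing")),
    "price": lambda s: price_scaler.transform(
        s.fillna(0).values.reshape(-1, 1)).ravel(),
}

def preprocess(chunk):
    """Transform one chunk of data with the already-fitted transformers."""
    out = pd.DataFrame(index=chunk.index)
    for col, transform in columns_to_keep.items():
        out[col] = transform(chunk[col])
    # In practice you would also carry the target column through here.
    return out
```

Because all the fitting happened during exploration, calling preprocess on each chunk inside the incremental loop stays cheap and memory-friendly.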
More details
