Tidying Up Pandas

If you want to apply the function row by row, you'll have to couple `rowwise` with `mutate`:

```r
# R
# my_function does not take vectorised input of the entire column,
# so this will fail
iris %>%
  mutate(new_column = my_function(sepal.width, sepal.length))

# this works
iris %>%
  rowwise %>%
  mutate(new_column = my_function(sepal.width, sepal.length))
```

To achieve the same using the `.assign` method, you should nest an `apply` inside the assigned function:

```python
# Python
import re

def do_something_string(col):
    # set_trace()
    if re.search(r".*(osa)$", col):
        value = "is_setosa"
    else:
        value = "not_setosa"
    return value

iris = iris.assign(
    transformed_species=lambda df: df["species"].apply(do_something_string)
)
```

If you're lazy, you could just chain two anonymous functions together:

```python
# Python
iris = iris.assign(
    transformed_species=lambda df: df.species.apply(do_something_string)
)
```

## Apply

From R's `apply` help docs:

```r
apply(X, MARGIN, FUN, ...)
```

where `MARGIN` takes either 1 (rows) or 2 (columns), i.e. if you want to apply the function to each row, you set `MARGIN` to 1:

```r
# R
df %>% apply(1, function(row){ ... do some compute ... })
```

I rarely do that in R nowadays, since `plyr` and later `dplyr` came along. However, there is no `plyr` in pandas, so for row-wise operations we have to go back to using `apply`; confusingly, the axis is now 1, not 0. In pandas, `axis` refers to which labels will be used as the index of the applied function's input: axis 0 refers to the DataFrame's index and axis 1 refers to its columns. I initially found this very confusing. The reason is that each "row" is really just a `pandas.Series` whose index is the parent `pandas.DataFrame`'s columns. Thus, the `axis` argument is saying which axis to set as the index of the function's input:

```python
# Python
iris.apply(lambda row: do_something(row), axis=1)
```

Interestingly, a pattern which I do not use in R is to `apply` over a single column, in this case a `pandas.Series` object:

```python
# Python
iris.sepal_width.apply(lambda x: x**2)

# if you want a fancy progress bar, use tqdm's progress_apply
from tqdm import tqdm
tqdm.pandas()
iris.sepal_width.progress_apply(lambda x: x**2)

# if you need a parallel apply (this works with dask underneath)
import swifter
iris.sepal_width.swifter.apply(lambda x: x**2)
```

In R, one of the common idioms which I keep going back to for a parallel version of `group_by` is as follows:

```r
# R
unique_list %>%
  lapply(function(x){
    df %>%
      filter(col == x) %>%
      do_something() # do something to the subset
  }) %>%
  do.call(rbind, .)
```

If you want a parallel version, you just change the `lapply` to `mclapply`, from R's parallel/snow library:

```r
# R
ncores = 10 # the number of cores
unique_list %>%
  mclapply(function(x){
    df %>%
      filter(col == x) %>%
      do_something() # do something to the subset
  }, mc.cores = ncores) %>%
  do.call(rbind, .)
```

In Python, you can achieve the same with dask: split the whole table into partitions and do the manipulations in parallel:

```python
# Python (dask)
import dask.dataframe as dd

(
    dd.from_pandas(my_df, npartitions=nCores)
      .map_partitions(
          lambda df: df.apply(lambda x: nearest_street(x.lat, x.lon), axis=1))
      .compute(scheduler="processes")
)
```

Or use an even higher-level wrapper, the swifter module:

```python
# Python
# you can easily parallelise the example by adding the `swifter`
# accessor before `.apply`
series.swifter.apply(lambda x: x**2)
```

## Group by

The `.groupby` method in pandas is equivalent to the R function `dplyr::group_by`, returning a `DataFrameGroupBy` object. Tidyverse has the `ungroup` function to ungroup grouped DataFrames; in pandas there does not exist a 1-to-1 mappable function. One workaround is to complete the `groupby` -> `apply` two-step process, feeding `apply` an identity function: `apply(lambda x: x)`.

## Summarise

In pandas the equivalent of the `summarise` function is `aggregate`, abbreviated as the `agg` function. You have to couple it with `groupby`, so it's again a two-step `groupby` -> `agg` transformation:

```r
# R
r_mt = mtcars %>%
  mutate(model = rownames(mtcars)) %>%
  select(cyl, model, hp, drat) %>%
  filter(cyl < 8) %>%
  group_by(cyl) %>%
  summarise(
    hp_mean = mean(hp),
    drat_mean = mean(drat),
    drat_std = sd(drat),
    diff = max(drat) - min(drat)
  ) %>%
  arrange(drat_mean) %>%
  as.data.frame
```

The same series of transformations written in Python would follow:

```python
# Python
def transform1(x):
    return max(x) - min(x)

def transform2(x):
    return max(x) + 5

py_mt = (
    mtcars
    .loc[:, ["cyl", "model", "hp", "drat"]]  # select
    .query("cyl < 8")                        # filter
    .groupby("cyl")                          # group_by
    .agg({                                   # summarise; agg abbreviates aggregate
        'hp': 'mean',
        'drat': ['mean', 'std', transform1, transform2]
    })
    .sort_values(by=[("drat", "mean")])      # MultiIndex sort (unique to pandas)
)
py_mt
```

## Merge

Natively, R supports the `merge` function, and similarly pandas has the `pd.merge` function:

```r
# R
df1 %>% merge(df2, by = "common_column")
```

```python
# Python
df1.merge(df2, how="inner", on="common_column")
```

Besides the `merge` function, pandas also has a `join` function (which joins on the index). dplyr goes further, with explicit joins such as `left_join`, `right_join`, `inner_join` and `anti_join`.

## Inplace

In R there's the compound assignment pipe operator `%<>%`, which is similar to the `inplace=True` argument found in some pandas functions, but not all. 🙁 Apparently pandas is going to remove `inplace` altogether.

## Debugging

In R, we have the `browser()` function:

```r
# R
unique(iris$species) %>%
  lapply(function(s){
    browser()
    iris %>% filter(species == s)
    ...
  })
```

It lets you step into the function, which is extremely useful when you want to do some debugging. In Python, there's the `set_trace` function:

```python
# Python
from IPython.core.debugger import set_trace

(
    iris
    .groupby("species")
    .apply(lambda groupedDF: set_trace())
)
```

Last but not least, if you really need to use some R function you can always rely on the `rpy2` package. For me, I rely on it a lot for plotting.
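One footnote on the summarise translation: the dict-of-lists `agg` above produces MultiIndex columns, which is what forces the awkward `("drat", "mean")` sort key. Newer pandas (0.25+) also supports *named aggregation*, which yields flat column names much closer to `dplyr::summarise`. Here is a minimal, self-contained sketch; the `mtcars_toy` values below are made-up stand-ins, not the real mtcars data:

```python
import pandas as pd

# toy stand-in for mtcars (values are illustrative, not the real dataset)
mtcars_toy = pd.DataFrame({
    "cyl":  [4, 4, 6, 6, 8],
    "hp":   [93, 109, 110, 105, 245],
    "drat": [3.85, 4.11, 3.90, 2.76, 3.21],
})

py_mt = (
    mtcars_toy
    .query("cyl < 8")                              # filter
    .groupby("cyl")                                # group_by
    .agg(                                          # summarise with flat column names
        hp_mean=("hp", "mean"),
        drat_mean=("drat", "mean"),
        drat_std=("drat", "std"),
        diff=("drat", lambda x: x.max() - x.min()),
    )
    .sort_values("drat_mean")                      # arrange: plain column name works now
    .reset_index()                                 # bring the grouping key back as a column
)
print(py_mt)
```

Each keyword argument maps an output column name directly to a `(column, function)` pair, so there is no MultiIndex to flatten afterwards, and `sort_values` takes an ordinary column name.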
