Vaex: Out of Core Dataframes for Python and Fast Visualization

Vaex does not really care about the file format: as long as you can memory map the data, you will live long and prosper.

Apache Arrow

Is hdf5 not new and sexy enough? OK, we support Apache Arrow, which also allows memory mapping and interoperability with other languages.

So… no pandas?

There are some issues with pandas that the original author, Wes McKinney, outlines in his insightful blog post “Apache Arrow and the ‘10 Things I Hate About pandas’”. Many of these issues will be tackled in the next version of pandas (pandas2?), building on top of Apache Arrow and other libraries. Vaex starts with a clean slate while keeping the API similar, and is ready to be used today.

Vaex is lazy

Vaex is not just a pandas replacement. Although it has a pandas-like API for column access, no computation happens when you execute an expression such as np.sqrt(ds.x**2 + ds.y**2). A vaex expression object is created instead, and when printed out it shows some preview values. Calling numpy functions on a vaex expression produces a new expression, which delays the computation and saves RAM.

With the expression system, vaex performs calculations only when needed. The data also does not need to be local: expressions can be sent over a wire and statistics can be computed remotely, something that the vaex-server package provides.

Virtual columns

We can also add expressions to a DataFrame, which results in virtual columns. A virtual column behaves like a regular column but occupies no memory, so adding one to a DataFrame takes no extra memory. Vaex makes no distinction between real and virtual columns; they are treated on an equal footing.

What if an expression is really expensive to compute on the fly? By using Pythran or Numba, we can optimize the computation with manual Just-In-Time (JIT) compilation and squeeze out better performance: more than 2x faster in the post's example. JIT-ed expressions are even supported for remote DataFrames (the JIT-ing happens at the server).

Got plenty of RAM? Just materialize the column. Materializing converts a virtual column into an in-memory array, so you can choose to squeeze out extra performance at the cost of RAM.
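The lazy-expression and virtual-column ideas above can be sketched with a toy model in plain Python. This is not vaex's implementation or API: the `Expr` and `Frame` classes and all of their methods are hypothetical names made up for illustration, and only the standard library is used. The sketch shows that building an expression stores a recipe, that a virtual column keeps only that recipe, and that materializing trades memory for stored values.

```python
import math


class Expr:
    """Toy lazy expression (hypothetical): stores a recipe, not results."""

    def __init__(self, fn, text):
        self.fn = fn      # maps a row index to a value
        self.text = text  # human-readable form of the recipe

    def __add__(self, other):
        return Expr(lambda i: self.fn(i) + other.fn(i),
                    f"({self.text} + {other.text})")

    def __pow__(self, k):
        return Expr(lambda i: self.fn(i) ** k, f"({self.text} ** {k})")

    def sqrt(self):
        return Expr(lambda i: math.sqrt(self.fn(i)), f"sqrt({self.text})")

    def evaluate(self, n):
        # The only place where values are actually computed.
        return [self.fn(i) for i in range(n)]


class Frame:
    """Toy frame (hypothetical) with real and virtual columns."""

    def __init__(self, **columns):
        self.real = {name: list(values) for name, values in columns.items()}
        self.virtual = {}
        self.n = len(next(iter(self.real.values())))

    def __getitem__(self, name):
        # Real and virtual columns are both handed back as expressions.
        if name in self.virtual:
            return self.virtual[name]
        data = self.real[name]
        return Expr(lambda i: data[i], name)

    def __setitem__(self, name, expr):
        self.virtual[name] = expr  # stores the recipe only, no values

    def materialize(self, name):
        # Turn a virtual column into stored values: more RAM, less recomputation.
        self.real[name] = self.virtual.pop(name).evaluate(self.n)


df = Frame(x=[3.0, 6.0], y=[4.0, 8.0])
r = (df["x"] ** 2 + df["y"] ** 2).sqrt()  # nothing is computed here
df["r"] = r                               # virtual column: recipe only
lazy_values = df["r"].evaluate(df.n)      # computed on demand
df.materialize("r")                       # now stored as a real column
```

In this toy, `r.text` holds the recipe `sqrt(((x ** 2) + (y ** 2)))` until `evaluate` is called. A real out-of-core engine would evaluate in chunks over memory-mapped arrays rather than row by row, which this sketch does not attempt to model.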
