Databricks: How to Save Data Frames as CSV Files on Your Local Computer

Photo credit: Mika Baumeister on Unsplash

When I work on Python projects dealing with large datasets, I usually use Spyder.

The Spyder environment is very simple; I can browse working directories, maintain large code bases, and review the data frames I create.

However, if I don't subset the large data, I constantly face memory issues and struggle with very long computation times.

For this reason, I occasionally use Databricks.

Databricks is a platform, available on Microsoft Azure, where you can easily load large amounts of data into “notebooks” and perform Apache Spark-based analytics on it.

If you want to work with data frames and run models using PySpark, you can refer to Databricks’ website for more information.
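For example, here is a minimal sketch of how a data frame like the “Sample” one used later could be created in a notebook (the input path and the inferSchema choice are my assumptions, not from the original post; spark and display are predefined in every Databricks notebook):

# Read a CSV from DBFS into a Spark data frame (the input path is a placeholder)
Sample = spark.read.option("header", "true").option("inferSchema", "true").csv("dbfs:/FileStore/tables/sample_input.csv")

# Inspect the schema and preview the rows
Sample.printSchema()
display(Sample)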

However, while working in Databricks, I noticed that saving files as CSV, which is supposed to be quite easy, is not very straightforward.

In the following sections, I would like to share how you can save data frames from Databricks as CSV files on your local computer with no hassle.

1. Explore the Databricks File System (DBFS)

From the Azure Databricks home page, you can go to “Upload Data” (under Common Tasks) → “DBFS” → “FileStore”.

DBFS FileStore is where you will create folders and save your data frames in CSV format.

By default, FileStore has three folders: import-stage, plots, and tables.
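If you prefer the notebook to the UI, you can list these folders with dbutils, which is available in every Databricks notebook:

# List the contents of FileStore; by default you should see
# import-stage/, plots/, and tables/
display(dbutils.fs.ls("dbfs:/FileStore"))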

2. Save a data frame into CSV in FileStore

Sample.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/Sample.csv")

Using the above code in the notebook, I created a folder “df” and saved a data frame “Sample” into CSV.

It is important to use coalesce(1), since it writes the data frame as a whole, in a single file, rather than as one file per partition.

At the end of this article, I will also demonstrate what happens when you don’t include coalesce(1) in the code.

Once you convert your data frame into CSV, go to your FileStore.

You will see the folder and files you created.

The “part-00000…” file is the actual CSV file to download to your local computer.

I copied the path after /FileStore/ for step 3.
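If you would rather grab this path from the notebook than from the UI, here is a small sketch using dbutils (the folder name matches step 2):

# Find the part file that Spark wrote inside the Sample.csv folder
files = dbutils.fs.ls("dbfs:/FileStore/df/Sample.csv")
csv_path = [f.path for f in files if f.name.startswith("part-")][0]
print(csv_path)  # dbfs:/FileStore/df/Sample.csv/part-00000-...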

3. Download the CSV file to your local computer

In order to download the CSV file located in DBFS FileStore to your local computer, you will have to change the highlighted URL to the following:

https://westeurope.azuredatabricks.net/files/df/Sample.csv/part-00000-tid-8365188928461432060-63d7293d-3b02-43ff-b461-edd732f9e06e-4704-c000.csv?o=3847738880082577

As you noticed, the CSV path (df/Sample.csv/part-00000-tid-8365188928461432060-63d7293d-3b02-43ff-b461-edd732f9e06e-4704-c000.csv) is from step 2.

The number (3847738880082577) is from the original URL.

When you change the URL as described above and press Enter, the CSV file will be downloaded to your local computer automatically.
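If you download files often, you can assemble this URL programmatically. The helper below is purely illustrative: build_download_url is a name I made up, and the default region and workspace ID are just the ones from this example, so substitute your own:

# Hypothetical helper: turn a DBFS FileStore path into a /files/ download URL
def build_download_url(dbfs_path, region="westeurope", workspace_id="3847738880082577"):
    # "dbfs:/FileStore/df/Sample.csv/part-....csv" -> "df/Sample.csv/part-....csv"
    relative_path = dbfs_path.replace("dbfs:/FileStore/", "")
    return f"https://{region}.azuredatabricks.net/files/{relative_path}?o={workspace_id}"

print(build_download_url(csv_path))  # csv_path from the earlier dbutils.fs.ls sketch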

dbutils.fs.rm("/FileStore/df", True)

If you want to delete files in the FileStore, you can simply use the above code; the second argument, True, makes the removal recursive.

Once the folder is deleted, the command returns “True”.

As promised earlier, here is what happens when you don’t include coalesce(1) in the code.

By default, Databricks saves data into many partitions.

coalesce(1) combines all the partitions into one file and solves this problem.

However, it is not a good idea to use coalesce(1) or repartition(1) when you deal with very big datasets (>1 TB, low velocity), because it transfers all the data to a single worker, which causes out-of-memory issues and slow processing.

In this case, partitioning by a column or distributing the write across more than a single worker is recommended, as sketched below.
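For example, both alternatives look roughly like this (the column name “year” and the output paths are placeholders):

# Option 1: partition the output by a column; each value of "year"
# gets its own subfolder of CSV part files
Sample.write.partitionBy("year").option("header", "true").csv("dbfs:/FileStore/df/by_year")

# Option 2: keep several partitions so the write is spread across workers
Sample.repartition(8).write.option("header", "true").csv("dbfs:/FileStore/df/eight_parts")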

Thank you for reading! If you liked what I did, don’t hesitate to follow me on GitHub and connect with me on LinkedIn.

Also, feel free to check out my other article(s): How to Get Twitter Notifications on Currency Exchange Rate: Web Scraping and Automation.

