Using Python in Power BI

In this article we will prepare a Python environment step by step, enable Python in Power BI, import data, apply clustering to the data, and create custom visuals in Power BI using Python. The article assumes that you have a working Python environment and Power BI installed.

Creating the data import script in Python

The dataset that we'll be working with is the Boston Housing dataset, which is available in scikit-learn. After importing it we will apply k-means clustering to try to identify any homogeneous groups within the data. I will not go through the steps taken to create the import code, as the main focus of this article is to showcase using Python in Power BI. I recommend creating standalone scripts of the code that you are going to use, in order to test and debug any problems before using it in Power BI.

In addition to virtualenv and pipenv you can also use the conda distribution. In this example I'm using pipenv to manage the Python environment that Power BI will use. We will run the following command to create the Python environment and install the required libraries:

pipenv install numpy pandas matplotlib seaborn scikit-learn

In case you are using my code from GitHub, it is enough to run pipenv install, as it will automatically install the required libraries from the existing Pipfile.

Lastly, we want to check the actual Python path for this virtual environment, since we will need to refer to it from Power BI:

pipenv --venv

The location of my virtual environment was C:\Users\tomra\.virtualenvs\python_and_powerbi-0r4DNLn9, so I copy that path to the clipboard.

NOTE: You can use conda to manage a virtual environment for Power BI.

In Power BI Desktop, open the Options dialog and go to the Python scripting settings. Set the Python home directory to the Scripts folder in the path where your virtual environment exists. In my case it is C:\Users\tomra\.virtualenvs\python_and_powerbi-0r4DNLn9\Scripts. Now we are ready to utilize Python code in Power BI, so get ready!

Power BI

Importing data using a Python script

We go to the Home tab in the ribbon, click Get data and choose the More option to start our data import. Choose the Python script option as the data source, then copy and paste the code below into the Script text area and click OK.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# utilize the sklearn.datasets package to load the Boston Housing dataset
boston = load_boston()

# scale the data to same value range first since PCA
# is sensitive to the scaling of data
sc = StandardScaler()
X = sc.fit_transform(boston.data)

# create PCA with n_components=2 to allow visualization in 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# divide data into 5 clusters (refer to .ipynb for motivation)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10)
y_kmeans = kmeans.fit_predict(X_pca)

# create pandas dataframe of the housing data for Power BI
columns = np.append(boston.feature_names, ['MEDV', 'PC1', 'PC2', 'CLUSTER'])
data = np.concatenate((boston.data, boston.target.reshape(-1, 1), X_pca, y_kmeans.reshape(-1, 1)), axis=1)
df_housing = pd.DataFrame(data=data, columns=columns)
# we need to convert all columns to strings because of the different
# decimal separator in Python (.) and the Finnish locale (,) that Power BI uses.
# comment out the line below if Power BI uses dot as a decimal separator.
df_housing = df_housing.astype('str')

# create pandas dataframe of the pca data for Power BI
columns = np.append(boston.feature_names, ['VARRATIO'])
data = np.concatenate((pca.components_, pca.explained_variance_ratio_.reshape(-1, 1)), axis=1)
df_pca = pd.DataFrame(data=data, columns=columns, index=['PC1', 'PC2'])
df_pca = df_pca.astype('str')
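
The script fixes the number of clusters at five and refers to the accompanying notebook for the motivation. As a rough, standalone illustration of how such a value can be chosen (a sketch of the elbow method, not the notebook code itself), you can repeat the same preprocessing and compare the k-means inertia over a range of cluster counts:

# elbow-method sketch: repeat the preprocessing from the import script
# and compare k-means inertia for different cluster counts
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_boston().data)
X_pca = PCA(n_components=2).fit_transform(X)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    km.fit(X_pca)
    inertias.append(km.inertia_)

# look for the "elbow" where adding more clusters stops paying off
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('within-cluster sum of squares (inertia)')
plt.show()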
Back in Power BI, in the next window we are able to choose which datasets to import.

Power BI with decimal comma

We need to replace the decimal separator from dot (.) to comma (,) in order to set the correct data type. Repeat this for both datasets. Another option is to modify the import script and remove or comment out the rows where we set the data type to strings, df_housing = df_housing.astype('str') and df_pca = df_pca.astype('str'). Power BI should then identify the data types correctly during import.

Next we create custom visuals with Python. Depending on your screen resolution, you might have to hide the script pane to see the visual. Repeat the same step to create a heatmap for the second principal component, but use the code snippet below instead; it uses the data from the second principal component and produces a vertical visualization that can be placed on the left side of the scatter plot.

import matplotlib.pyplot as plt
import seaborn as sns

dataset.index = ['PC1', 'PC2']
plt.figure(figsize=(2, 8))
data = dataset.loc['PC2', :].to_frame().sort_values(by='PC2', ascending=False)
sns.heatmap(data, cmap='plasma', square=True, annot=True, cbar=False, xticklabels='')
plt.show()

I recommend testing your Python visuals in some other tool before using them in Power BI. Sort the chart in ascending order by the Axis instead of the count.

This is how my version of the report turned out.

Summary

In this article we have set up a Python virtual environment and installed the required libraries for data transformation and visualization.
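
As a final practical note on the recommendation above to test Python visuals outside Power BI first: inside a Python visual, Power BI exposes the fields you drag onto the visual as a pandas dataframe named dataset, so you can emulate that dataframe in a plain Python session. The sketch below is one way to do this for the heatmap above; it assumes the visual is fed the same fields as the df_pca table from the import script (without the string conversion).

# rebuild an equivalent of Power BI's 'dataset' dataframe from the same
# PCA results used in the import script, then run the visual's code as-is
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

boston = load_boston()
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(boston.data))

columns = np.append(boston.feature_names, ['VARRATIO'])
data = np.concatenate((pca.components_, pca.explained_variance_ratio_.reshape(-1, 1)), axis=1)
dataset = pd.DataFrame(data=data, columns=columns)  # stand-in for Power BI's dataset

# the visual's script, unchanged from the snippet used in Power BI
dataset.index = ['PC1', 'PC2']
plt.figure(figsize=(2, 8))
data = dataset.loc['PC2', :].to_frame().sort_values(by='PC2', ascending=False)
sns.heatmap(data, cmap='plasma', square=True, annot=True, cbar=False, xticklabels='')
plt.show()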
