Sroka — a Python library to simplify data access

Sroka — a Python library to simplify data accessMartyna Urbanek-TrzeciakBlockedUnblockFollowFollowingApr 24Some say that in the data world, getting and preparing the data constitutes 80% of the whole process, leaving the rest to actual analysis and visualisation.

That’s true in many cases, but one thing is certain, combining a few data sources together can be a tedious task.

We, at Fandom, saw this as an area for improvement and built a Python library — Sroka (https://github.

com/Wikia/sroka) to ease our lives.

Sroka is a library that with simple Python functions lets us directly query or download data from various sources (through APIs), in an analysis-ready format.

What is Sroka?Sroka is a bird species in Polish (eng.

magpie).

It is known for loving and collecting all shiny things.

By analogy, Sroka the library, allows you to access all the important data and gather them in one place easily.

Sroka provides a set of custom functions for each data source.

Their descriptions are gathered in respective README.

md files.

It is a product of teamwork, you can find all contributors on github.

The library currently supports following data sources:AWS (Athena and S3)Google Ad ManagerGoogle AnalyticsGoogle SheetsMoatQuboleRubiconWhy we built it?Working in the Ad Engineering team, we have access to many kinds of data — information that we gather in our own data warehouse, ads data from Google Ad Manager, webpage data from Google Analytics, and if that’s not enough — we also have data and dashboards provided by various third-party partners.

Apart from differences in how data is aggregated and what information it provides, each of the APIs that we can use to gather data has its own way to query results.

Each of them requires you to provide a separate set of credentials too.

Sroka keeps them all in a sroka_config.

ini file in your home directory and automatically passes them with your function calls.

Exemplary sroka_config.

ini fileWho can use it?Anyone that uses data and has access to the above mentioned data sources.

We have built it within an analytics “fraction” of the engineering team as it is analysts who used to query all of the sources on a daily basis.

However, the ultimate goal is to make data querying easier and more accessible to a wider audience.

Taking the data out of clickable UI’s into code based analysis encourages engineers to take part in the analysis process too.

ExampleBefore you use sroka, you have to install the library.

You can do it using pip:pip install srokaMost of our analyses are conducted within Jupyter Notebooks.

It is common for us to combine two (or more) data sources.

Sometimes the goal is to combine different information from two sources, other times — we want to make sure that the data matches.

The latter allows us to cross check e.

g.

whether our custom events are sent or whether Google Ad Manager is configured correctly.

Sroka makes it very easy.

We start off with all the right imports, including Sroka functions.

Here we will use data stored in Athena and Google Ad Manager.

# Athena APIfrom sroka.

api.

athena.

athena_api import query_athena, done_athena# GAM APIfrom sroka.

api.

google_ad_manager.

gam_api import get_data_from_admanager# data wranglingimport pandas as pdimport numpy as npimport matplotlib.

pyplot as pltimport seaborn as snsimport matplotlib.

dates as mdates# seaborn default plot designsns.

set()We then query the sources directly from notebook.

Simplified data acquisition and wrangling workflow would look like this:# GAM querystart_day = '01'end_day = '31'start_month = '03'end_month = '03'year = '2019'query = """WHERE CUSTOM_TARGETING_VALUE_ID IN ([wiki_ids])"""dimensions = ['DATE']columns = ['TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS']start_date = {'year': year, 'month': start_month, 'day': start_day}stop_date = {'year': year, 'month': end_month, 'day': end_day}print('starting.

')data_raw_gam = get_data_from_admanager(query, dimensions, columns, start_date, stop_date)print('data gathered')# create df copydf_gam = data_raw_gam.

copy()# change column namesdf_gam.

rename( columns={ 'Dimension.

DATE': 'Date', 'Column.

TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS': 'Impressions' }, inplace=True)# change column format to datetime, it is needed to plotdf_gam['Date'] = pd.

to_datetime(df_gam['Date'])# set indexdf_gam.

set_index('Date', inplace=True)# check first rowsdf_gam.

head()Querying and preparing data from Athena:# download Athena datadf_athena = query_athena('''SELECT concat(year, '-', month, '-', day) AS day, count([ad_impressions])FROM [fandom_ads_data_table]WHERE year = '2019' AND month = '03' AND wiki_id IN ([wiki_ids])GROUP BY CONCAT(year, '-', month, '-', day)ORDER BY day ASC''')# change column namesdf_athena.

rename( columns={ 'day': 'Date', 'count[ad_impressions]': 'Impressions' }, inplace=True)# change column format to datetime, it is needed to plotdf_athena['Date'] = pd.

to_datetime(df_athena['Date'])# set indexdf_athena.

set_index('Date', inplace=True)df_athena.

head()Having both data sets prepared, we can visualise and compare the results.

# initiate plotfig, ax = plt.

subplots(figsize=(15,6))ax.

plot(df_athena['Impressions'], label='Athena')ax.

plot(df_gam['Impressions'], label='Google Ad Manager')# plot elements, title, labels etc.

plt.

ylabel('# impressions', fontsize=13)plt.

title('[wiki_id] all ad impressions by source', fontsize=16, pad=20)plt.

xticks(rotation=30)plt.

legend(bbox_to_anchor=(1.

25, 0.

5), frameon=False, fontsize=14)plt.

ylim(0)plt.

yticks([])ax.

xaxis.

set_major_locator(mdates.

DayLocator(interval=3))ax.

xaxis.

set_major_formatter(mdates.

DateFormatter('%Y/%m/%d'));Output plot generated within Jupyter notebook with code above.

With these few code cells we are able to get a lot of insight, with no need to jump between data sources and manually downloading reports.

We can immediately see that both data sources follow the same pattern, it shows seasonality, but also that as the impressions increase, so do the discrepancies between data sources.

Depending on the situation, this is maybe a known issue, or a signal to verify the reporting.

What’s next?We hope that Sroka will help to simplify the process of data acquisition in places outside of Fandom.

If you have any suggestions on how to improve or extend Sroka’s capabilities, feel free to open an issue on GitHub (https://github.

com/Wikia/sroka).

Also if you feel that additional data sources should be included everyone is welcome to contribute and create a PR.

This article was co-authored by Dorota Mierzwa.

. More details

Leave a Reply