A Simple Approach To Templated SQL Queries In Python

Accelerate and automate your analytics

Sergei Izrailev · Apr 9

There are numerous situations in which one would want to insert parameters in a SQL query, and there are many ways to implement templated SQL queries in Python.

Without going into comparing different approaches, this post explains a simple and effective method for parameterizing SQL using JinjaSql.

Besides many powerful features of Jinja2, such as conditional statements and loops, JinjaSql offers a clean and straightforward way to parameterize not only the values substituted into the where and in clauses, but also SQL statements themselves, including parameterizing table and column names and composing queries by combining whole code blocks.

Basic parameter substitution

Let’s assume we have a table transactions holding records about financial transactions.

The columns in this table could be transaction_id, user_id, transaction_date, and amount.

To compute the number of transactions and the total amount for a given user on a given day, a query directly to the database may look something like

```sql
select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = 1234
    and transaction_date = '2019-03-02'
```

Here, we assume that the database will automatically convert the YYYY-MM-DD format of the string representation of the date into a proper date type.

If we want to run the query above for an arbitrary user and date, we need to parameterize the user_id and the transaction_date values.

In JinjaSql, the corresponding template would simply become

```sql
select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = {{ uid }}
    and transaction_date = {{ tdate }}
```

Here, the values were replaced by placeholders with Python variable names enclosed in double curly braces {{ }}.

Note that the variable names uid and tdate were picked only to demonstrate that they are variable names and don't have anything to do with the column names themselves.

A more readable version of the same template stored in a Python variable is

```python
user_transaction_template = '''
select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = {{ user_id }}
    and transaction_date = {{ transaction_date }}
'''
```

Next, we need to set the parameters for the query.

```python
params = {
    'user_id': 1234,
    'transaction_date': '2019-03-02',
}
```

Now, generating a SQL query from this template is straightforward.

```python
from jinjasql import JinjaSql

j = JinjaSql(param_style='pyformat')
query, bind_params = j.prepare_query(user_transaction_template, params)
```

If we print query and bind_params, we find that the former is a parameterized string, and the latter is an OrderedDict of parameters:

```
>>> print(query)

select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = %(user_id)s
    and transaction_date = %(transaction_date)s

>>> print(bind_params)
OrderedDict([('user_id', 1234), ('transaction_date', '2019-03-02')])
```

Running parameterized queries

Many database connections have an option to pass bind_params as an argument to the method executing the SQL query on a connection.

For a data scientist, it may be natural to get results of the query in a Pandas data frame.

Once we have a connection conn, it is as easy as running read_sql:

```python
import pandas as pd

frm = pd.read_sql(query, conn, params=bind_params)
```

See the JinjaSql docs for other examples.
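To make the idea of running a parameterized query concrete without a database server, here is a minimal, self-contained sketch using the stdlib sqlite3 driver. Note this is an illustration, not the post's JinjaSql flow: sqlite3 takes qmark-style `?` placeholders rather than pyformat, and the table contents are invented for the example.

```python
import sqlite3

# In-memory database with a made-up transactions table for illustration.
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table transactions '
    '(transaction_id, user_id, transaction_date, amount)')
conn.executemany('insert into transactions values (?, ?, ?, ?)', [
    (1, 1234, '2019-03-02', 20.0),
    (2, 1234, '2019-03-02', 10.5),
    (3, 5678, '2019-03-02', 3.0),
])

# qmark-style placeholders; values are passed separately to the driver,
# which handles quoting and escaping.
query = '''
select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = ?
    and transaction_date = ?
group by user_id
'''
row = conn.execute(query, (1234, '2019-03-02')).fetchone()
print(row)  # (1234, 2, 30.5)
```

The same separation of query text and bind parameters is what read_sql's params argument relies on.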

From a template to the final SQL query

It is often desired to fully expand the query with all parameters before running it.

For example, logging the full query is invaluable for debugging batch processes because one can copy-paste the query from the logs directly into an interactive SQL interface.

It is tempting to substitute the bind_params into the query using python built-in string substitution.

However, we quickly find that string parameters need to be quoted to result in proper SQL.

For example, in the template above, the date value must be enclosed in single quotes.

```
>>> print(query % bind_params)

select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = 1234
    and transaction_date = 2019-03-02
```

To deal with this, we need a helper function to correctly quote parameters that are strings.

We detect whether a parameter is a string by calling

```python
from six import string_types

isinstance(value, string_types)
```

This works for both Python 3 and 2.7.

The string parameters are converted to the str type, single quotes in the values are escaped by another single quote, and finally, the whole value is enclosed in single quotes.

```python
from six import string_types

def quote_sql_string(value):
    '''
    If `value` is a string type, escapes single quotes in the string
    and returns the string enclosed in single quotes.
    '''
    if isinstance(value, string_types):
        new_value = str(value)
        new_value = new_value.replace("'", "''")
        return "'{}'".format(new_value)
    return value
```

Finally, to convert the template to proper SQL, we loop over bind_params, quote the strings, and then perform string substitution.

```python
from copy import deepcopy

def get_sql_from_template(query, bind_params):
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params
```

Now we can easily get the final query that we can log or run interactively:

```
>>> print(get_sql_from_template(query, bind_params))

select user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from transactions
where user_id = 1234
    and transaction_date = '2019-03-02'
```

Putting it all together, another helper function wraps the JinjaSql calls and simply takes the template and a dict of parameters, and returns the full SQL.
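The quoting and substitution steps can be exercised on their own, without JinjaSql or a database. The sketch below redefines the two helpers in a Python-3-only form (plain str instead of six.string_types) so the snippet is self-contained; the name column and the O'Brien value are made up to show single-quote escaping.

```python
from copy import deepcopy

def quote_sql_string(value):
    # Python-3-only simplification of the helper above.
    if isinstance(value, str):
        return "'{}'".format(value.replace("'", "''"))
    return value

def get_sql_from_template(query, bind_params):
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params

# A pyformat-style query, as produced by JinjaSql's prepare_query.
query = ("select user_id from transactions "
         "where user_id = %(user_id)s and name = %(name)s")
bind_params = {'user_id': 1234, 'name': "O'Brien"}

sql = get_sql_from_template(query, bind_params)
print(sql)
# select user_id from transactions where user_id = 1234 and name = 'O''Brien'
```

The embedded single quote is doubled, so the expanded SQL can be copy-pasted into an interactive session safely.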

```python
from jinjasql import JinjaSql

def apply_sql_template(template, parameters):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict)
    and return the final SQL.
    '''
    j = JinjaSql(param_style='pyformat')
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)
```

Compute statistics on a column

Computing statistics on the values stored in a particular database column is handy both when first exploring the data and for data validation in production.

Since we only want to demonstrate some features of the templates, for simplicity, let’s just work with integer columns, such as the column user_id in the table transactions above.

For integer columns, we are interested in the number of unique values, min and max values, and the number of nulls.

Some columns may have a default value of, say, -1 (the drawbacks of which are beyond the scope of this post); we do want to capture that by reporting the number of default values.

Consider the following template and function.

The function takes the table name, the column name and the default value as arguments, and returns the SQL for computing the statistics.

```python
COLUMN_STATS_TEMPLATE = '''
select {{ column_name | sqlsafe }} as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    {% if default_value %}
    , sum(case when {{ column_name | sqlsafe }} = {{ default_value }} then 1 else 0 end) as num_default
    {% else %}
    , 0 as num_default
    {% endif %}
    , min({{ column_name | sqlsafe }}) as min_value
    , max({{ column_name | sqlsafe }}) as max_value
from {{ table_name | sqlsafe }}
'''


def get_column_stats_sql(table_name, column_name, default_value):
    '''
    Returns the SQL for computing column statistics.
    Passing None for the default_value results in zero output
    for the number of default values.
    '''
    # Note that a string default needs to be quoted first.
    params = {
        'table_name': table_name,
        'column_name': column_name,
        'default_value': quote_sql_string(default_value),
    }
    return apply_sql_template(COLUMN_STATS_TEMPLATE, params)
```

This function is straightforward and very powerful because it applies to any column in any table.

Note the {% if default_value %} syntax in the template.

If the default value that is passed to the function is None, the SQL returns zero in the num_default field.

The function and template above will also work with strings, dates, and other data types if the default_value is set to None.

However, to handle different data types more intelligently, it is necessary to extend the function to also take the data type as an argument and build the logic specific to different data types.

For example, one might want to know the min and max of the string length instead of the min and max of the value itself.
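As a sketch of such an extension (not from the original post), a string-column variant of the template might report length statistics instead of raw min/max values. The template below is only defined, not rendered; it assumes the target SQL dialect has a length() function (as Postgres and MySQL do).

```python
# Hypothetical variant of COLUMN_STATS_TEMPLATE for string columns:
# min/max are taken over length({{ column_name }}) rather than the
# raw value, which is usually more informative for free-form text.
STRING_COLUMN_STATS_TEMPLATE = '''
select {{ column_name | sqlsafe }} as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    , min(length({{ column_name | sqlsafe }})) as min_length
    , max(length({{ column_name | sqlsafe }})) as max_length
from {{ table_name | sqlsafe }}
'''
```

Dispatching between the integer and string templates based on a data-type argument is then a simple if/else in the wrapper function.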

Let’s look at the output for the transactions.user_id column.

```
>>> print(get_column_stats_sql('transactions', 'user_id', None))

select user_id as column_name
    , count(*) as num_rows
    , count(distinct user_id) as num_unique
    , sum(case when user_id is null then 1 else 0 end) as num_nulls

    , 0 as num_default

    , min(user_id) as min_value
    , max(user_id) as max_value
from transactions
```

Note that blank lines appear in place of the {% %} clauses and could be removed.

Summary

With the helper functions above, creating and running templated SQL queries in Python is very easy.

Because the details of parameter substitution are hidden, one can focus on building the template and the set of parameters and then call a single function to get the final SQL.

One important caveat is the risk of code injection.

For batch processes, it should not be an issue, but using the sqlsafe construct in web applications could be dangerous.

The sqlsafe keyword indicates that the user (you) is confident that no code injection is possible and takes responsibility for simply putting whatever string is passed in the parameters directly into the query.
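To see why this matters, plain str.format is used below as a stand-in for the sqlsafe behavior, since both paste the parameter into the query verbatim; the malicious input is invented for illustration.

```python
# str.format substitutes the parameter verbatim, just as sqlsafe does,
# so untrusted input can rewrite the query entirely.
template = 'select count(*) from {table_name}'
malicious_table = 'transactions; drop table transactions; --'

sql = template.format(table_name=malicious_table)
print(sql)
# select count(*) from transactions; drop table transactions; --
```

This is why sqlsafe parameters should only ever come from trusted code, never from user input.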

On the other hand, the ability to put an arbitrary string in the query allows one to pass whole code blocks into a template.

For example, instead of passing table_name='transactions' above, one could pass "(select * from transactions where transaction_date = '2018-03-01') t", and the query would still work.

The code in this post is licensed under the MIT License.

This post first appeared on the Life Around Data blog.

Photo and image by Sergei Izrailev.
