From CSVs to Tables: Infer Schema Data Types From Raw Spreadsheets

I want to accomplish this while writing any app, at the drop of a hat, without warning.

I don’t want to install Hadoop and have Java errors coming back at me through my terminal.

Don’t EVER let me see Java in my terminal.

UGH:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o43.count.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
    at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
    at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
    at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
    at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
    at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
    at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
```

Python’s “tableschema” Library

Thankfully, there’s at least one other person out there who has shared this desire.

That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy.

Let’s give it a go:

```python
from tableschema import infer

data = 'data/fake.csv'
schema = infer(data, limit=500, headers=1, confidence=0.85)
print(schema)
```

If our dataset is particularly large, we can use the limit argument to restrict the sample size to the first X rows.

Another nice feature is the confidence argument: a 0-1 ratio that governs how many casting errors are tolerated during inference.
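If you’re curious what confidence actually changes, here’s a quick sketch you could run to compare two inference passes. The comparison loop is my own illustration, and whether any columns flip type depends entirely on how messy data/fake.csv is:

```python
from tableschema import infer

# Infer the same file at two different confidence levels.
high_conf = infer('data/fake.csv', headers=1, limit=500, confidence=0.85)
low_conf = infer('data/fake.csv', headers=1, limit=500, confidence=0.55)

# Report any columns whose inferred type differs between the two passes.
for a, b in zip(high_conf['fields'], low_conf['fields']):
    if a['type'] != b['type']:
        print(f"{a['name']}: {a['type']} vs {b['type']}")
```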

Here's what comes back:

```json
{
  "fields": [
    { "name": "id", "type": "integer", "format": "default" },
    { "name": "initiated", "type": "string", "format": "default" },
    { "name": "hiredate", "type": "date", "format": "default" },
    { "name": "email", "type": "string", "format": "default" },
    { "name": "firstname", "type": "string", "format": "default" },
    { "name": "lastname", "type": "string", "format": "default" },
    { "name": "title", "type": "string", "format": "default" },
    { "name": "department", "type": "string", "format": "default" },
    { "name": "location", "type": "string", "format": "default" },
    { "name": "country", "type": "string", "format": "default" },
    { "name": "type", "type": "string", "format": "default" }
  ],
  "missingValues": [""]
}
```

Hey, that’s good enough for me! Now let’s automate the shit out of this.
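That descriptor is useful beyond printing, too. As a side note (my own sketch, not something the pipeline below depends on), tableschema can load a descriptor into a Schema object and cast raw strings to the inferred types:

```python
from tableschema import Schema, infer

# Re-infer the descriptor and load it into a Schema object.
descriptor = infer('data/fake.csv', limit=500, headers=1, confidence=0.85)
schema = Schema(descriptor)

print(schema.field_names)  # ['id', 'initiated', 'hiredate', ...]

# Cast a raw string using the 'date' type inferred for the hiredate column.
print(schema.get_field('hiredate').cast_value('2019-01-23'))  # datetime.date(2019, 1, 23)
```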

Creating a Table in SQLAlchemy With Our New Schema

I’m about to throw a bunch of code in your face right here.

Here’s a monster of a class:

```python
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Text, Date
import pandas as pd
import psycopg2  # Postgres driver used by SQLAlchemy under the hood
from tableschema import Table, infer, Schema

import config
from functions.recursivejson import extract_values


class CreateTablesFromCSVs:
    """Infer a table schema from a CSV."""

    __uri = config.PG_URI
    __engine = create_engine(__uri, convert_unicode=True, echo=True)
    __data = 'data/fake.csv'

    @classmethod
    def get_data(cls):
        """Pull latest data."""
        test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8')
        return test_df

    @classmethod
    def get_schema_from_csv(cls, csv):
        """Infer schema from CSV."""
        table = Table(csv)
        table.infer(limit=500, confidence=0.55)
        schema = table.schema.descriptor
        names = cls.get_column_names(schema, 'name')
        datatypes = cls.get_column_datatypes(schema, 'type')
        schema_dict = dict(zip(names, datatypes))
        return schema_dict

    @classmethod
    def get_column_names(cls, schema, key):
        """Get names of columns."""
        names = extract_values(schema, key)
        return names

    @classmethod
    def get_column_datatypes(cls, schema, key):
        """Convert the inferred type names to types SQLAlchemy recognizes."""
        values = extract_values(schema, key)
        for i, value in enumerate(values):
            if value == 'integer':
                values[i] = Integer
            elif value == 'string':
                values[i] = Text
            elif value == 'date':
                values[i] = Date
        return values

    @classmethod
    def create_new_table(cls, data, schema):
        """Create new table from CSV and generated schema."""
        data.to_sql('faketable',
                    con=cls.__engine,
                    schema='testschema',
                    if_exists='replace',
                    chunksize=300,
                    dtype=schema)


# Read the data, infer the schema from the raw CSV, then create the table.
data = CreateTablesFromCSVs.get_data()
schema = CreateTablesFromCSVs.get_schema_from_csv('data/fake.csv')
CreateTablesFromCSVs.create_new_table(data, schema)
```

The first thing worth mentioning is that I’m importing a function from my personal secret library to extract values from JSON objects.

I’ve spoken about it before.
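If you haven’t read that post, the helper boils down to a small recursive walk over nested dicts and lists. A reconstruction along these lines would do the job (my sketch; the exact code living in functions.recursivejson may differ):

```python
def extract_values(obj, key):
    """Recursively pull every value of a given key from nested dicts/lists."""
    results = []

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key:
                    results.append(v)
                if isinstance(v, (dict, list)):
                    walk(v)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(obj)
    return results
```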

Let’s break down this class:

- get_data() reads our CSV into a Pandas DataFrame.
- get_schema_from_csv() kicks off building a schema that SQLAlchemy can use to build a table.
- get_column_names() simply pulls the column names as half of our schema.
- get_column_datatypes() swaps the datatype names we received from tableschema for SQLAlchemy datatypes.
- create_new_table() uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped (see the sanity check below).
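To convince yourself those datatypes actually landed, a quick sanity check might look like this (a sketch: it reuses the engine URI, table name, and schema name from the class above):

```python
import pandas as pd
from sqlalchemy import create_engine

import config

# Read the freshly created table back and inspect the dtypes Pandas reports.
engine = create_engine(config.PG_URI)
df = pd.read_sql_table('faketable', con=engine, schema='testschema')
print(df.dtypes)
```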

Promising Potential, Room to Grow

While tableschema works some of the time, it isn’t perfect: columns that merely look numeric (ZIP codes with leading zeros, say) or dates in unconventional formats can still come back with the wrong type. The base of what we accomplished still stands: we now have a reliable formula for creating schemas on the fly, provided we trust the inferred schemas to be accurate.

Just wait until next time when we introduce Google BigQuery into the mix.

Originally published at hackersandslackers.com on January 23, 2019.
