2019: Full Scale Schema Modeling

We need to go all the way back to 1974 to be kindly reminded by the (Dutch) computing science pioneer Edsger Dijkstra about the importance of “the separation of concerns” in his “On the Role of Scientific Thought”.

Multi-Level Concern Architectures

Edsger Dijkstra was working together with prof. Peter Naur on the European Algol 60 project.

Peter Naur was my professor when I enrolled at the University of Copenhagen, in his second year of holding the university's very first chair as professor of the new field called computing science.

I remember prof. Dijkstra quite well.

So I am thankful to Martijn Evers for reminding me of the separation of concerns philosophy.

I will let Martijn explain the roles of concerns in Data Architectures: “The number of new “data/information” modeling ideas, approaches and designs are proliferating, spurred by technological and business needs, and as architects we need to get a grip on this new dynamic… This was my cue to change the game.

Instead of designing yet another variety of Data Vault, Anchor Modeling or Fact Based Modeling I wanted to turn things around.

Not the modeling technique/approach should have the center stage, but their underlying concerns.

The varieties of modeling are endless, but the concerns they try to manage are not… Because we see a movement where on the one hand we have austere and lean modeling approaches that don’t function on their own, but just focus on a select set of concerns, and on the other hand we see rather dominant modeling approaches that try to do everything, but not very well.

A one-size-fits-all approach is becoming less and less realistic.

Also, data modeling architecture has always been seen as very static, but that is also changing rapidly.

We can’t fix it all in one go, so gradual change in the way organizations do data modeling/organization is becoming more important.

This all leads me to believe that an agnostic approach to understanding and managing ‘data modeling’ is becoming a necessity”.

In other words, we have to get the concerns out into the light of day.

And we have to understand how they (might) depend on each other.

This will give us a “road map” of several different routes that you can take to solve some specific data delivery challenges.

In my (2016) book, Graph Data Modeling for NoSQL and SQL, I developed a set of requirements for data modeling across the board.

I have now made sort of an amalgamation with Martijn Evers’ concerns, which, like mine, are multi-level.

I propose these 3 levels:

- Business level concerns
- Solution level (logical level) concerns
- Implementation concerns

Let us look across those 3 levels in the context of schema design.

Note that property graphs (the subject area of the forthcoming GQL standard) are very close to the business concept level (getting from whiteboard to database can be very easy), which means that all 3 levels are also relevant in the (not so narrow) context of schema design for graphs.

With respect to the data quadrant matrix most (but not all) concerns have a natural “home” in one of the quadrants (indicated in the list below).

Some concerns are relevant in two or more quadrants.

Also note that there is some “innate inheritance” for the types across the 3 levels.

General Concerns

- Business-facing terminology should prevail at all representational levels (possibly with some syntactical variations), Q1, but manifested in all 4, as relevant
- Set algebra support at all levels (a favorite hobby-horse of mine), Q1-4
- Schema first: a valid concern in many business areas requiring rock-solid, validated, governed, business-approved definitions and data of the highest possible quality, Q1
- Schema-less: also a valid concern in many business areas requiring here-and-now loads of data whose meaning and structures are yet to be discovered, Q3, Q4
- Refinement (thanks to Martijn for this formulation of an important concern): … many assume that all models have the same level of refinement, i.e. that transformations are semantically equivalent, or even isomorphic. But some models might contain different levels of abstractions. For this the model needs to become a 3D matrix. Traveling in this cube is needed to represent the abstraction and refinement that we see in actual data/information modeling. Q1, Q4

Business Level Concerns

Solution Level Concerns

- Platform independence: the solution level data schema details must be independent of the data store platform, Q1
- Solution derivation: a solution level schema should be derivable from (a subset of) the business concept schema; some concepts become logical business objects, whereas other concepts become properties of those business objects, Q1, Q4
- Stepwise solution refinement: a solution level schema should be gradually and iteratively extendable (with design decisions), Q1, Q3
- Graph and subgraphs (incl. sets), Q1-4
  - Graphs (a collection of nodes and relationships)
  - Subgraphs (subsets of graphs, for example by way of set algebra)
  - Sets (set algebra)
- Uniqueness: constraints such as concatenated business keys should be definable, Q1
- Identity: identity is closely related to uniqueness; support for solution level details (decoupled from the implementation details) should also be definable, incl. support for identifiers and surrogates; cf. also the next, Q1
- Updatability: ensure that all functional dependencies have been semantically resolved without dangling properties and relationships, and that all identities are in place (this concern can be relaxed in some contexts), Q1
- Schema-controlled audit trails and lineage: the solution level schema should be able to contain technical auditing data, Q1
- Temporal integrity, Q1, Q2
- Time series aspects, Q1, Q2
- Property graph types (Q1):
  - Generic nodes (otherwise typeless)
  - Business objects (labeled concepts)
  - Multitype nodes
  - Properties of business objects or type-less nodes (properties are concepts, which share the identity of the business object that owns them)
  - Named, directed relationships with precise cardinalities, where applicable
- Mandatory properties: should be definable, Q1

Physical Level Concerns

- Intelligent ingestion: inferred or implicit types, load of generic node types, labeled node types and relationship types without explicit up-front schema definitions (but with after-the-fact physical schema details available), Q3, Q4
- Easy mapping of transformations: for instance, physical level schema details should be easily mapped back to the solution level schema details, for example by way of visualization of abstractions etc., Q1
- Complete lineage: easy backtracking from the physical schema to the solution schema and further on to the business concept model, Q1
- Constraint facilities (to support the solution schema details), Q1
- Indexing facilities for identity, uniqueness and ordering, Q1
- Temporal integrity support, Q1

Consider all of the above concerns as an initial bet.
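To make a few of the solution level concerns more tangible, here is a minimal Python sketch of a labeled node type with mandatory properties and a concatenated business key. The names `NodeType` and `validate`, and the customer example, are my own assumptions for illustration; they are not taken from GQL or from any existing schema language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """A labeled business object type (hypothetical structure, illustration only)."""
    label: str
    mandatory: frozenset   # names of properties that must be present
    business_key: tuple    # names of properties forming the concatenated business key


def validate(nodes, node_type):
    """Check mandatory properties and business key uniqueness.

    Each node is a plain dict of property name -> value.
    Returns a list of human-readable violation messages.
    """
    required = node_type.mandatory | set(node_type.business_key)
    violations = []
    seen_keys = set()
    for position, props in enumerate(nodes):
        missing = sorted(required - set(props))
        if missing:
            violations.append(f"node {position}: missing required {missing}")
            continue
        key = tuple(props[name] for name in node_type.business_key)
        if key in seen_keys:
            violations.append(f"node {position}: duplicate business key {key}")
        seen_keys.add(key)
    return violations


# Assumed example data: a customer is identified by country plus customer number.
customer = NodeType("Customer", frozenset({"name"}), ("country", "customer_no"))
nodes = [
    {"country": "DK", "customer_no": 1, "name": "A"},
    {"country": "DK", "customer_no": 1, "name": "B"},   # same business key as node 0
    {"country": "NL", "customer_no": 1},                # mandatory "name" is missing
]
violations = validate(nodes, customer)
```

In this sketch uniqueness and mandatory properties are checked after the fact; a real schema facility would more likely enforce them through the constraint and indexing facilities listed under the physical level.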

There are certainly things to discuss!

Different Routes to an Eventual Schema

Consolidated Means Lots of Concerns!

As you can see from the list above, tight governance (Q1) equals many concerns; 2 out of 3, in fact.

And there are bound to be quite a few dependencies between them.

There are only 2 concerns which are not found in Q1:

- Schema-less, and
- Intelligent ingestion

They are somewhat connected and are sort of antagonistic to the idea of tight governance.
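The quadrant tags can be checked mechanically. The following Python sketch transcribes the tags from the general, solution level and physical level lists above and filters on them; the tuple layout and the function name `outside_quadrant` are assumptions of mine. “Q1-4” and “Q1, but manifested in all 4” are both encoded as all four quadrants.

```python
ALL = {1, 2, 3, 4}

# (concern, level, quadrants), transcribed from the lists above
CONCERNS = [
    ("Business-facing terminology", "general", ALL),
    ("Set algebra", "general", ALL),
    ("Schema first", "general", {1}),
    ("Schema-less", "general", {3, 4}),
    ("Refinement", "general", {1, 4}),
    ("Platform independence", "solution", {1}),
    ("Solution derivation", "solution", {1, 4}),
    ("Stepwise solution refinement", "solution", {1, 3}),
    ("Graph and subgraphs", "solution", ALL),
    ("Uniqueness", "solution", {1}),
    ("Identity", "solution", {1}),
    ("Updatability", "solution", {1}),
    ("Schema-controlled audit trails and lineage", "solution", {1}),
    ("Temporal integrity", "solution", {1, 2}),
    ("Time series aspects", "solution", {1, 2}),
    ("Property graph types", "solution", {1}),
    ("Mandatory properties", "solution", {1}),
    ("Intelligent ingestion", "physical", {3, 4}),
    ("Easy mapping of transformations", "physical", {1}),
    ("Complete lineage", "physical", {1}),
    ("Constraint facilities", "physical", {1}),
    ("Indexing facilities", "physical", {1}),
    ("Temporal integrity support", "physical", {1}),
]


def outside_quadrant(quadrant):
    """Names of the concerns that are not tagged with the given quadrant."""
    return [name for name, _level, quads in CONCERNS if quadrant not in quads]


not_in_q1 = outside_quadrant(1)
print(not_in_q1)
```

With the tags as transcribed here, the only concerns outside Q1 are the schema-less concern and intelligent ingestion.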

Some concerns are “global”: Set algebra, visualization paradigms, stepwise refinement, graphs and sub-graphs as well as time-series.

There are also concerns, which apply to a couple of quadrants.
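As one illustration of the set algebra and sub-graph concerns, a graph can be treated as a set of nodes plus a set of relationships, so that subgraphs fall out of ordinary set operations. This is only a sketch under that assumption; the `Graph` class and its method names are hypothetical and do not reflect how any particular graph database works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    nodes: frozenset   # node identifiers
    edges: frozenset   # (source, relationship name, target) triples

    def induced(self, keep):
        """Subgraph on the nodes in `keep`; edges that would dangle are dropped."""
        nodes = self.nodes & frozenset(keep)
        edges = frozenset(
            edge for edge in self.edges if edge[0] in nodes and edge[2] in nodes
        )
        return Graph(nodes, edges)

    def union(self, other):
        return Graph(self.nodes | other.nodes, self.edges | other.edges)

    def intersection(self, other):
        return Graph(self.nodes & other.nodes, self.edges & other.edges)


# Assumed example: a tiny order model.
whole = Graph(
    frozenset({"customer", "order", "product"}),
    frozenset({
        ("customer", "PLACED", "order"),
        ("order", "CONTAINS", "product"),
    }),
)
sales = whole.induced({"customer", "order"})
catalog = whole.induced({"order", "product"})
shared = sales.intersection(catalog)   # only the "order" node, no relationships
rebuilt = sales.union(catalog)         # equal to the whole graph again
```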

Concern Dependencies

I made a quick first round of looking at dependencies between concerns.

Some concerns require the presence of other concerns:

[Figure: dependency graph of the concerns]

I left these concerns unconnected to any prerequisites (for the time being):

- Business-facing terminology
- Business terminology
- Easy mapping of transformations
- Graph and sub-graphs
- Platform independence
- Refinement
- Set algebra
- Solution independence
- Stepwise solution refinement
- Temporal integrity
- Time series

“Prerequisites” is meant in the sense that the “schema designer / user” has to specify something addressed by a concern.

I have almost certainly overlooked a few things; time will show…

Some possible scenarios for working with the schema

We are now able to answer questions about how the to-be-developed property graph schema facility can be employed.

Just look at the dependency graph up above.

Can we work schema-less (without an upfront schema definition)? Yes, we can, so long as the “Intelligent ingestion” is in place.

Can we work schema first? Oh yes, we can.

What are the minimal requirements of working schema first? Well, we need to be able to specify schema details, which are property graph types.

Add to that that there are several other areas of concern, which can be covered by the schema language, according to the actual context.

The concerns are grouped by type of governance and by type of delivery of “schema products”.

Must I embark on almost defining a business glossary (terminology definitions)? No, that particular concern is not required by any other concern.

How do I make a business concept model inside the schema in an easy manner? Well, I must be able to map to standard concept types and standard relationship types.

Those two, in turn, require that we can name the basic dependencies, which become discriminators for creating properties and relationships.

It also requires some business friendly elicitation facility, which in my opinion is visualization (of concept models), but that concern is left optional, at least in the meta architecture depicted in the graph above.

Can I use the schema last approach? Yes, the design is concerned with lifting schema details upwards from physical to logical solutions and from there to the business-facing level.
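Questions of this kind amount to walking a dependency graph. The sketch below is a rough Python illustration that encodes only the dependencies spelled out in the answers above, not the full dependency graph; the dictionary `REQUIRES`, the function names and the label “Naming of basic dependencies” are my own wording for the purpose of the example.

```python
# concern -> concerns it directly requires (only what the answers above state)
REQUIRES = {
    "Schema-less": {"Intelligent ingestion"},
    "Schema first": {"Property graph types"},
    "Business concept model": {
        "Standard concept types",
        "Standard relationship types",
    },
    "Standard concept types": {"Naming of basic dependencies"},
    "Standard relationship types": {"Naming of basic dependencies"},
}


def prerequisites(concern):
    """All direct and indirect prerequisites of a concern."""
    found = set()
    stack = [concern]
    while stack:
        for dependency in REQUIRES.get(stack.pop(), ()):
            if dependency not in found:
                found.add(dependency)
                stack.append(dependency)
    return found


def is_required_by_any(concern):
    """True if some other concern lists this one as a prerequisite."""
    return any(concern in dependencies for dependencies in REQUIRES.values())


print(sorted(prerequisites("Business concept model")))
print(is_required_by_any("Business terminology"))
```

The second call mirrors the glossary question: since no other concern requires business terminology, it can be left out without blocking anything else.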

Dealing with Complexity and Contradictions

The forthcoming property graph schema standard, which I have chosen as a scapegoat for demonstrating the most important parts of the full-scale architecture thinking, is complex and has a number of contradictory concerns.

The full scale Data Architecture meta framework, starting off with the four quadrants of the two hard dimensions (governance and delivery styles), is a good framework for architecting even a thing like a schema language to be used in many different contexts and in many different development styles.

I am deeply grateful to Ronald Damhof and Martijn Evers and the other members of the Full Scale Data Architecture community for sharing their thoughts and experiences.

And I look forward to learning more from them.

Keep the good stuff coming, alstublieft! Note to readers who don’t speak Dutch: “alstublieft” is Dutch for “please”!
