Understanding data engineering jargon: schema and master/branch

“If one does not understand a person, one tends to regard him as a fool” — Carl Jung

Well said, Mister Jung. It happens to a lot of us. When an error occurs during a cross-functional project, we tend to blame the party we understand the least. My aim with this article is to explain a few technical terms and concepts you may hear from your data engineer, analyst, or scientist friends. This article focuses on schemas and branches: what they are, how they differ, and how they work together.

First, a little context

Let’s start with what is happening within your company’s overall data infrastructure. In general, many different types of data are coming in and being stored in the company’s “data lake.” Think of a data lake as a place where any and all information is dumped, with no real structure to it; the lake stores data in whatever format it arrives in. There can be payment information from Stripe or PayPal, invoicing information from your billing system, or data fed in by a mobile application. Data can also be put into a “data warehouse,” where it is more structured and formatted to meet specific user needs. If the data lake is a large body of water filling up through streams and rain, then the data warehouse takes water from all of those sources, bottles it, and arranges the bottles by volume, purity, and fluoride level. I won’t go into the details of how data is structured in the lake or the warehouse, but suffice it to say that this is where the data comes from to build the reports and tables used by other members of the organization.
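
To make the lake/warehouse distinction concrete, here is a minimal Python sketch; the file path and field names are invented for illustration. A raw event lands in the lake exactly as received, while the warehouse keeps only cleaned, typed fields.

```python
import json
from pathlib import Path

# A raw event lands in the "lake" exactly as received -- no structure imposed.
raw_event = '{"source": "stripe", "amount": "19.99", "currency": "usd"}'
lake_path = Path("data_lake/payments/2021-03-01.json")  # hypothetical layout
lake_path.parent.mkdir(parents=True, exist_ok=True)
lake_path.write_text(raw_event)

# The "warehouse" keeps a cleaned, typed version arranged for reporting.
record = json.loads(lake_path.read_text())
warehouse_row = {
    "source": record["source"],
    "amount_cents": int(round(float(record["amount"]) * 100)),  # normalized units
    "currency": record["currency"].upper(),
}
print(warehouse_row)  # {'source': 'stripe', 'amount_cents': 1999, 'currency': 'USD'}
```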

This collection of data is live; it happens in real time as events and transactions occur. After the day is over, a process runs to collect the completed day’s data and extract, transform, and load it (a process known as ETL) into the various databases. Users and applications then use this organized data to create further tables and reports.
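
As a rough illustration of that nightly ETL step, here is a hedged sketch using SQLite from Python’s standard library. The table layout and the run_nightly_etl name are assumptions for the example, not a real pipeline.

```python
import json
import sqlite3
from pathlib import Path

def run_nightly_etl(lake_dir: str, db_path: str) -> None:
    """Toy nightly batch: extract the day's files from the lake,
    transform them into typed rows, and load them into a table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(source TEXT, amount_cents INTEGER, currency TEXT)"
    )
    for path in Path(lake_dir).glob("*.json"):                # extract
        event = json.loads(path.read_text())
        row = (                                               # transform
            event["source"],
            int(round(float(event["amount"]) * 100)),
            event["currency"].upper(),
        )
        conn.execute("INSERT INTO payments VALUES (?, ?, ?)", row)  # load
    conn.commit()
    conn.close()

run_nightly_etl("data_lake/payments", "warehouse.db")
```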

Data from different sources is stored in the data lake. From there, on a regular interval (usually daily, after working hours), the databases retrieve data from the lake and transform it into a more structured form.

The ETL process is defined by a set of master instructions (master code) that is used to build the schemas within the database. To understand this last bit, let’s take a deeper dive into schemas and master/branch code.

The schema enigma unravelled

A schema can be defined as the logic behind the organization of data. The set of instructions used to develop a table (or a set of tables) is the schema. It defines how a given table relates to other tables and what each table should contain. A schema is a lot like DNA: it is a part of the database, and also the set of instructions that defines how the database is built.
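
To show what those instructions look like in practice, here is a minimal sketch using SQLite; the tables and columns are invented for illustration. The schema says which tables exist, what each contains, and how they relate to one another (here, payments points back to customers).

```python
import sqlite3

# The schema is the set of instructions: which tables exist, what each
# contains, and how they relate to one another (via customer_id here).
SCHEMA_DDL = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE payments (
    payment_id   INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
    amount_cents INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_DDL)  # build the tables the schema describes
```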

A database will usually contain a main schema used as the source of truth. This schema is built from the master code that defines the tables within it. When data analysts and scientists query data, they are generally querying this main schema, as it provides the most up-to-date information.

A schema also has a physical home on a server. You could literally go to the server farm and find the rack with the disk that contains your data, and that schema will always be organized by the set of instructions guiding it.

Master and branch demystified

Remember the logic behind a schema, the master set of instructions? All of that code is kept in a code repository that a database accesses to determine how to build the tables within itself. You have probably heard of GitHub. GitHub is where the code for schemas lives, and within the repository is a master code base, simply referred to as the master. When a database takes in data from the data lake or other sources, it looks at the master to determine how to arrange its tables. From this set of code the “main schemas” are created.
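
One way to picture this is a build job that fetches the latest master from the repository and applies the schema instructions it finds there. The sketch below assumes a hypothetical repository URL and a schema.sql file; neither refers to a real project.

```python
import sqlite3
import subprocess

REPO_URL = "https://github.com/example-org/warehouse-schemas.git"  # placeholder
CHECKOUT_DIR = "warehouse-schemas"

# Fetch the latest master code before the build.
subprocess.run(
    ["git", "clone", "--branch", "master", REPO_URL, CHECKOUT_DIR],
    check=True,
)

# Apply the instructions from master to (re)build the schema's tables.
with open(f"{CHECKOUT_DIR}/schema.sql") as f:
    sqlite3.connect("warehouse.db").executescript(f.read())
```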

Now what if someone wants to change the master code? Changing it can be disruptive, and confusing to others if they don’t know what was changed and when. When a user wants to experiment or needs to update the master, they create a branch. A branch is a copy of the master, and the main schemas do not look to it when building tables. Within the branch the user can make whatever changes they want without affecting the master. Once they are done, they can propose merging the changes back into the master. At this point the code will usually go through a peer review. Once review and approval are received, the branch is merged into the master, and when the schemas next update their tables, they will follow the updated set of instructions.
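
For readers curious what that workflow looks like in practice, here is a rough sequence of the usual git commands (wrapped in Python here to keep one language throughout; the branch name and commit message are invented, and this assumes you are inside a checked-out repository):

```python
import subprocess

def git(*args: str) -> None:
    """Run a git command, stopping if it fails."""
    subprocess.run(["git", *args], check=True)

git("checkout", "-b", "feature-1.2")   # create a branch: a copy of master to edit
# ... edit the schema code freely; the main schemas ignore this branch ...
git("commit", "-am", "Subtract x from the sum of a and b")
git("push", "origin", "feature-1.2")   # publish the branch and open a pull request
# After peer review and approval, the branch is merged into master.
```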

The database here follows the master code when determining how to structure the data. Feature 1.1 instructs it to simply add the variables a and b. A branch is created during day 2, but the database still follows the master code. Once feature 1.2 is merged prior to day 3, the database uses the updated master code, applying the new feature that subtracts x from the sum of a and b.
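
In code, the figure’s example might look like this (a sketch; the function names are mine):

```python
# Feature 1.1 on master: combine the two fields by simply adding them.
def combine_v1_1(a: float, b: float) -> float:
    return a + b

# Feature 1.2, developed on a branch and merged before day 3:
# subtract x from the sum of a and b.
def combine_v1_2(a: float, b: float, x: float) -> float:
    return (a + b) - x

print(combine_v1_1(2, 3))     # 5 -- days 1 and 2 follow the old master
print(combine_v1_2(2, 3, 1))  # 4 -- day 3 follows the updated master
```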

Since an organization will have many users of the master code, multiple branches can exist simultaneously. Applications such as GitHub manage the individual user branches and help keep track of changes when multiple branches are being merged into the master.

But what database does the branch code affect? Great question! We now have an issue: a user needs to see the results of their code changes, but they shouldn’t run the branch code against the main schema, as it will alter it! The solution is to make a copy of the schema that does not automatically update on a scheduled basis. The reasons for this are two-fold:

  1. If a large number of users each have their own copy of the main schema, the resources required to regularly update every copy become very cost-inefficient.

  2. We want to avoid making any changes to the main schema, as others may be actively reporting on it. If we iterated on code changes and rebuilt a table within the main schema, other users could end up querying incorrect data.

So with a copy of a given schema, and a branch of the master where code can be freely edited, our engineers and analysts are free to develop and experiment without affecting the main schema. Since the tables within a copied schema will generally not update with the latest data from the data lake, the data within these copies can become “stale.” This isn’t a problem: the engineer just needs to rebuild the table with the new data. Because that is a manual process, engineers save time by iterating over their work with the “stale” data until they reach a point where they truly need to test against the most up-to-date data.
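
Continuing the earlier sketches, here is roughly what that copy-and-rebuild workflow could look like (the file names are hypothetical, and run_nightly_etl is the toy ETL function from above):

```python
import shutil
import sqlite3

# Copy the main schema's database file; experiments run against the copy,
# never against the main schema that others are reporting from.
shutil.copy("warehouse.db", "warehouse_dev_copy.db")

dev = sqlite3.connect("warehouse_dev_copy.db")

# Iterate on branch code against the copy; its data slowly goes "stale"
# because the nightly batch does not refresh it. When fresh data is truly
# needed, rebuild the table manually:
dev.execute("DROP TABLE IF EXISTS payments")
dev.commit()
dev.close()
# run_nightly_etl("data_lake/payments", "warehouse_dev_copy.db")  # repopulate
```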

Rolling it all together, how does knowing this help you interact with your analysts and engineers?

So: the master is the main code base where the logic for a schema sits; a schema is the organized set of tables maintained on a physical disk (probably in a server), of which copies can also be made; and branches of the master code allow engineers and analysts to build new features without affecting the main schema.

Let’s walk through a few use cases with this newfound knowledge:

  1. An engineer delivers a newly created revenue report to you, but the totals are calculated incorrectly.
    -> The branch code could be correct, but the schema copy hasn’t been refreshed with the main schema’s data.
    -> The branch code could contain errors, resulting in an incorrect report; but since the code hasn’t been merged into the master code base, there is no risk of other related reports being affected.

  2. A new report passed testing previously, but the next day the very same report failed.
    -> The code may have passed your review and been committed to the branch, but it hasn’t gone through the further peer review required before being merged into the master.
    -> The code may have been merged into the master, but there might have been a delay in the database running the nightly batch. The code is still valid; the engineers just need to resolve workload-related issues to complete the nightly batch process. (This is why we don’t automatically update every schema copy in the batch: it could slow down nightly runs.)

Thinking through these possibilities can help resolve issues and create better expectations among engineers, analysts and their various business partners.

On the developer side, be transparent in your communication with your business partners to help them understand what stages of development exist, and set checkpoints so they know when the data is truly ready to review. As a consumer of reports delivered by engineers, be clear and concise about your concerns and ask your engineer how they are ensuring the integrity of the data.

With a greater understanding of these terms, and a little more insight into the data engineering process, I hope each business partner can see more eye to eye when it comes to collaborating.

