One of the main attractions of document databases is the flexibility of the document structure, or schema. For the most part, this flexibility is beneficial. However, there are situations where enforcing some structure on the documents is helpful. This article shows you how to validate your JSON documents against a specified schema using the popular Python library pydantic.

When do you need to validate documents?

A common misconception about using NoSQL databases is that no structure or document schema is required. In most cases, applications have some constraints on their data even if they do not explicitly validate it. For example, there might be fields in the document that the application depends on for its functionality; the application might simply not operate correctly when those fields are missing.

One real-world example of this problem is an application that reads data from another, unreliable application that periodically sends bad data. It would be prudent to highlight any documents that could break the application in such cases. Developers can do this either in the application or at the document level.

In this approach, we specify a pydantic schema for the JSON documents to help identify, at the document level, those that do not match the application's specifications.

Generate JSON testing data

We use the open-source library Faker to generate fake user profiles for this tutorial.
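Below is a minimal sketch of how the data could be generated. The nested phone object and the 100-profile count are assumptions made for this illustration, since Faker's built-in profile() provider does not include phone numbers; the actual generation script in the repo may differ.

```python
import json

from faker import Faker

fake = Faker()

def generate_user_profile() -> dict:
    # fake.profile() produces username, name, mail, website, company,
    # residence, job, address, and more. We add a nested phone object
    # ourselves since the built-in provider does not include one.
    profile = fake.profile()
    profile["phone"] = {
        "home": fake.phone_number(),
        "mobile": fake.phone_number(),
    }
    # Make the document JSON-serializable.
    profile["birthdate"] = profile["birthdate"].isoformat()
    profile.pop("current_location", None)  # tuple of Decimals; not needed
    return profile

# Write the profiles out as a JSON array, ready for import into Couchbase.
profiles = [generate_user_profile() for _ in range(100)]
with open("user_profiles.json", "w") as fh:
    json.dump(profiles, fh, indent=2)
```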

This is the structure of a single document:
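(An illustrative example with placeholder values; Faker randomizes the data on each run, and some generated fields are omitted here for brevity.)

```json
{
  "username": "jsanders",
  "name": "Jane Sanders",
  "mail": "jsanders@example.com",
  "phone": {
    "home": "001-555-013-2234",
    "mobile": "001-555-019-9876"
  },
  "website": [
    "https://www.example.com/",
    "https://blog.example.org/"
  ],
  "company": "Acme Corp",
  "residence": "123 Main Street, Anytown, CA 94001",
  "job": "Software Engineer",
  "address": "456 Oak Avenue, Anytown, CA 94001"
}
```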

To simulate the broken documents, I modify a small portion of the user profiles by deleting some of the mobile phone and mail fields. Our aim is to identify these records, which, in the real world, would be stored in a document database like Couchbase.
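One way to sketch this corruption step (the 10% fraction and the 50/50 choice between fields are arbitrary assumptions):

```python
import random

def corrupt_some_profiles(profiles, fraction=0.1):
    # Delete the mobile phone or mail field from a random subset of
    # profiles to simulate bad data from an unreliable source.
    for profile in random.sample(profiles, k=int(len(profiles) * fraction)):
        if random.random() < 0.5:
            profile["phone"].pop("mobile", None)
        else:
            profile.pop("mail", None)
```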

For our testing, I load the generated JSON data into a bucket on our hosted Couchbase Capella cluster using the import functionality in the built-in web console UI. I specify the username field as the key to uniquely identify each document.

How do you specify a schema for JSON documents?

In the user profile data, I expect the documents to conform to the following structure in my applications:

  • Mandatory fields:
    • username
    • name
    • phone – with JSON elements for “home” & “mobile”
    • mail
    • website – an array
  • Optional fields:
    • company
    • residence
    • job
    • address

Pydantic is one of the most popular libraries in Python for data validation. The syntax for specifying a schema is similar to using type hints for functions in Python. Developers specify the schema by defining a model. Pydantic has a rich set of features for a variety of JSON validations; we will walk through how to represent some of the user profile document specifications.

One thing to note about pydantic is that, by default, it tries to coerce the data types by doing type conversions when possible—for example, converting string ‘1’ into a numeric 1. However, there is an option to enable strict type checking without performing conversions.
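For example, using pydantic's strict types (a small illustration with StrictInt):

```python
from pydantic import BaseModel, StrictInt, ValidationError

class Coerced(BaseModel):
    count: int        # '1' (a string) is silently coerced to 1

class Strict(BaseModel):
    count: StrictInt  # '1' is rejected instead of being coerced

print(Coerced(count="1").count)  # prints 1

try:
    Strict(count="1")
except ValidationError as err:
    print(err)  # value is not a valid integer
```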

In this code example, you see a basic configuration for the UserProfile schema using pydantic syntax:
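(A sketch of what the model could look like, following the field list above. Phone is the nested custom model defined in the next snippet, so in a runnable script that definition must come first.)

```python
from typing import List, Optional

from pydantic import BaseModel, HttpUrl

class UserProfile(BaseModel):
    # Mandatory fields
    username: str
    name: str
    phone: Phone            # nested model, defined in the next snippet
    mail: str
    website: List[HttpUrl]  # every entry must be a well-formed URL
    # Optional fields (in pydantic v1, Optional[...] defaults to None)
    company: Optional[str]
    residence: Optional[str]
    job: Optional[str]
    address: Optional[str]
```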

Each field is specified along with its expected data type. The optional fields are marked as Optional. An array is specified using the List type from Python's typing module, followed by the desired element type.

The username field needs to be a string, while the company field is an optional string. Looking at the other lines in the snippet, the website field is a list of type HttpUrl, one of the many custom types that pydantic provides out of the box. HttpUrl validates that each value is a well-formed URL rather than an arbitrary string. Similarly, pydantic offers types like EmailStr that we could use to ensure email fields are well formed.

If you look at the phone field, it is marked as Phone, a custom type that we define in the next code snippet:
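(A sketch of the nested model with its validators; the extension check assumes Faker-style extensions written with an 'x' suffix, for example 555-0100x123.)

```python
from pydantic import BaseModel, validator

class Phone(BaseModel):
    home: str
    mobile: str

    @validator("home", "mobile")
    def check_no_extension(cls, value):
        # Faker renders extensions as an 'x' suffix, e.g. "555-0100x123".
        # We do not support extensions, so reject such numbers outright.
        if "x" in value:
            raise ValueError("phone numbers with extensions are not supported")
        return value
```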

Here we specify that Phone is composed of two string fields: home and mobile. When a UserProfile is validated, this nested model is checked as well, so the phone field must itself contain valid home and mobile fields.

Pydantic also allows us to validate the contents of fields, not just their type and presence. You can do this by defining validator functions for specific fields. In the above example, we validate the mobile and home fields to check for extensions. Since we do not support extensions, we raise a custom error when one is found. These errors are then reported to the user running the validation.

You can view the schema definition by calling the model's schema_json() method, as shown here:
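```python
print(UserProfile.schema_json(indent=2))
```

This prints the JSON Schema representation of the model, including the nested Phone definition.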

Validating JSON documents against the pydantic schema

Now that we have defined the schema, let's explore how we can validate the documents against it.

Validation can be done using the model's parse_obj method. You pass the document as a dictionary and check for validation exceptions. In this case, we fetch all the documents (up to a specified limit) with a Couchbase query and test them one by one, reporting any errors.
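A sketch of that validation loop, using Couchbase Python SDK 4.x-style imports. The endpoint, credentials, bucket name (user_profiles), and limit are placeholders to adapt to your cluster:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from pydantic import ValidationError

# Placeholder connection details; replace with your Capella endpoint.
cluster = Cluster(
    "couchbases://your-endpoint.cloud.couchbase.com",
    ClusterOptions(PasswordAuthenticator("username", "password")),
)

LIMIT = 100
query = f"SELECT META(p).id AS doc_id, p.* FROM `user_profiles` p LIMIT {LIMIT}"

for row in cluster.query(query):
    doc_id = row.pop("doc_id")
    try:
        # parse_obj validates the dictionary against the schema and
        # raises ValidationError if anything does not conform.
        UserProfile.parse_obj(row)
    except ValidationError as err:
        print(f"Document {doc_id} failed validation:\n{err}\n")
```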

Validation highlights some of the errors observed in the documents:
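(Illustrative output, in pydantic v1's error format:)

```
Document aarias failed validation:
2 validation errors for UserProfile
phone -> home
  phone numbers with extensions are not supported (type=value_error)
phone -> mobile
  field required (type=value_error.missing)
```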

The document with the ID aarias has an extension in its home phone field, and its mobile phone field is missing.

Summary

This user profile example shows how easily we can create custom schemas for our JSON documents. It also shows how to test and validate documents using Python and the pydantic module. This is just the tip of the iceberg, though: there are many more types and constraints that pydantic can validate.

Other approaches are also possible; for example, you can validate data at write time. This can be done quite easily by integrating the schema we defined here into the application and verifying the data before inserting or upserting it into Couchbase.
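A minimal sketch of that idea, assuming collection is a Couchbase Collection object for the target bucket:

```python
from pydantic import ValidationError

def save_user_profile(collection, profile: dict) -> None:
    # Validate first; if this raises ValidationError, nothing is written.
    validated = UserProfile.parse_obj(profile)
    collection.upsert(validated.username, validated.dict())
```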

The entire code for this demo can be found on GitHub, along with the instructions to generate the data and run the scripts.

Posted by Nithish Raghunandanan

Nithish is an engineer who loves to build products that solve real-world problems in short spans of time. He has experience across different areas of the industry, having worked at diverse companies in Germany and India. Apart from work, he likes to travel and to engage with the tech community through meetups and hackathons. In his free time, he likes to try things out by hacking them together.