
Help Us Build Our Roadmap

Samuel Colvin
15 mins
2023/06/13

Back in February I announced Pydantic Inc., but I didn't explain what services we were building.

Today I want to provide a little more detail about what we're planning to build, and get your feedback on the components via a short survey.

In return for giving us your honest feedback, you can opt in to our early access list and be invited to the closed beta of our platform once it's ready to use.

At the core of Pydantic's use is always data — Pydantic brings schema and sanity to your data.

The problem is that even with Pydantic in your corner, working with data when it leaves Python often still sucks.

We want to build a data platform to make working with data quick, easy, and enjoyable — where developer experience is our north star.

Before explaining what we're going to build, I should be explicit about what we're not building:

  • We're not building a new database or querying engine
  • We're not pretending that non-developers (or AI) can do the job of a developer — we believe in accelerating developers, not trying to replace them — we'll have CLIs before we have GUIs
  • We're not doing 314 integrations with every conceivable technology
  • Similarly, we're not going to have SDKs for every language — we'll build a few for the languages we know best, and provide a great API for the rest

There are five key components to the Pydantic Data Platform that we're thinking of building. We want your feedback on these components — which you are most excited about, and which you wouldn't use.

We'll use your feedback to decide the order in which we build these features, and to help us build them in a way that works for you.

Here is a brief description of each component (each is explained in more detail below):

  1. Python Analytics/Observability — a logging and metrics platform with tight Python and Pydantic integration, designed to make the data flowing through your application more readily usable for both engineering and business analytics.
  2. Data Gateway for object stores — add validation, transformation and cataloging in front of object stores like S3, with a schema defined in Pydantic models then validated by our Rust service.
  3. Data Gateway for data warehouses — the same service as above, but integrated with your existing data warehouse.
  4. Schema Catalog — for many, Pydantic already holds the highest-fidelity representation of their data schemas. Our Schema Catalog will take this to the next level, serving as an organization-wide single source of truth for those schemas, tracking their changes, and integrating with our other tools and your wider platform.
  5. Dashboards and UI powered by Pydantic models — a managed platform to deploy and control dashboards, auxiliary apps and internal tools where everything from UI components (like forms and tables) to database schema would be defined in Python using Pydantic models.

Please complete our short survey to give us your feedback on these components and to be added to our waiting list.

Here is a little more detail on each of the features introduced above.

1. Analytics/Observability for Python {#3-1-analyticsobservability-for-python}

For many years observability/logging/analytics platforms have frustrated me for two reasons:

  1. Logging (what exactly happened) and metrics (how often did it happen) are separate. I'm not satisfied with the existing tools for recording and viewing both together, in Python or otherwise.
  2. Observability (DevOps/developer insights) and business analytics are completely disconnected, even though they're frequently powered by the same data.

I want all four of these views in one place, collected with the same tool.

  • Why can't I collect and view information about recent sign-ups as easily as information about recent exceptions?
  • Why can't logs of transactions give me a view of daily or monthly revenue?

Our Solution

We would give developers three tools:

  1. An SDK to collect data in Python applications, with tight Pydantic integration
  2. A dashboard to view that data, either in aggregate or for individual events, including the ability to build reports for other parts of the business
  3. A lightweight Python ORM to query the data, to do whatever you like with it

We see use cases for this tool across many domains — from web applications and APIs where FastAPI is already widely used, to machine learning preparation and LLM validation, where the Pydantic package is already used by OpenAI, LangChain, HuggingFace, Prefect and others.

Our goal is to make it easy enough to integrate (think: setting one environment variable) that you'd install it in a 50-line script, but powerful enough to create monitoring dashboards and business intelligence reports for your core applications.

Here's how this might look:

Analytics — Direct use
from pydantic_analytics import log  # name TBD

async def transaction(payment: PaymentObject):
    ...
    log("transaction-complete amount={payment.amount}", payment)

PaymentObject could be a Pydantic model, dataclass or TypedDict. transaction-complete would uniquely identify this event, amount would be shown in the event summary, and payment would be visible in the event details.

This would allow you to both view details of the transaction, and aggregate by amount.

Pydantic Integration

The data you want to collect and view is often already passing through Pydantic models, so we can build a service that integrates tightly with Pydantic to extract as much context as possible from the data with minimal developer effort.

Analytics — Pydantic Integration
from datetime import datetime
from pydantic import BaseModel, EmailStr, SecretStr

class Signup(BaseModel, analytics="record-all"):
    email: EmailStr
    name: str
    password: SecretStr
    signup_ts: datetime


@app.post("/signup/")
def signup(signup_form: Signup):
    # signups are recorded automatically upon validation
    ...

The idea is that you could instrument your application with no code changes, e.g. you could say "record all validations", or whitelist the models you want to record. In addition, fields with Secret* types could be obfuscated automatically, etc.

The analytics config key on models might have the following choices:

  • False — don't record validations using this model
  • 'record-validation-errors' — record validation errors, but not the data
  • 'record-all' — record both validation errors and the data from successful validations
  • 'record-metrics' — record only the timestamp and status of validations
  • omitted — use whatever default is set
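
As a rough sketch of how these options might combine, you could set an account-wide default and then override it per model. The analytics class keyword is the hypothetical setting shown above, and set_default_analytics is an assumed helper; neither is a committed API.

from pydantic import BaseModel, SecretStr

import pydantic_analytics  # name TBD, as above

# assumed helper: the default used by models that omit the `analytics` keyword
pydantic_analytics.set_default_analytics("record-metrics")

class HealthCheck(BaseModel, analytics=False):
    # high-volume, low-value validations: don't record these at all
    status: str

class PasswordReset(BaseModel, analytics="record-validation-errors"):
    # record failures for debugging, but never store the submitted data
    token: SecretStr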

Entities

We would allow you to define entities (they might be "Customers" or "Servers", "Background Tasks" or "Sales Prospects"), then update those entities as easily as you'd submit a new log message. As an example, you could imagine a Customer entity with a last_login field that is updated every time a customer-login user_id=xxx log message is received.

We'd also allow you to link between the entities using their existing unique IDs.

This would allow the Pydantic Data Platform to be used as an admin view of your application data as well as a logging or BI tool.
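
To make the Customer example above concrete, here's a rough sketch of what defining and updating an entity might look like; entity, key and the keyword arguments to log are illustrative names only, not a committed API.

from datetime import datetime

from pydantic import BaseModel
from pydantic_analytics import entity, log  # names TBD

# hypothetical decorator: `key` is the unique ID used to match log messages to entities
@entity(key="user_id")
class Customer(BaseModel):
    user_id: int
    email: str
    last_login: datetime | None = None

def record_login(user_id: int) -> None:
    # this log message would update `last_login` on the matching Customer entity
    log("customer-login user_id={user_id}", user_id=user_id, last_login=datetime.now())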

Logging data from other sources

While OpenTelemetry has its deficiencies, it should allow us to receive data from many other technologies without the need to build integrations for every one.
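
As a sketch of how that could work, the snippet below uses the existing OpenTelemetry Python SDK to export spans over OTLP/HTTP to an assumed collector endpoint (the URL is illustrative); other languages would use their own OpenTelemetry SDKs in the same way.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# the endpoint is an assumption; any OTLP/HTTP-compatible collector would work
exporter = OTLPSpanExporter(endpoint="https://otel.example.com/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("transaction-complete") as span:
    span.set_attribute("amount", 42.0)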

In addition, we will build a first-class API (OpenAPI documentation, nice error messages, all the stuff that you've come to love from FastAPI) to make direct integrations and SDKs in other languages easy to develop.


2. Data Gateway for object stores {#3-2-data-gateway-for-object-stores}

Add validation, transformation and cataloging in front of object stores like S3.

The idea is that we would bring the same declarative schema enforcement and cataloging that has made Pydantic and FastAPI so successful to other places. Putting a data validation and schema cataloging layer in front of data storage seems like a natural place for validation as a service.

Out of the box, S3 is a key-value store; you can't enforce that blobs under a certain path all conform to a specific schema, or even that they are all JSON files. The S3 console organizes keys into a folder structure based on delimiters, but when you navigate to a "folder" you know nothing about the data without opening samples or reviewing the source code that produced them.

Our Solution

We will build a scalable, performant service that sits between clients and the data sink — a data validation reverse proxy or gateway.

Schemas would be defined via Pydantic models, but the service would provide a number of features you don't get from Pydantic alone:

  • Many input formats would be supported (JSON, CSV, Excel, Parquet, Avro, msgpack, etc), with automatic conversion to the storage format
  • Multiple storage formats would be supported — at least JSON and Parquet. Delta Lake tables and Iceberg tables might come later
  • Multiple interfaces to upload and download data: HTTP API, S3-compliant API (so you can continue to use aws s3 cp etc.), Python SDK/ORM
  • Arbitrary binary data formats (images, videos, documents) would be supported; validations would include checking formats, resolution of images and videos, etc.
  • Over time we'll add features that take advantage of our knowledge of the schemas to improve costs/performance and the overall developer experience of S3.
  • Optionally, successful and failed uploads could be logged to the logging/analytics service described above

Because the validation and transformation would be implemented in Rust, and each process can perform validation for many different schemas, we will be able to provide this service at a significantly lower cost than running a Python service using Pydantic for validation.

Example

Let's say we want to upload data from user inquiries to a specific prefix and store it as JSON. This might look something like this:

Pydantic Gateway — JSON Dataset
from datetime import datetime

from pydantic import BaseModel, EmailStr
from pydantic_gateway import JsonDataset  # name TBD

class CustomerInquiry(BaseModel):
    email: EmailStr
    name: str
    inquiry_ts: datetime

dataset = JsonDataset("my-bucket/inquiries/", CustomerInquiry)
dataset.register()
upload = dataset.create_upload_url(expiration=3_600)
print(upload.url)
#> https://gateway.pydantic.dev/<account-id>/<uuid>/?key=e3ca0f89etc

# validation directly from Python
# (note: validation would be run on the server, not locally)
dataset.upload({'email': 'jane.doe@example.com', ...})

Data about inquiries could then be added via another service, the equivalent of:

$ curl -X POST \
    -d '{"email": "jane.doe@example.com"}' \
    'https://gateway.pydantic.dev/<account-id>/<uuid>/?key=e3ca0f89etc'

Or using awscli, and this time uploading multiple inquiries from a CSV file:

$ aws s3 cp \
    --endpoint-url https://<account-id>.s3.pydantic.dev \
    inquiries.csv s3://<uuid>

The data could then be read from Python:

print(dataset.read(limit=100))
#> [CustomerInquiry(email='jane.doe@example.com', ...), ...]

The power here is that if the service submitting customer inquiries makes an error in the schema, the data is rejected or stored in a quarantine space for later review.

One of the most powerful tools that S3 provides is pre-signed URLs.

We'll provide support for pre-signed URLs, and even expand it to creating upload endpoints for entire datasets. That means you'll be able to generate a pre-authorized URL that still enforces all the data validation.

from pydantic_gateway import FileDataset, image_schema

profile_picture = image_schema(
    output_width=100,
    output_height=100,
    output_max_size=1_000,
    output_formats="png",
)

dataset = FileDataset(
    "my-bucket/profile-pics/",
    profile_picture
)
file_upload = dataset.create_file_upload_url(
    "john-doe.jpg",
    expiration=60
)
print(file_upload.url)  # return the pre-signed URL to the client
#> https://gateway.pydantic.dev/<account-id>/<uuid>/upload?path=/users/1.jpg&key=e3ca0f89etc
# wait for a client to upload the data before updating the user's current picture
await file_upload.completed()
file_contents = await file_upload.download()
...

3. Data Gateway for data warehouses {#3-3-data-gateway-for-data-warehouses}

The components described above are useful well beyond object stores like S3.

We will also provide a similar service for data warehouses like Snowflake and BigQuery.

Basically, you give us a Pydantic model and a table identifier, and we give you back an endpoint that you can POST to; after validation, the data will be inserted as rows in the table.

from pydantic_gateway import BigQueryTable

dataset = BigQueryTable(
    "my-project.my-dataset.inquiries",
    CustomerInquiry
)
upload = dataset.create_upload_url(
    expiration=3_600,
)
print(upload.url)
#> https://gateway.pydantic.dev/<account-id>/<uuid>/upload?key=e3ca0f89etc

While there's value in providing validation on the front of a data warehouse, we know from talking to lots of teams about how they configure and operate their data warehouses and pipelines that this is only one of the challenges.

Longer term, we see significant value in providing a declarative way to define the transformations that should be applied to data as it moves within the data warehouse.
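
As a very rough sketch of what that could look like (every name below, including transform, is hypothetical and not a committed API), a declarative transformation might pair a source table, a target table and a plain Python function between two Pydantic models.

from datetime import date

from pydantic import BaseModel
from pydantic_gateway import transform  # hypothetical name, API TBD

class DailyInquiryCount(BaseModel):
    day: date
    count: int

# hypothetical: run inside the warehouse on a schedule, writing rows to the target table
# CustomerInquiry is the model defined in the earlier examples
@transform(
    source="my-project.my-dataset.inquiries",
    target="my-project.my-dataset.daily_inquiry_counts",
)
def daily_counts(inquiries: list[CustomerInquiry]) -> list[DailyInquiryCount]:
    ...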


4. Schema Catalog {#3-4-schema-catalog}

One of the things we often hear when talking to engineers about how their organization uses Pydantic is that their highest-fidelity schemas, across all their tools, are often Pydantic models.

The problem is that, in contrast with the centralized nature of a relational database schema, these models are often scattered across multiple repositories. There's no easy way to find and use them, let alone keep track of changes in these schemas and how these changes might impact different parts of your organization.

Pydantic Schema Catalog would give you a single place to view schemas, including:

  • The Pydantic model code
  • Swagger/Redoc documentation based on the JSON Schema — this provides a non-technical view of a schema to aid those who aren't (yet) comfortable reading Python
  • Data sources which use this schema
  • Links between this schema and other schemas, e.g. via foreign keys
  • Changes to the schema over time, including whether the change is backwards compatible and what migration logic is required
  • Together with the above components (Observability and Data Gateway), you could go straight from a schema to data stored with that schema, or vice versa

We will provide a web interface to view and manage schemas as well as a CLI to interact with the Schema Catalog.

Schema Catalog — Example
from datetime import datetime

from pydantic import BaseModel, EmailStr

import pydantic_schema_catalogue  # name TBD

class CustomerInquiry(BaseModel):
    email: EmailStr
    name: str
    inquiry_ts: datetime

pydantic_schema_catalogue.register(CustomerInquiry)

Later in another project...

$ pydantic-schema-catalogue list
...
# download the schema `User` as a Pydantic model
$ pydantic-schema-catalogue get --format pydantic User > user_model.py
# download the schema `User` as Postgres SQL to create a table
$ pydantic-schema-catalogue get --format postgres User >> tables.sql

The Schema Catalog would integrate closely with other components described above:

  • schemas of logged models could be automatically registered in the Schema Catalog
  • a schema in the Schema Catalog could be used to create a validation endpoint with one or two clicks or a single CLI command

Schema Inference

All too often, you have data without a schema, and reverse engineering a comprehensive schema is a painful, manual process.

Pydantic Schema Catalog would provide a way to infer a schema from a dataset, allowing you to initialize a new schema from a sample of data.

$ pydantic-schema-catalogue infer --name 'Customer Inquiry' inquiries.csv
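
For a file like the inquiries.csv used above, the inferred model might look roughly like this (the exact output is illustrative):

from datetime import datetime

from pydantic import BaseModel, EmailStr

class CustomerInquiry(BaseModel):
    """Inferred by pydantic-schema-catalogue from inquiries.csv."""
    email: EmailStr
    name: str
    inquiry_ts: datetime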

5. Dashboards and UI powered by Pydantic models {#3-5-dashboards-and-ui-powered-by-pydantic-models}

One of the major goals of data collection is to derive insights from and make decisions based on the collected data.

But often the person responsible for those insights or decisions is outside your engineering team, sometimes outside your organization altogether.

So a flawlessly executing data pipeline populating your data warehouse isn't enough. You need a way to help the rest of your organization visualize and interact with the data.

But since this data visualization is often not your core business, you don't want to spend a week or a month(!) building a dashboard, or have to maintain and extend it going forward.

Pydantic Dashboards will allow you to build UIs powered by Pydantic models and Python code in minutes. We would take care of the hosting, scaling, and maintenance, as well as enforcing authentication.

Pydantic Dashboards would provide all the common UI components (tables, pagination, forms, charts) and use the ORM we build on top of the above components to provide a simple, but powerful, way to interact with your data.

Below is an example of how this might look, taking the "Customer Inquiries" example from above.

Pydantic Dashboard — Customer Inquiries
from datetime import datetime
from typing import Literal

from pydantic_dash import app, components
from pydantic_dash.auth import GoogleAuth
from pydantic_dash.responses import Redirect

from pydantic_db import DatabaseModel, DayAgg
from pydantic import EmailStr

app.auth(GoogleAuth(domain='my-company.com'))

class CustomerInquiry(DatabaseModel):
    email: EmailStr
    name: str
    inquiry_ts: datetime
    source: Literal['website', 'email', 'phone']

@app.view('/', buttons={
    '/new/': 'Create New',
    '/list/': 'View All'
})
def inquiries_analytics():
    # query the database for recent inquiries
    recent_inquiries = (
        CustomerInquiry.db_all()
        .order_by('inquiry_ts', desc=True)
        .limit(1_000)
    )
    # return two components: a pie chart and a bar chart
    return components.Row(
        # charts are designed alongside the ORM,
        # to use sensible defaults, e.g. here the charts
        # infer the `.count()` implicit in the `groupby`
        components.PieChart(
            recent_inquiries.groupby('source'),
            title='Inquiry Sources',
        ),
        components.BarChart(
            recent_inquiries.groupby(DayAgg('inquiry_ts')),
            x_label='Date',
            y_label='Inquiry Count',
        ),
    )

# using a list_view here means the query returned is automatically
# rendered as a table and paginated
@app.list_view('/list/', buttons={'/new/': 'Create New'})
def inquiries_list():
    return CustomerInquiry.db_all().order_by('inquiry_ts', desc=True)

# form_view provides both GET and POST form endpoints
# the GET view renders an HTML form based on the `CustomerInquiry` model
# the POST view validates the request with the `CustomerInquiry` model,
# then calls the function
@app.form_view('/new/', CustomerInquiry)
def new_inquiry(inquiry: CustomerInquiry):
    inq_id = inquiry.db_save()
    return Redirect(f'/{inq_id}/')

This tool is not intended to replace the UI in your organization's main products, but there are many places in companies big and small where a managed, batteries-included tool like this could cut the time required to build a dashboard or simple app from days to minutes.


Thanks for reading! We would really appreciate your feedback on these ideas. Please complete the survey or email [email protected] with your thoughts.