Garbage in →

Pydantic →

you're golden!

by

Samuel Colvin

PyData London - 2023/06/03

https://london2023.pydata.org/cfp/talk/SFZLT7/

Today

  • What is Pydantic, and why do people seem to like it?
  • Trouble in paradise
  • Rust to the rescue - Good, Bad, Ugly
  • Examples of how Rust helps Pydantic V2 solve your problems
  • Live demo!

Pydantic

from datetime import datetime
from pydantic import BaseModel


class Talk(BaseModel):
    title: str
    when: datetime | None = None
    mistakes: list[str]

Just type hints get you:

  • Validation
  • Coercion/transformation
  • Serialization
  • JSON Schema

You people seemed to like it:

  • 58m downloads/mo
  • used by all FAANG companies
  • 12% of pro web developers

30s - understand

3m - useful

300hr - usable

Empathy for the developers using our library

But there's a problem...

Pydantic V2

Priorities for V2:

  • Performance - it was good, but could be better - think of the penguins!
  • Strict Mode - live up to the name
  • Composability - you don't always want a model
  • Maintainability - I maintain Pydantic so I want maintaining Pydantic to be fun

Sad penguin, no snow

What would it look like if we started from scratch?

What about Rust?

The obvious advantages...

 

  • Performance
  • Multithreading - no GIL
  • Reusing high-quality Rust libraries
  • More explicit error handling

 

(maybe) Less obvious advantages:

  • Virtually zero cost customisation, even in hot code
  • Arguably easier to maintain - the compiler picks up more mistakes

Rust - the good

But perhaps most pertinent to Pydantic...

from pydantic import BaseModel

class Qualification(BaseModel):
    name: str
    description: str
    required: bool
    value: int


class Student(BaseModel):
    id: int
    name: str
    qualifications: list[Qualification]
    friends: list[int]
[
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
    ...,
]

Rust loves this

  • Deeply recursive code - no stack frames
  • Small modular components

How Rust?

What does that tree look like?

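from datetime import datetime, timedelta
from typing import Annotated

from annotated_types import Ge, MaxLen
from pydantic import BaseModel

# Assumed alias: PosInt is slide shorthand, defined here as "int >= 0"
# to match the 'ge': 0 constraint in the core schema shown later.
PosInt = Annotated[int, Ge(0)]

class Talk(BaseModel):
    title: Annotated[
        str,
        MaxLen(100)
    ]
    attendance: PosInt
    when: datetime | None = None
    mistakes: list[
        tuple[timedelta, str]
    ]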
ModelValidator {
  cls: Talk,
  validator: TypedDictValidator [
    Field {
      key: "title",
      validator: StrValidator { max_len: 100 },
    },
    Field {
      key: "attendance",
      validator: IntValidator { min: 0 },
    },
    Field {
      key: "when",
      validator: UnionValidator [
        DateTimeValidator {},
        NoneValidator {},
      ],
      default: None,
    },
    Field {
      key: "mistakes",
      validator: ListValidator {
        item_validator: TupleValidator [
          TimedeltaValidator {},
          StrValidator {},
        ],
      },
    },
  ],
}
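To make the tree concrete, here is a minimal pure-Python sketch of the same idea: small validators that each do one job and are composed into a tree. The class names mirror the slide, but this is an illustration only, not the pydantic-core implementation; the toy `TimedeltaValidator` assumes `HH:MM:SS` strings.

```python
from datetime import timedelta
from typing import Any, Optional


class StrValidator:
    def __init__(self, max_len: Optional[int] = None):
        self.max_len = max_len

    def validate(self, value: Any) -> str:
        if not isinstance(value, str):
            raise ValueError('expected str')
        if self.max_len is not None and len(value) > self.max_len:
            raise ValueError(f'longer than {self.max_len} characters')
        return value


class TimedeltaValidator:
    def validate(self, value: Any) -> timedelta:
        if isinstance(value, timedelta):
            return value
        # Toy parser (assumption): only 'HH:MM:SS' strings are handled.
        hours, minutes, seconds = (int(part) for part in value.split(':'))
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)


class TupleValidator:
    def __init__(self, *items):
        self.items = items  # one validator per tuple position

    def validate(self, value: Any) -> tuple:
        if len(value) != len(self.items):
            raise ValueError(f'expected {len(self.items)} items')
        return tuple(v.validate(x) for v, x in zip(self.items, value))


class ListValidator:
    def __init__(self, item):
        self.item = item  # a single validator applied to every element

    def validate(self, value: Any) -> list:
        return [self.item.validate(x) for x in value]


# The "mistakes" branch of the tree: list -> tuple -> (timedelta, str)
mistakes = ListValidator(TupleValidator(TimedeltaValidator(), StrValidator()))
result = mistakes.validate([('00:00:30', 'Forgot to turn on the mic')])
print(result)
#> [(datetime.timedelta(seconds=30), 'Forgot to turn on the mic')]
```

Each node only knows about its direct children, which is what keeps the components small and the recursion deep.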

Python Interface to Rust

from pydantic_core import SchemaValidator


class Talk:
    ...

talk_validator = SchemaValidator({
    'type': 'model',
    'cls': Talk,
    'schema': {
        'type': 'typed-dict',
        'fields': {
            'title': {'schema': {'type': 'str', 'max_length': 100}},
            'attendance': {'schema': {'type': 'int', 'ge': 0}},
            'when': {
                'schema': {
                    'type': 'default',
                    'schema': {'type': 'nullable', 'schema': {'type': 'datetime'}},
                    'default': None,
                }
            },
            'mistakes': {
                'schema': {
                    'type': 'list',
                    'items_schema': {
                        'type': 'tuple',
                        'mode': 'positional',
                        'items_schema': [{'type': 'timedelta'}, {'type': 'str'}]
                    }
                }
            },
        },
    }
})

some_data = {
    'title': "How Pydantic V2 leverages Rust's Superpowers",
    'attendance': '100',
    'when': '2023-04-22T12:15:00',
    'mistakes': [
        ('00:00:00', 'Screen mirroring confusion'),
        ('00:00:30', 'Forgot to turn on the mic'),
        ('00:25:00', 'Too short'),
        ('00:40:00', 'Too long!'),
    ],
}
talk = talk_validator.validate_python(some_data)
print(talk.mistakes)
"""
[
    (datetime.timedelta(0), 'Screen mirroring confusion'), 
    (datetime.timedelta(seconds=30), 'Forgot to turn on the mic'), 
    (datetime.timedelta(seconds=1500), 'Too short'), 
    (datetime.timedelta(seconds=2400), 'Too long!')
]
"""

Pydantic V2 Architecture

Read type hints

construct a "core schema"

pydantic

(pure python)

pydantic-core

(binary + stubs + core-schema)

process core schema

return SchemaValidator

Receive data

call schema_validator(data)

run validator

return the result of validation
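The two-phase flow above can be sketched in plain Python. This is a toy under stated assumptions: `build_schema` and `build_validator` are hypothetical names, only `int`, `float` and `str` fields are handled, and "validation" is just calling the built-in converter.

```python
from typing import Any, Callable, get_type_hints


class Point:
    x: int
    y: float


def build_schema(cls: type) -> dict:
    """Phase 1 (the role pydantic plays): type hints -> plain-data schema."""
    names = {int: 'int', float: 'float', str: 'str'}
    fields = {name: {'type': names[tp]} for name, tp in get_type_hints(cls).items()}
    return {'type': 'typed-dict', 'fields': fields}


def build_validator(schema: dict) -> Callable[[Any], Any]:
    """Phase 2 (the role pydantic-core plays): schema -> reusable validator."""
    if schema['type'] == 'typed-dict':
        fields = {k: build_validator(s) for k, s in schema['fields'].items()}
        return lambda data: {k: v(data[k]) for k, v in fields.items()}
    converters = {'int': int, 'float': float, 'str': str}
    return converters[schema['type']]


schema = build_schema(Point)         # done once, at class-definition time
validate = build_validator(schema)   # done once
print(validate({'x': '1', 'y': 2}))  # done for every piece of data
#> {'x': 1, 'y': 2.0}
```

The point of the split is that all the introspection cost is paid once; the hot path is only the final call.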

Rust - the bad

from __future__ import annotations
from pydantic import BaseModel


class Foo(BaseModel):
    a: int
    f: list[Foo]


f = {'a': 1, 'f': []}
f['f'].append(f)
Foo(**f)
fn main() {
    main();
}

RecursionError is bad, but no RecursionError is worse!

Also no multiple ownership.
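A self-referencing input like the one above can be caught before the stack runs out. The sketch below shows one way to do it in plain Python: remember the `id()` of every container currently being validated and refuse to enter the same one twice. The function name and the guard keyed on `id()` are assumptions for illustration, not a description of pydantic-core's internals.

```python
from typing import Optional, Set


def validate_branch(data: dict, _active: Optional[Set[int]] = None) -> dict:
    active = set() if _active is None else _active
    if id(data) in active:
        # We are already inside this exact object: the input is cyclic.
        raise ValueError('Recursion error - cyclic reference detected')
    active.add(id(data))
    try:
        return {
            'length': float(data['length']),
            'branches': [validate_branch(b, active) for b in data.get('branches', [])],
        }
    finally:
        # Leaving the node: the same object may legitimately appear again elsewhere.
        active.discard(id(data))


b = {'length': 1, 'branches': []}
b['branches'].append(b)
try:
    validate_branch(b)
except ValueError as e:
    print(e)
    #> Recursion error - cyclic reference detected
```

Removing the id on the way out means a node that is merely shared (used twice, but not inside itself) still validates.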

Rust - the ugly

class Box:
    def __init__(self, width):
        self.width = width

    def area(self):
        return self.width ** 2

    def __str__(self):
        return f'Box: {self.width}'

box = Box(42)
print(f'{box}, area {box.area()}')
use std::fmt;

struct Box {
    width: i64,
}

impl fmt::Display for Box {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Box: {}", self.width)
    }
}

impl Box {
    fn new(width: i64) -> Self {
        Self { width }
    }

    fn area(&self) -> i64 {
        self.width * self.width
    }
}

fn main() {
    let b = Box::new(42);
    println!("{b}, area {}", b.area());
}

Rust is significantly more verbose.

Pydantic V2

Examples

Performance

import timeit
from pydantic import BaseModel, __version__

class Model(BaseModel):
    name: str
    age: int
    friends: list[int]
    settings: dict[str, float]

data = {
    'name': 'John',
    'age': 42,
    'friends': list(range(200)),
    'settings': {f'v_{i}': i / 2.0 for i in range(50)}
}
t = timeit.timeit(
    'Model(**data)',
    globals={'data': data, 'Model': Model},
    number=10_000,
)
print(f'version={__version__} time taken {t * 100:.2f}us')
version=1.10.4 time taken 179.81us
version=2.0a3  time taken   7.99us

22.5x speedup
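For reference, the arithmetic behind those numbers: `timeit` returns the total seconds for all `number` runs, so with 10,000 runs the per-call time in microseconds is `t / 10_000 * 1_000_000`, which is `t * 100`. The helper name below is hypothetical.

```python
def per_call_us(total_seconds: float, number: int) -> float:
    """Convert a timeit total into microseconds per call."""
    return total_seconds / number * 1_000_000


v1_us = 179.81  # pydantic 1.10.4, from the output above
v2_us = 7.99    # pydantic 2.0a3, from the output above
speedup = v1_us / v2_us
print(f'{speedup:.1f}x')
#> 22.5x
```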

Strict Mode

from pydantic import BaseModel, ValidationError

class Model(BaseModel):
    model_config = dict(strict=True)
    
    age: int
    friends: tuple[int, int]

try:
    Model(age='42', friends=[1, 2])
except ValidationError as e:
    print(e)
    """
    2 validation errors for Model
    age
      Input should be a valid integer [type=int_type, 
        input_value='42', input_type=str]
    friends
      Input should be a valid tuple [type=tuple_type, 
        input_value=[1, 2], input_type=list]
    """

print(Model(age=42, friends=(1, 2)))
#> age=42 friends=(1, 2)

AKA Pedant mode.
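The difference between the two modes can be sketched for a single type. This is a toy `validate_int`, not Pydantic's rules: the lax branch only coerces strings, and excluding `bool` is a choice made for this example.

```python
from typing import Any


def validate_int(value: Any, strict: bool = False) -> int:
    # bool is a subclass of int in Python; this toy rejects it in both modes.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    if strict:
        raise ValueError(f'Input should be a valid integer, got {type(value).__name__}')
    if isinstance(value, str):
        return int(value)  # raises ValueError for non-numeric strings
    raise ValueError(f'cannot coerce {type(value).__name__} to int')


print(validate_int('42'))
#> 42
print(validate_int(42, strict=True))
#> 42
try:
    validate_int('42', strict=True)
except ValueError as e:
    print(e)
    #> Input should be a valid integer, got str
```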

Builtin JSON parsing

from pydantic import BaseModel

class Model(BaseModel):
    model_config = dict(strict=True)
    age: int
    friends: tuple[int, int]

print(Model.model_validate_json('{"age": 1, "friends": [1, 2]}'))
#> age=1 friends=(1, 2)

If you're going to be a pedant, you better be right.

 

Also gives us:

  • Big performance improvement without 3rd party parsing library
  • Custom Errors (WIP)
  • Line numbers in errors (in future)

Wrap Validators

from pydantic import BaseModel, field_validator

class Model(BaseModel):
    x: int

    @field_validator('x', mode='wrap')
    def validate_x(cls, v, handler):
        if v == 'one':
            return 1

        try:
            return handler(v)
        except ValueError:
            return -999

print(Model(x='one'))
#> x=1
print(Model(x=2))
#> x=2
print(Model(x='three'))
#> x=-999
  • Logic before
  • Logic after
  • Catch errors - new error, or default

AKA "The Onion"
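Why "onion"? Each wrap layer receives the value plus a `handler` for everything inside it, so layers nest. A minimal sketch in plain Python, with hypothetical names (`wrap`, `inner_int`, `double`) and no Pydantic involved:

```python
from typing import Any, Callable

Handler = Callable[[Any], Any]


def inner_int(value: Any) -> int:
    """The innermost layer: the plain validator."""
    return int(value)


def wrap(outer: Callable[[Any, Handler], Any], handler: Handler) -> Handler:
    """Put one more layer around an existing handler."""
    return lambda value: outer(value, handler)


def words_then_default(value, handler):
    if value == 'one':        # logic before
        return 1
    try:
        return handler(value)
    except ValueError:        # catch errors, fall back to a default
        return -999


def double(value, handler):
    return handler(value) * 2  # logic after


validate = wrap(double, wrap(words_then_default, inner_int))
print(validate('one'), validate('2'), validate('three'))
#> 2 4 -1998
```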

Recursive Models

from __future__ import annotations
from pydantic import BaseModel, Field, ValidationError

class Branch(BaseModel):
    length: float
    branches: list[Branch] = Field(default_factory=list)

print(Branch(length=1, branches=[{'length': 2}]))
#> length=1.0 branches=[Branch(length=2.0, branches=[])]

b = {'length': 1, 'branches': []}
b['branches'].append(b)

try:
    Branch.model_validate(b)
except ValidationError as e:
    print(e)
    """
    1 validation error for Branch
    branches.0
      Recursion error - cyclic reference detected 
        [type=recursion_loop, 
         input_value={'length': 1, 'branches': [{...}]}, 
         input_type=dict]
    """

Alias Paths

from pydantic import BaseModel, Field, AliasPath, AliasChoices


class MyModel(BaseModel):
    a: int = Field(validation_alias=AliasPath('foo', 1, 'bar'))
    b: str = Field(validation_alias=AliasChoices('x', 'y'))


m = MyModel.model_validate(
    {
        'foo': [{'bar': 0}, {'bar': 1}],
        'y': 'Y',
    }
)
print(m)
#> a=1 b='Y'

Generics

from typing import Generic, TypeVar

from pydantic import BaseModel

DataT = TypeVar('DataT')

class Response(BaseModel, Generic[DataT]):
    error: int | None = None
    data: DataT | None = None

class Profile(BaseModel):
    name: str
    email: str

def my_profile_view(id: int) -> Response[Profile]:
    if id == 42:
        return Response[Profile](data={'name': 'John', 'email': 'john@example.com'})
    else:
        return Response[Profile](error=404)

print(my_profile_view(42))
#> error=None data=Profile(name='John', email='john@example.com')
Favorite = tuple[int, str]

def my_favorites_view() -> Response[list[Favorite]]:
    return Response[list[Favorite]](data=[(1, 'a'), (2, 'b')])

Serialisation

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

class Profile(BaseModel):
    account_id: int
    user: User

user = User(name='Alice', age=1)
print(Profile(account_id=1, user=user).model_dump())
#> {'account_id': 1, 'user': {'name': 'Alice', 'age': 1}}

class AuthUser(User):
    password: str

auth_user = AuthUser(name='Bob', age=2, password='very secret')
print(Profile(account_id=2, user=auth_user).model_dump())
#> {'account_id': 2, 'user': {'name': 'Bob', 'age': 2}}

Solving the "don't ask the type" problem.
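The idea is to serialise according to the declared field type rather than the runtime type of the value. A minimal sketch with standard-library dataclasses; the `dump` helper is hypothetical and only handles dataclasses and plain values.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, get_type_hints


@dataclass
class User:
    name: str
    age: int


@dataclass
class AuthUser(User):
    password: str


@dataclass
class Profile:
    account_id: int
    user: User


def dump(value: Any, declared: type) -> Any:
    """Serialise `value` using the fields of `declared`, not of type(value)."""
    if not is_dataclass(declared):
        return value
    hints = get_type_hints(declared)
    return {f.name: dump(getattr(value, f.name), hints[f.name]) for f in fields(declared)}


profile = Profile(account_id=2, user=AuthUser(name='Bob', age=2, password='very secret'))
print(dump(profile, Profile))
#> {'account_id': 2, 'user': {'name': 'Bob', 'age': 2}}
```

Because `Profile.user` is declared as `User`, the subclass-only `password` field never reaches the output.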

Without BaseModel

from dataclasses import dataclass
from pydantic import TypeAdapter

@dataclass
class Foo:
    a: int
    b: int

@dataclass
class Bar:
    c: int
    d: int

x = TypeAdapter(Foo | Bar)
d = x.validate_json('{"a": 1, "b": 2}')

print(d)
#> Foo(a=1, b=2)

print(x.dump_json(d))
#> b'{"a":1,"b":2}'

BaseModel is still here and widely used, but no longer essential.

Enter TypeAdapter.
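As a rough mental model of validating a union of plain dataclasses, the sketch below parses the JSON and tries each member in turn. `validate_union_json` is a hypothetical helper; it only checks that the keys fit and does no per-field validation, so the real logic is considerably more involved.

```python
import json
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class Foo:
    a: int
    b: int


@dataclass
class Bar:
    c: int
    d: int


def validate_union_json(raw: str, members: Sequence[type]) -> Any:
    data = json.loads(raw)
    for cls in members:
        try:
            return cls(**data)  # TypeError if the keys do not match the fields
        except TypeError:
            continue
    raise ValueError('no union member matched')


print(validate_union_json('{"a": 1, "b": 2}', (Foo, Bar)))
#> Foo(a=1, b=2)
print(validate_union_json('{"c": 1, "d": 2}', (Foo, Bar)))
#> Bar(c=1, d=2)
```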

Demo

  • Needed to move off Google Analytics
  • Record page views without a cookie
  • Store in MongoDB
  • End up with a big JSON file to analyse
  • Want to see which pages are viewed most

Thank you

Twitter: @pydantic & @samuel_colvin

GitHub: /pydantic & /samuelcolvin

Docs: docs.pydantic.dev

We need your help:

  • Try pydantic V2 beta before we release V2!
  • Applications using Pydantic - come talk to me
  • Are you using Pydantic to process lots of data? If so, we'd love to chat to you about the commercial platform we're building

Not Rust vs. Python

But rather: Python as the user* interface for Rust.

(* by user, I mean "application developer")

 

I'd love to see a generation of libraries for Python (and other high level languages) built in Rust.

Rust

TLS

Routing

HTTP parsing

Validation

DB query

Serializing

Rust/C

Python

Application Logic

HTTPS request lifecycle:

100% of Developer time

=

1% of CPU cycles

...

Ok, some actual Rust...

Pydantic V2

#[enum_dispatch(CombinedValidator)]
trait Validator {
    const EXPECTED_TYPE: &'static str;

    fn build(schema: &PyDict, config: Option<&PyDict>) -> PyResult<CombinedValidator>;

    fn validate(&self, input: &impl Input, extra: &Extra) -> ValResult<PyObject>;
}

#[enum_dispatch]
enum CombinedValidator {
    Int(IntValidator),
    Str(StrValidator),
    TypedDict(TypedDictValidator),
    Union(UnionValidator),
    TaggedUnion(TaggedUnionValidator),
    Nullable(NullableValidator),
    // ... and 43 more
}

fn build_validator(schema: &PyDict, config: Option<&PyDict>) -> PyResult<CombinedValidator> {
    let schema_type: &str = schema.get_as_req("type")?;
    // really this is a clever macro to avoid the duplication
    match schema_type {
        IntValidator::EXPECTED_TYPE => IntValidator::build(schema, config),
        StrValidator::EXPECTED_TYPE => StrValidator::build(schema, config),
        TypedDictValidator::EXPECTED_TYPE => TypedDictValidator::build(schema, config),
        UnionValidator::EXPECTED_TYPE => UnionValidator::build(schema, config),
        TaggedUnionValidator::EXPECTED_TYPE => TaggedUnionValidator::build(schema, config),
        NullableValidator::EXPECTED_TYPE => NullableValidator::build(schema, config),
        // ... and 43 more
    }
}

trait Input<'a> {
    fn is_none(&self) -> bool;

    fn strict_str(&'a self) -> ValResult<&'a str>;

    fn lax_str(&'a self) -> ValResult<&'a str>;

    fn validate_date(&self, strict: bool) -> ValResult<PyDatetime>;

    fn strict_date(&self) -> ValResult<PyDatetime>;

    // ... and 53 more
}

impl<'a> Input<'a> for PyAny {
    // ...
}

impl<'a> Input<'a> for JsonInput {
    // ...
}

#[pyclass]
struct SchemaValidator {
    validator: CombinedValidator,
}

#[pymethods]
impl SchemaValidator {
    #[new]
    fn py_new(schema: &PyDict, config: Option<&PyDict>) -> PyResult<Self> {
        // We also do magic/evil schema validation using pydantic-core itself
        let validator = build_validator(schema, config)?;
        Ok(SchemaValidator { validator })
    }

    fn validate_python(&self, input: &PyAny, strict: Option<bool>) -> PyResult<PyObject> {
        self.validator.validate(input, &Extra::new(strict))
    }

    fn validate_json(
        &self,
        input_string: &PyString,
        strict: Option<bool>,
    ) -> PyResult<PyObject> {
        let input = parse_string(input_string)?;
        self.validator.validate(&input, &Extra::new(strict))
    }
}
