Reproducible Data Science that Scales!

Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.

Version control for data

Pachyderm version-controls data, much as Git does for code. You can track the state of your data over time, backtest models on historical data, share data with teammates, and revert to previous states of your data.


Language-agnostic data pipelines

Pachyderm lets you use the tools and frameworks you need, from bash scripts to TensorFlow. You declare what you want to run, and Pachyderm takes care of triggering, data sharding, parallelism, and resource management on the backend.
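To make "declare what you want to run" concrete, here is a minimal sketch in Python that assembles a pipeline description and prints it as JSON. The field names (pipeline, transform, input.pfs, parallelism_spec) follow the Pachyderm pipeline specification as commonly documented, but they have changed between releases, so treat them as assumptions and check the reference for your version. The pipeline name, repo name, container image, and script path are hypothetical.

```python
import json


def build_pipeline_spec(name, input_repo, image, cmd, workers=1):
    """Return a dict describing one pipeline stage.

    Field names are assumed from the Pachyderm pipeline spec and may
    differ in your release; verify before submitting to a cluster.
    """
    return {
        "pipeline": {"name": name},
        # The container image and command to run: any language or tool.
        "transform": {"image": image, "cmd": cmd},
        # Read from a versioned data repo; the glob pattern controls how
        # the input is split into independently processable pieces.
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        # Requested number of parallel workers.
        "parallelism_spec": {"constant": workers},
    }


# Hypothetical example values, for illustration only.
spec = build_pipeline_spec(
    name="word-count",
    input_repo="documents",
    image="python:3.11-slim",
    cmd=["python3", "/code/count.py"],
    workers=4,
)

print(json.dumps(spec, indent=2))
```

The point of the sketch is the shape of the declaration: it names the code to run and the data to read, and says nothing about scheduling or how work is distributed.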

Why Pachyderm?

Because data scientists should be able to focus on data science, not infrastructure.


Consistently recreate results from any previous state of your data or analysis.

Data Provenance

Understand every step of the process that produced a given result.


Manage shared data resources and work more effectively as a team.


Build on past results by processing only the new data, for maximum performance.
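The idea of processing only new data can be illustrated with a small, generic sketch. This is not Pachyderm's implementation; it is a plain Python illustration that assumes inputs are keyed by path and that a content hash is enough to detect change. The names process_incrementally and cache are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to detect whether an input has changed."""
    return hashlib.sha256(data).hexdigest()


def process_incrementally(files, cache, fn):
    """Apply fn to each file, reusing cached results for unchanged content.

    files: dict mapping path -> bytes
    cache: dict mapping path -> (content hash, result); updated in place
    Returns (results, recomputed), where recomputed lists the paths
    that actually had to be processed on this call.
    """
    results = {}
    recomputed = []
    for path, data in files.items():
        key = digest(data)
        if path in cache and cache[path][0] == key:
            # Unchanged input: reuse the earlier result.
            results[path] = cache[path][1]
        else:
            # New or modified input: process it and remember the result.
            results[path] = fn(data)
            cache[path] = (key, results[path])
            recomputed.append(path)
    return results, recomputed


def count_words(data: bytes) -> int:
    return len(data.split())


cache = {}
v1 = {"a.txt": b"one two", "b.txt": b"three"}
_, first = process_incrementally(v1, cache, count_words)

# A later version of the data adds one file; only that file is processed.
v2 = dict(v1)
v2["c.txt"] = b"four five six"
results, second = process_incrementally(v2, cache, count_words)
```

On the second call only c.txt is recomputed, while the results for a.txt and b.txt come from the cache.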

Data Scientist Autonomy

Maintain complete control of your data science toolchain choices.

Infrastructure Agnostic

Run in the cloud or on-premises and integrate easily with your current infrastructure.

