Get super fast MLX models with vMLX, ideal for startup founders using MCP servers, built with Python.
642 stars · 70 forks · Python · Quality 8/10 · Updated 6/11/2026 · 100% free · open source
What it does
vMLX is a Python tool for serving optimized, compressed models. It combines an L2 disk cache, an L1 paged cache, and a hybrid scheduler for fast, efficient model deployment.
Install / run
git clone https://github.com/jjang-ai/vmlx && cd vmlx
When to use it
• When you need to deploy large machine learning models and want to reduce memory usage
• When you require fast model serving and low-latency inference
• When you want model serving to survive interruptions, such as restarts, by using a disk cache
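To make the restart-survival idea concrete, here is a minimal conceptual sketch of a two-tier cache: a small in-memory tier (L1) backed by files on disk (L2). It is not vMLX's implementation. The class name `TwoTierCache`, the one-pickle-file-per-key layout, and the plain LRU eviction (rather than a paged cache) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Illustrative only: in-memory LRU (L1) backed by files on disk (L2)."""

    def __init__(self, cache_dir, l1_capacity=2):
        self.cache_dir = cache_dir
        self.l1_capacity = l1_capacity
        self.l1 = OrderedDict()  # key -> value, most recently used last
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Hypothetical layout: one pickle file per key (keys must be filename-safe).
        return os.path.join(self.cache_dir, f"{key}.pkl")

    def _remember(self, key, value):
        # Keep the entry in memory and evict the least recently used if over capacity.
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evicts from memory only, disk copy stays

    def put(self, key, value):
        # Write through to disk so the entry survives a process restart.
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)
        self._remember(key, value)

    def get(self, key):
        if key in self.l1:  # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        path = self._path(key)
        if os.path.exists(path):  # L2 hit: load from disk and promote to L1
            with open(path, "rb") as f:
                value = pickle.load(f)
            self._remember(key, value)
            return value
        return None  # miss in both tiers


cache_dir = tempfile.mkdtemp()
cache = TwoTierCache(cache_dir)
cache.put("prompt-a", [1, 2, 3])

restarted = TwoTierCache(cache_dir)  # simulates a fresh process with empty memory
print(restarted.get("prompt-a"))  # [1, 2, 3], served from the disk tier
```

The write-through `put` is the design choice that matters here: because every entry reaches disk immediately, losing the in-memory tier on restart costs only a slower first read.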
Quick start
1. Clone the vMLX repository and navigate to the project directory
2. Install the required dependencies with `pip install -r requirements.txt`
3. Configure the model serving settings in the `config.json` file
4. Start the vMLX server with `python app.py`
5. Test the model serving with a sample request using `curl` or a tool like Postman
Ready-to-paste prompt
curl -X POST -H 'Content-Type: application/json' -d '@input.json' http://localhost:8000/predict
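The same request can also be sent from Python using only the standard library. This is a sketch, not part of vMLX: the URL and `Content-Type` header come from the curl command above, while the helper names `build_request` and `send`, the `prompt` field in the usage line, and the assumption that the server replies with JSON are all hypothetical.

```python
import json
import urllib.request

# URL taken from the curl command; adjust if your server listens elsewhere.
PREDICT_URL = "http://localhost:8000/predict"


def build_request(payload, url=PREDICT_URL):
    """Build a JSON POST request equivalent to the curl command."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(payload, url=PREDICT_URL, timeout=30):
    """Send the request and decode the reply (assumes a JSON response body)."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, a call such as `send({"prompt": "Hello"})` would post the payload; replace the placeholder field with whatever your `input.json` contains.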
Heads up: Make sure you have a compatible Python version installed. vMLX has specific dependencies and may not work with every Python version.