Scalable Objects Persistence
Scalable Objects Persistence (SOP) is a high-performance, transactional storage engine for Python, powered by a robust Go backend. It combines the raw speed of direct disk I/O with the reliability of ACID transactions and the flexibility of modern AI data management.
SOP is designed for high-throughput, low-latency scenarios, making it suitable for “Big Data” management on commodity hardware.
By embedding frequently filtered fields (e.g., IsDeleted, LastUpdated, Category) directly into the Key struct while excluding them from the index (using IndexSpecification), you can scan millions of keys per second to filter data. This avoids the I/O penalty of fetching the full Value (which might be a large JSON blob or binary file) just to check a status flag.

The SOP AI Kit transforms SOP from a storage engine into a complete AI data platform.
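The idea can be illustrated with a toy in-memory sketch (plain Python, not the SOP API; the key fields and blob paths are hypothetical): the filterable flags live in the small key, so a scan never loads the large values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Key:
    category: str
    is_deleted: bool
    doc_id: int

# Toy index: keys are small and memory-resident; values are large blobs on disk.
index = {
    Key("orders", False, 1): "blobs/1.json",
    Key("orders", True, 2): "blobs/2.json",
    Key("logs", False, 3): "blobs/3.json",
}

# Filtering touches only key fields; no value blob is ever read.
live_orders = [k.doc_id for k in index
               if k.category == "orders" and not k.is_deleted]
print(live_orders)  # [1]
```

In SOP the same effect is achieved at scale because the B-Tree stores keys and values separately, and key pages are far cheaper to scan.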
Important: To use the AI Copilot features (e.g., in the Data Manager), you must configure your LLM API Key (e.g., Google Gemini). Set the SOP_LLM_API_KEY environment variable or add "llm_api_key" to your config.json. See the Main README for details.
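For example, on macOS/Linux you can export the variable in your shell before starting the server (the key value below is a placeholder):

```shell
export SOP_LLM_API_KEY="your-api-key"
```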
For comprehensive details on the SOP Platform Tools (Scripting, Explain Plans, Self-Correcting Agents), please see the Platform Tools Documentation.
See ai/README.md for a deep dive into the AI capabilities.
Install directly from PyPI:
pip install sop4py
SOP includes a powerful Data Management Suite that provides full CRUD capabilities for your B-Tree stores. It goes beyond simple viewing, offering a complete GUI for inspecting, searching, and managing your data at scale.
To launch the Data Manager, simply run:
sop-httpserver
Or download the all-in-one single-file installer (no Python/pip required) from SOP Releases.
The Data Manager supports complex and composite keys (e.g., Country + City).

Usage: By default, it opens on http://localhost:8080.
Arguments: You can pass standard flags, e.g., sop-httpserver -port 9090 -database ./my_data.
For managing multiple environments (e.g., Dev, Staging, Prod), create a config.json:
{
"port": 8080,
"databases": [
{
"name": "Local Development",
"path": "./data/dev_db",
"mode": "standalone"
},
{
"name": "Production Cluster",
"path": "/mnt/data/prod",
"mode": "clustered",
"redis": "redis-prod:6379"
}
],
"system_db": {
"name": "system",
"path": "./data/sop_system",
"mode": "standalone"
}
}
Note: This example shows the structure of system_db, but it is best to let the Data Manager Setup Wizard create and populate it automatically on first launch. The Wizard ensures that essential stores (like Script and llm_knowledge) are correctly initialized for the AI Copilot.
Run with: sop-httpserver -config config.json
If a database is configured in standalone mode, ensure that the HTTP server is the only process managing it. Alternatively, you can add the HTTP REST endpoint to your embedded/standalone app so it continues its normal function while also serving HTTP pages.
If clustered, no extra coordination is needed on your part: SOP handles Redis-based coordination with other apps and SOP HTTP Servers that manage the same databases in clustered mode.
The SOP Data Manager includes a built-in AI Copilot that allows you to interact with your data using natural language and automate workflows using Scripts.
Start the server:
sop-httpserver
Open your browser to http://localhost:8080 and click the AI Copilot floating widget.
You can ask the assistant to perform tasks or query data:
Scripts allow you to record a sequence of actions and replay them later. This is a “Natural Language Programming” system where the LLM compiles your intent into a high-performance script.
Step 1: Record
Type /script new <name> in the chat.
/script new daily_check
Step 2: Perform Actions. Interact with the AI naturally:
Check the 'logs' store for errors.
Count the number of active users.
Step 3: Stop. Save the script:
/script stop
Step 4: Replay. Execute the script instantly. The system runs the compiled steps without invoking the LLM again.
/script run daily_check
You can make scripts dynamic by using parameters.
/script run user_audit user_id=456
You can trigger these scripts from your Python code via the REST API. This is similar to calling a Stored Procedure where you pass the procedure name and arguments.
import requests
# Execute the 'user_audit' script with parameters
response = requests.post(
"http://localhost:8080/api/scripts/execute",
json={
"name": "user_audit",
"category": "general",
"args": {
"user_id": 999
}
}
)
# The response is a JSON list of execution steps and results
results = response.json()
for step in results:
if "final_output" in step:
print("Result:", step["final_output"])
To see the Data Management Suite in action, you can generate a sample database with complex keys using the included example script:
# If installed via pip
sop-demo run large_complex_demo
# Or manually if you have the source
python3 examples/large_complex_demo.py
This will create a database in data/large_complex_db with two stores: people (Complex Key) and products (Composite Key).
sop-httpserver -database data/large_complex_db
SOP comes with a bundled CLI tool sop-demo to easily list and run examples directly from your installation.
List available examples:
sop-demo list
Run a specific example:
sop-demo run vector_demo
Copy examples to your workspace: If you want to inspect the code or modify the examples, you can copy them to your local directory:
sop-demo copy
# Copies to ./sop_examples/
Manual Execution: If you have copied the examples locally, you can also run them using python directly:
python3 sop_examples/concurrent_demo.py
Concurrent Transactions (Standalone): This demo shows how to run concurrent transactions without a Redis dependency. It simulates real-world scenarios by introducing a small random sleep interval (jitter) between batch transactions to mimic network latency and reduce contention.
sop-demo run concurrent_demo_standalone
Concurrent Transactions (Clustered): This demo shows how to run concurrent transactions in a distributed environment (requires Redis). Similar to the standalone demo, it uses jitter to simulate realistic commit timing across different machines in a cluster.
sop-demo run concurrent_demo
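The jitter technique used by both concurrency demos can be sketched in plain Python (a toy illustration, not the demos' actual code): each worker sleeps a small random interval before committing, which spreads out lock acquisition and reduces contention.

```python
import random
import time

def commit_with_jitter(commit_fn, min_s=0.0, max_s=0.05):
    """Sleep a random interval (jitter) before committing to reduce contention."""
    time.sleep(random.uniform(min_s, max_s))
    return commit_fn()

# Simulate three batch commits, each preceded by a small random delay.
results = [commit_with_jitter(lambda i=i: f"batch {i} committed", max_s=0.01)
           for i in range(3)]
print(results)  # ['batch 0 committed', 'batch 1 committed', 'batch 2 committed']
```

In a real deployment the jitter bounds would be tuned to your commit rate; too much jitter adds latency, too little fails to break up lock convoys.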
Vector Search:
sop-demo run vector_demo
See the examples/ directory for more scripts.
If you are running from the source tree instead of a pip install, add the Python bindings to your PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:$(pwd)/jsondb/python
SOP uses a unified Database object to manage all types of stores (Vector, Model, and B-Tree). All operations are performed within a Transaction.
First, create a Context and open a Database connection.
from sop import Context, TransactionMode, TransactionOptions, Btree, BtreeOptions, Item
from sop.ai import Database, DatabaseType, Item as VectorItem
from sop.database import DatabaseOptions
# Initialize Context
ctx = Context()
# Open Database (Standalone Mode)
# This creates/opens a database at the specified path.
db = Database(DatabaseOptions(stores_folders=["data/my_db"], type=DatabaseType.Standalone))
# Open Database (Clustered Mode with Multi-Tenancy)
# Connects to a specific Cassandra Keyspace ("tenant_1").
# Requires Cassandra and Redis.
# db_clustered = Database(DatabaseOptions(stores_folders=["data/blobs"], keyspace="tenant_1", type=DatabaseType.Clustered))
All data operations (Create, Read, Update, Delete) must happen within a transaction.
# Begin a transaction (Read-Write)
# You can use 'with' block for auto-commit/rollback, or manage manually.
with db.begin_transaction(ctx) as tx:
# --- 3. Vector Store (AI) ---
# Open a Vector Store named "products"
vector_store = db.open_vector_store(ctx, tx, "products")
# Upsert a Vector Item
vector_store.upsert(ctx, VectorItem(
id="prod_101",
vector=[0.1, 0.5, 0.9],
payload={"name": "Laptop", "price": 999}
))
# --- 4. Model Store (AI) ---
# Open a Model Store named "classifiers"
model_store = db.open_model_store(ctx, tx, "classifiers")
# Save a Model
model_store.save(ctx, "churn", "v1.0", {
"algorithm": "random_forest",
"trees": 100
})
# --- 5. B-Tree Store (Key-Value) ---
# Open a B-Tree named "users"
# Use new_btree to create a new store, or open_btree for existing ones.
# BtreeOptions.name is optional if you pass the name directly to new_btree.
btree = db.new_btree(ctx, "users", tx)
# Add a Key-Value pair
btree.add(ctx, Item(key="user_123", value="John Doe"))
# Find a value
if btree.find(ctx, "user_123"):
# Fetch the value
items = btree.get_values(ctx, Item(key="user_123"))
if items and items[0].value:
print(f"Found User: {items[0].value}")
# --- 6. Complex Keys (Structs) ---
# Define a composite key using a dataclass
from dataclasses import dataclass
from sop.btree import IndexSpecification, IndexFieldSpecification
@dataclass
class EmployeeKey:
region: str
department: str
id: int
# Create B-Tree with custom index (Region -> Dept -> ID)
# This enables fast prefix scans (e.g., "Get all employees in US")
spec = IndexSpecification(index_fields=(
IndexFieldSpecification("region", ascending_sort_order=True),
IndexFieldSpecification("department", ascending_sort_order=True),
IndexFieldSpecification("id", ascending_sort_order=True)
))
# Pass spec as index_spec argument
employees = db.new_btree(ctx, "employees", tx, index_spec=spec)
# Add item with complex key
employees.add(ctx, Item(
key=EmployeeKey("US", "Sales", 101),
value={"name": "Alice"}
))
# --- 7. Simplified Lookup (Dictionary Keys) ---
# You can search for items using a plain dictionary, without needing the original dataclass.
# This is useful for consumer apps that just need to read data.
# Open existing B-Tree (no IndexSpec needed, it's loaded from disk)
employees_read = db.open_btree(ctx, "employees", tx)
# Search using a dict matching the key structure
if employees_read.find(ctx, {"region": "US", "department": "Sales", "id": 101}):
print("Found Alice!")
# --- 8. Text Search ---
# Open a Search Index
idx = db.open_search(ctx, "articles", tx)
idx.add("doc1", "The quick brown fox")
# Transaction commits automatically here.
# If an exception occurs, it rolls back.
You can perform queries in a separate transaction (e.g., Read-Only).
# Begin a Read-Only transaction (optional optimization)
with db.begin_transaction(ctx, mode=TransactionMode.ForReading.value) as tx:
# --- Vector Search ---
vs = db.open_vector_store(ctx, tx, "products")
hits = vs.query(ctx, vector=[0.1, 0.5, 0.8], k=5)
for hit in hits:
print(f"Vector Match: {hit.id}, Score: {hit.score}")
# --- Model Retrieval ---
ms = db.open_model_store(ctx, tx, "classifiers")
model = ms.get(ctx, "churn", "v1.0")
print(f"Loaded Model: {model['algorithm']}")
# --- B-Tree Lookup ---
us = db.open_btree(ctx, "user_store", tx)
if us.find(ctx, "user1"):
# Fetch the current item
item = us.get_current_item(ctx)
print(f"User Found: {item.value}")
Performance Tip: For Vector Search workloads that are “Build-Once-Query-Many”, use TransactionMode.NoCheck. This bypasses transaction overhead for maximum query throughput.
# High-performance Vector Search (No ACID checks)
with db.begin_transaction(ctx, mode=TransactionMode.NoCheck.value) as tx:
vs = db.open_vector_store(ctx, tx, "products")
hits = vs.query(ctx, vector=[0.1, 0.5, 0.8], k=5)
You can configure the internal logging of the SOP engine (Go backend) to output to a file or standard error, and control the verbosity.
from sop import Logger, LogLevel
# Configure logging to a file with Debug level
Logger.configure(LogLevel.Debug, "sop_engine.log")
# Or configure logging to stderr (default) with Info level
Logger.configure(LogLevel.Info)
You can configure timeouts, isolation levels, and more.
from sop import TransactionOptions
opts = TransactionOptions(
max_time=15, # 15 minutes timeout
)
tx = db.begin_transaction(ctx, options=opts)
For distributed deployments, switch to DatabaseType.Clustered. This requires Redis for coordination.
from sop.ai import DatabaseType
from sop.database import DatabaseOptions
db = Database(DatabaseOptions(
stores_folders=["/mnt/shared_data"],
type=DatabaseType.Clustered
))
To ensure your Python-created databases are visible and fully manageable in the SOP Data Manager (GUI), you should use the setup method during initialization. This persists your configuration options (like store paths, erasure coding settings, etc.) so the UI can discover them.
# 1. Define Options
options = DatabaseOptions(
stores_folders=["./data/my_db"],
type=DatabaseType.Standalone
)
# 2. Persist Options (One-time setup or on startup)
# This saves 'dboptions.json' in the database folder
Database.setup(ctx, options)
# 3. Initialize
db = Database(options)
Later, you (or the Data Manager) can inspect these options using get_options:
# Retrieve config from a path
opts = Database.get_options(ctx, "./data/my_db")
print(f"Database Type: {opts.type}")
For production environments using Clustered mode, you should initialize both Cassandra (for storage) and Redis (for distributed locking and caching) at application startup.
from sop import Redis
from sop.cassandra import Cassandra
from sop.database import Database, DatabaseOptions, DatabaseType
# 1. Initialize Redis (Required for Locking/Caching in Clustered mode)
# Format: redis://<user>:<password>@<host>:<port>/<db_number>
Redis.initialize("redis://:password@localhost:6379/0")
# 2. Initialize Cassandra (Global Connection)
Cassandra.initialize({
"cluster_hosts": ["127.0.0.1"],
"consistency": 1, # 1 = LocalQuorum
"authenticator": {
"username": "cassandra",
"password": "password"
}
})
# ... Application Logic ...
# Connect to a specific tenant's keyspace
db = Database(DatabaseOptions(
keyspace="tenant_1",
type=DatabaseType.Clustered
))
# ...
# Cleanup on shutdown
Redis.close()
Cassandra.close()
SOP uses a split architecture:

- Go Engine: compiled as a native shared library (.dylib, .so, .dll).
- Python Bindings: use ctypes to interface with the Go engine, providing a Pythonic API (the sop package).

Contributions are welcome! Please check the CONTRIBUTING.md file in the repository for guidelines.
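The ctypes mechanism mentioned above can be demonstrated with the standard library alone; this toy example loads and calls libc's abs instead of the SOP engine, but the loading pattern is the same.

```python
import ctypes
import ctypes.util

# Locate and load a shared library (here libc; the sop package loads its
# compiled Go engine through the same ctypes mechanism).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C function's signature, then call it directly from Python.
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]
print(libc.abs(-7))  # 7
```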