sop
Scalable Objects Persistence
Project maintained by SharedCode
Hosted on GitHub Pages — Theme by mattgraham
SOP Workflows: From Local Dev to Enterprise Swarm
SOP is designed to adapt to your project’s lifecycle. Whether you are a solo developer building a knowledge base or an architect designing a financial system, SOP offers flexible workflows that scale with you.
Here are the common implementation patterns.
1. The infs Path: File System Simplicity
The infs package uses the file system as the storage backend. While it works perfectly on a local disk for development, its true power is unlocked when using Network Attached Storage (NAS), S3-mounted drives, or Cloud Volumes. This allows your data to scale far beyond the limits of a single machine’s local disk.
Scenario A: The Seamless Scale-Up
Ideal for: Startups, internal tools, and applications that need to start simple but grow big.
- Develop Locally (Standalone Mode)
- Configure SOP to use Standalone Mode.
- Target a local folder or a mounted network drive.
- Benefit: Zero dependencies. No Redis to install. You can code, test, and debug on a plane without internet.
- Tip: Use the SOP Data Manager (
go run ./tools/httpserver) to visually inspect and manage your local data as you build.
- Release to Production (Clustered Mode)
- Mount that same network drive (or share the S3 bucket) to your production servers.
- Switch the configuration to Clustered Mode and point it to a Redis instance.
- Benefit: Your application instantly gains distributed coordination, locking, and caching. Multiple nodes can now read and write to the same data safely.
Scenario B: The “Build-Once, Read-Many” Engine
Ideal for: Knowledge Bases, AI RAG Stores, Static Content Delivery, Configuration Management.
- Build Phase
- A “Builder” process runs in Standalone Mode to generate or update the dataset (e.g., ingesting thousands of documents into a Vector Store).
- Serve Phase
- Deploy the dataset to your production cluster.
- Configure the reader applications to use
NoCheck Transactions.
- Benefit: Since the data is static (or rarely changes),
NoCheck skips the overhead of conflict detection and version tracking. You get raw, unbridled read speeds—perfect for high-traffic read endpoints.
Scenario C: General Purpose Enterprise Data
Ideal for: Financial Systems, Inventory Management, User Data, Transactional Logs.
- Configuration: Clustered Mode with full ACID transactions.
- Capabilities:
- Strict Financial Integrity: For banking and ledger applications, SOP’s ACID attributes guarantee that complex multi-step transactions (like transferring funds between accounts) either complete fully or not at all. This ensures zero data corruption and absolute consistency for critical financial data.
- Unlimited B-Trees: Create thousands of indexes or tables without overhead.
- O(log n) Search: Consistent, lightning-fast lookups regardless of data size.
- Ordered Range Queries: Unlike key-value stores, SOP keeps data sorted. You can efficiently scan ranges (e.g., “Get all orders from Jan 1 to Jan 31”) without full table scans.
- Mixed Workloads: Safely run Read-Only reporting jobs alongside heavy Read/Write transactional processes.
Scenario D: Swarm Computing
Ideal for: Massive ETL jobs, Scientific Computing, Global-Scale Data Processing.
- The Setup:
- High volume of Compute Nodes (The Swarm).
- Redis Cluster (The Coordinator).
- High-end Storage/Network (The Backend).
- The Outcome:
- SOP’s masterless architecture allows you to throw hardware at the problem.
- Thousands of nodes can process data in parallel, merging results in-flight.
- Zero Bottlenecks: There is no central “Master” node to choke the system. The limit is purely your hardware’s I/O capacity.
Ideal for: Video Streaming Services, Digital Asset Management, Medical Imaging Archives.
- The Challenge: S3 buckets are great for storage but poor for discovery. You often need a separate database (SQL/NoSQL) just to index the files, leading to synchronization headaches.
- The SOP Solution: Store the metadata (tags, timestamps, categories) and the blob data (video, audio, images) in the same SOP store.
- Capabilities:
- Rich Search: Use B-Tree range queries to find “All videos from 2023 tagged ‘Nature’ with duration > 5 mins”.
- Streaming Access: Use the
StreamingDataStore to stream gigabyte-sized files directly from the store without loading them entirely into memory.
- Unified Consistency: No more “ghost files” where the database record exists but the S3 file is missing. In SOP, the index and the data are updated atomically.
- Fine-Grained Data Mining: Perform complex analytics and filtering directly on the storage engine without needing an external indexer.
- Segment-Level Editing: SOP isn’t just for static files. You can perform CRUD operations on specific segments of a large blob. Need to replace a 5-minute chapter in a 2-hour movie? You can update just those specific chunks without rewriting the entire file. SOP handles this at scale, making it a powerful backend for non-linear video editing or dynamic content assembly.
Scenario F: Search Engines & AI Pipelines
Ideal for: Text Search, RAG (Retrieval-Augmented Generation), LLM Memory.
- The Need: Applications often require both a database (for truth) and a search engine (for discovery), plus a vector store (for AI). Managing three systems is complex.
- The SOP Solution: Consolidate them. SOP’s B-Trees handle the structured data and search capabilities, while its vector support handles the AI.
- Capabilities:
- ElasticSearch Alternative: Leverage SOP’s high-performance B-Trees for prefix search, range scans, and metadata filtering. It offers a simpler, ACID-compliant alternative to managing a separate Lucene/Elastic cluster.
- RAG Pipelines: Store document chunks and embeddings side-by-side. When an LLM needs context, retrieve the exact text segments instantly.
- Transactional AI: Ensure your AI’s “memory” is consistent. When you update a document, its vector embeddings and text chunks are updated in the same transaction, preventing hallucinations caused by stale data.
Scenario G: Desktop Publishing & Read-Only Distribution
Ideal for: E-Books, Legal Archives, Offline Encyclopedias, Shared Network Libraries.
- The Workflow:
- Authoring: An author uses a desktop app (powered by SOP) to write, edit, and organize thousands of chapters, images, and references.
- Publishing: The finished SOP data folder is “burned” onto read-only media (USB drives, CD-ROMs) or hosted on a read-only network share.
- Consumption: Users run a viewer app directly from the media or network drive.
- The SOP Advantage:
NoCheck Speed: Since the data is read-only, the viewer app runs in NoCheck transaction mode. This bypasses all locking and conflict detection, delivering instant search results and data retrieval directly from the disk or network.
- Zero Installation: No database server to install. The data is the database.
- Rich Search: Users can perform complex queries (e.g., “Find all legal precedents from 1990-1995 containing ‘Copyright’”) instantly, whether offline or online.
Scenario H: The Database Engine Construction Kit
Ideal for: Database Developers, Custom Query Languages, Specialized RDBMS Makers.
- The Opportunity: Building a database from scratch is hard. You need B-Trees, WAL, Locking, Transaction Managers, and Buffer Pools.
- The SOP Shortcut: SOP gives you all of that out of the box. It is a “Storage Engine Construction Kit”.
- Your Job: You focus on the high-level logic—writing the SQL Parser, Query Optimizer, or LINQ Provider.
- The Result: A custom, ACID-compliant database engine built in weeks, not years, with native support for Blobs, Vectors, and Distributed Transactions.
Scenario I: The Edge-to-Cloud Continuum
Ideal for: IoT Fleets, Connected Cars, Medical Devices, Smart Sensors.
- The Architecture:
- On the Edge: Your smart device (e.g., an MRI machine or an autonomous vehicle) runs SOP in Standalone Mode. It captures high-frequency sensor data, logs, and telemetry locally with extreme speed and reliability.
- In the Cloud: Your backend runs SOP in Clustered Mode (Swarm). It aggregates data from millions of devices.
- The Synergy:
- Unified Data Format: Since both ends use SOP, there is no complex ETL or serialization needed to move data from the car to the cloud. A B-Tree node on the device is compatible with the B-Tree in the cloud.
- Offline Autonomy: The device operates fully offline with ACID integrity. When connectivity is restored, it syncs the delta to the cloud.
Scenario J: Data Sovereignty & Compliance
Ideal for: GDPR, HIPAA, Government, Multi-Region Clouds.
- The Challenge: Laws often dictate that user data from Country X must physically reside in Country X. Managing this with a single monolithic database cluster is a nightmare of sharding and routing.
- The SOP Solution: Since SOP stores data in standard file folders (or S3 buckets), you can trivially map specific Tenants or Keyspaces to specific physical storage locations.
- Capabilities:
- Physical Isolation: “German Users” get stored in
/mnt/germany_disk, “US Users” in /mnt/us_disk. The application logic remains the same, but the physical storage is strictly separated.
- Portable Compliance: Need to move a dataset to an air-gapped server for an audit? Just copy the folder. The entire database moves with it.
- Granular Encryption: You can apply file-system level encryption to specific folders (tenants) without performance penalties on the rest of the system.
Scenario K: DevOps, Testing & Release Management
Ideal for: CI/CD Pipelines, Integration Testing, QA, Environment Promotion.
- The Problem: Spinning up a full Oracle/Postgres instance for every test run is slow. Promoting data from Staging to Production often involves risky migration scripts.
- The SOP Solution: SOP treats the database as a standard file folder.
- Capabilities:
- Instant Snapshots: Want to save the state of the database before a test?
cp -r data data_backup. Restore? cp -r data_backup data. It takes milliseconds.
- Parallel Execution: Run 50 test suites in parallel on the same machine. Just give each one a unique temp folder. No port conflicts, no shared state.
- Simple Promotion: Promoting a reference dataset (e.g., a product catalog) from Dev to QA to Production is just a file copy. You can literally “ship the database” as an artifact alongside your application binary.
Scenario L: Big Data & Analytics
Ideal for: Log Management, Audit Trails, IoT Telemetry, Large-Scale Document Stores.
- The Challenge: You have millions of records (e.g., JSON documents, images, or logs). You frequently need to filter them based on metadata (e.g., “Find all ‘Error’ logs from ‘Server-1’ in the last hour” or “List all ‘Active’ users”). Fetching the full record just to check a status flag is too slow and I/O intensive.
- The SOP Solution: Use “Ride-on” Keys and Complex Keys.
- Capabilities:
- Ride-on Metadata: Embed critical metadata (Status, Timestamp, Category) directly into the Key struct. SOP allows you to fetch only the keys from the B-Tree. This means you can scan millions of items per second to filter data without ever touching the heavy Value payload on disk.
- Complex Indexing: Define composite keys (e.g.,
Region -> Department -> EmployeeID) using standard structs. SOP automatically handles the sorting and indexing, allowing for efficient prefix scans and range queries across multiple dimensions.
- Soft Deletes: Implement “Trash Can” functionality by adding an
IsDeleted flag to your key. Your application can instantly filter out deleted items during a key scan without the performance hit of physical deletion or the complexity of a separate “tombstone” table.
2. The incfs Path: Cassandra Supercharged
The incfs package allows you to use Apache Cassandra as the registry backend while keeping data blobs on the file system.
Scenario: The Legacy Upgrade
Ideal for: Teams with existing or planned Cassandra infrastructure who need stronger consistency.
- The Problem: You have a Cassandra-based system that scales well, but you are struggling with eventual consistency, lack of transactions, or the inability to perform efficient range queries.
- The Fix: Integrate SOP using the
incfs package.
- The Result:
- ACID Transactions: You gain full commit/rollback capabilities on top of Cassandra.
- B-Tree Indexing: Data is now ordered and indexable, enabling queries that raw Cassandra struggles with.
- Large Item Management: SOP handles the chunking and management of large data blobs automatically.
- Infrastructure Reuse: You leverage your existing investment in Cassandra for reliability while gaining the features of a modern transactional engine.
3. Universal Interoperability
SOP is designed to be the universal storage layer for your entire stack, regardless of language or platform.
- Multi-Language Support:
- Native Go: The core library is written in Go for maximum performance and concurrency.
- Python Wrapper: Full access to SOP features for Data Science and AI workflows (
sop4py).
- C/C++ Integration: Can be linked as a shared library.
- Cross-Platform & Architecture:
- Write Once, Run Anywhere: Pre-compiled binaries and libraries are available for Linux, Windows, macOS, and Android.
- Hardware Agnostic: Runs efficiently on everything from ARM-based Raspberry Pis and mobile devices to x86_64 cloud servers.
- Mixed-Platform Clusters: You can build a single logical enterprise cluster using a mix of Linux servers, Windows workstations, and macOS nodes. SOP abstracts away the OS-level differences in DirectIO, File Sector I/O, and Locking. A lock acquired by a Linux node is respected by a Windows node (via Redis), and data written by one is readable by the other.
- Microservices & REST:
- The Universal API: For languages without native bindings (Java, C#, Node.js), SOP can be deployed as a lightweight RESTful Microservice.
- Polyglot Architecture: A Python AI agent, a Node.js web server, and a C# desktop app can all read/write to the same SOP cluster simultaneously with full transactional integrity.