sop

Scalable Objects Persistence



Operational Guide (DevOps)

This guide covers the operational aspects of running SOP in production, including failover handling, connection management, and backup strategies.

Connection Management

SOP relies on long-lived (pooled) connections to Redis and Cassandra (for incfs).

Note: If running in Standalone Mode (using sop.InMemory cache), Redis is not required, and this section can be ignored.

Note: SOP does not require Redis data persistence (RDB/AOF). Redis is used for ephemeral locking and caching. If Redis restarts, SOP detects the change and recovers safely.

Redis

Cassandra (incfs)

Failover Logic

SOP detects storage failures and fails over to a standby storage path transparently, without application intervention.

“Failover Qualified” I/O Errors

Not all errors trigger a failover. SOP distinguishes between transient errors (retryable) and permanent hardware/filesystem failures.

Triggers for Failover:

Behavior:

  1. Detection: A “Failover Qualified” error occurs during a write.
  2. Switch: The system automatically marks the current storage path as “Passive” and promotes the configured standby path to “Active”.
  3. Recovery: Operations continue on the new path. The failed path requires manual intervention or auto-repair (if configured).

Monitoring

Backup & Restore

Hybrid Backend (incfs)

Backing up a hybrid system requires coordination.

  1. Snapshot Registry (Cassandra):
    • Use nodetool snapshot to capture the state of the Cassandra keyspace.
  2. Snapshot Blob Store (Filesystem):
    • Use filesystem snapshots (e.g., ZFS, LVM, or cloud volume snapshots) to capture the data directory.
  3. Consistency:
    • Ideally, pause writes during the snapshot window to ensure the Registry and Blob Store are perfectly aligned.
    • If zero downtime is required, snapshot Cassandra first, then the Filesystem. Because SOP is Copy-On-Write, old blobs (referenced by the older Cassandra snapshot) will still exist on disk, ensuring a consistent point-in-time restore.
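As a minimal sketch of the zero-downtime ordering, the helper below only builds the two snapshot commands in the required order and prints them as a dry run. It assumes the Blob Store sits on a ZFS dataset; the keyspace name, dataset name, and tag are placeholders, and `backupPlan` is a hypothetical helper rather than part of SOP.

```go
package main

import (
	"fmt"
	"strings"
)

// backupPlan returns the snapshot commands in order: Cassandra (Registry)
// first, then the filesystem (Blob Store).
// ASSUMPTIONS: the Blob Store lives on a ZFS dataset; keyspace, dataset and
// tag are placeholder values supplied by the operator.
func backupPlan(keyspace, dataset, tag string) [][]string {
	return [][]string{
		{"nodetool", "snapshot", "-t", tag, keyspace},
		{"zfs", "snapshot", dataset + "@" + tag},
	}
}

func main() {
	// Dry run: print the commands instead of executing them.
	for i, cmd := range backupPlan("sop_registry", "tank/sop-data", "backup-001") {
		fmt.Printf("%d. %s\n", i+1, strings.Join(cmd, " "))
	}
}
```

Using the same tag for both snapshots makes it easier to pair the Registry and Blob Store copies at restore time.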

Restore Procedure

  1. Stop the application.
  2. Restore the Cassandra keyspace.
  3. Restore the Filesystem data.
  4. Start the application.
  5. Run Integrity Check: Use SOP’s internal tools to verify that all Registry entries point to valid blobs.

Data Management Suite (SOP Web UI)

SOP includes a powerful HTTP Server and Web UI that functions as a full database management suite. It allows you to:

Note on Architecture: This tool is not a central database server. In SOP’s masterless architecture, this UI is simply another client node. You can run it locally on your laptop to manage a remote production cluster, or deploy it as a sidecar. It connects directly to the storage layer and respects all ACID guarantees without introducing a central bottleneck. Each user managing data through this app participates in “swarm” computing: concurrent changes are merged, or rejected if they conflict.

Running the Management Suite

You can run the tool directly from the source:

# Point to your SOP registry folder
go run ./tools/httpserver -registry /path/to/your/sop/data

Access the UI at http://localhost:8080.

Key Features

For more details, see the SOP Data Manager Documentation.

Troubleshooting & Best Practices

Clustered Mode: Data Deletion

When running SOP in Clustered Mode (using Redis + Disk Storage), it is critical to maintain synchronization between the persistent data on disk and the ephemeral locks/cache in Redis.

Recommended Practice:

Manual Deletion (Development Only): If you must delete the data files manually (e.g., during local development or a hard reset):

  1. Stop all Applications: Ensure no SOP processes (services, CLI, Data Manager) are running.
  2. Delete Files: Remove the store folders/files from the disk.
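  3. Flush Redis: You MUST run FLUSHALL on your Redis instance immediately after deleting the files. Note that FLUSHALL removes every key in every database on that instance, so do not run it against a Redis shared with other applications.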
    redis-cli flushall
    

Why this is necessary: Redis maintains locks and cached metadata with a default timeout (typically 15 minutes). If you delete the files but leave the Redis keys active, any new application instance you start will see “ghost” locks or stale metadata, preventing it from recreating the stores or acquiring locks until the timeout expires.