# SAM3 in Annolid
This guide documents the SAM3/SAM3.1 integration in Annolid for video segmentation and tracking.
## What Annolid uses

Annolid uses the bundled SAM3 runtime under:

`annolid/segmentation/SAM/sam3/sam3`

and integrates it through:

- `annolid/segmentation/SAM/sam3/adapter.py`
- `annolid/segmentation/SAM/sam3/session.py`
- `annolid/gui/widgets/sam3_manager.py`

The runtime is configured for inference only (no training/eval workflow required).
## Installation

Install Annolid with SAM3 extras:

```shell
pip install ".[sam3]"
```

Minimum practical runtime dependencies include:

- `torch`
- `iopath`
- `ftfy`

If a dependency is missing, Annolid raises a startup error from the SAM3 manager/session path.
## Input modes

SAM3 runs in two main modes.

- Seeded mode (annotation-guided):
  - Uses existing per-frame prompts when available.
  - Supported prompt shapes: boxes/rectangles, polygons/masks, and points.
  - Polygon prompts are converted to robust box/mask seeds for SAM3 propagation.
- Text-only mode:
  - Uses a text prompt (for example, `mouse`) and runs SAM3.1 windowed propagation.
  - This is the fallback when no usable geometric prompts (box/point/polygon) exist.
## Annolid Bot workflow for long videos
If you want Annolid Bot to do the tracking for you, use the bot-facing long-video tool:
`sam3_agent_video_track`
Use this when you need the agent to refine each window before propagation. It is the best fit for:
- long recordings that drift when tracked in a single pass
- repeated appearances after occlusion
- multiple objects that need stable identities across windows
Recommended workflow:

1. Inspect the source video first if you do not know its frame count or FPS. In Annolid Bot, call `video_info(path)` first.
2. Run a dry run to confirm the output folder and settings. Set `dry_run=true`.
3. Run the full SAM3 agent-seeded tracking pass.
4. Review the returned JSON summary and the saved artifact path.
Example bot call:
```
sam3_agent_video_track(
    video_path="/path/to/video.mp4",
    agent_prompt="mouse",
    window_size=5,
    stride=5,
    propagation_direction="forward",
    dry_run=true
)
```
When the settings look right, rerun with `dry_run=false`.
## Output artifacts

The tool returns a JSON summary and writes the same summary to disk.

Default output folder:

`<video_stem>_sam3_agent/`

Default summary filename:

`<video_stem>_sam3_agent_tracking.json`

The summary includes:

- `frames_processed`
- `masks_written`
- `summary_path`
- the resolved runtime config

If you specify `output_dir`, Annolid writes the summary there instead.
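The summary keys above can be consumed programmatically. Here is a minimal sketch assuming only the documented keys (`frames_processed`, `masks_written`); `summarize_coverage` and `load_sam3_summary` are illustrative helpers, not Annolid APIs:

```python
import json
from pathlib import Path

def summarize_coverage(payload):
    """Compute a quick masks-per-frame coverage signal from a summary dict."""
    frames = payload.get("frames_processed", 0)
    masks = payload.get("masks_written", 0)
    # Masks per processed frame is a fast sanity check for sparse runs.
    coverage = masks / frames if frames else 0.0
    return {"frames_processed": frames, "masks_written": masks, "coverage": coverage}

def load_sam3_summary(summary_path):
    """Read the saved summary JSON from disk and summarize it."""
    return summarize_coverage(json.loads(Path(summary_path).read_text()))
```

A coverage number far below what you expect per tracked object is a hint to revisit window settings or detection thresholds before a longer run.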
## Choosing the right tool

Use `sam3_agent_video_track` when:

- you want agent-assisted long-video tracking
- the object may disappear and reappear
- you want overlap-aware carry-over between windows

Use `gui_segment_track_video` when:

- you are working interactively inside the GUI
- you want to open a video and run the standard GUI-assisted segment/track flow

Use `video_run_model_inference` when:

- you only want to run a normal model predict command on a video
- you do not need SAM3 agent reseeding or window overlap carry-over
## GUI slash track
If you are working in the Annolid GUI chat box, the `/track` slash command is a
shorter front-end to the interactive video workflow. Typing `/track` with no
arguments opens a guided form; filling it out inserts a structured command that
routes to either `gui_segment_track_video` or `sam3_agent_video_track` depending
on the selected model.
Examples:
```
/track
/track video=/path/to/video.mp4 prompt="mouse" model=Cutie
/track video=/path/to/video.mp4 prompt="mouse" model=Cutie to_frame=400
/track /path/to/video.mp4 mouse
```
Field guide:
- `video=` or `path=` points to the source video
- `prompt=` or `text_prompt=` provides the text prompt
- `model=` selects from the AI Models dropdown list in the main window
- `to_frame=` stops tracking at a target frame when you want a bounded run
The guided form also shows which bot provider/model SAM3 will reuse for the vision seed call, so you can confirm the active LLM before inserting the command.
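To make the field syntax concrete, here is a simplified sketch of how such `key=value` fields could be parsed; `parse_track_command` and `KNOWN_KEYS` are illustrative, not Annolid's actual parser:

```python
import shlex

# Field names taken from the field guide above; everything else here
# is an illustrative simplification.
KNOWN_KEYS = {"video", "path", "prompt", "text_prompt", "model", "to_frame"}

def parse_track_command(text):
    """Parse a '/track key=value ...' command into a field dict.

    Bare positional tokens are treated as video path then prompt,
    mirroring the '/track /path/to/video.mp4 mouse' shorthand.
    """
    tokens = shlex.split(text)
    assert tokens and tokens[0] == "/track"
    fields, positional = {}, []
    for token in tokens[1:]:
        if "=" in token:
            key, _, value = token.partition("=")
            if key in KNOWN_KEYS:
                fields[key] = value
                continue
        positional.append(token)
    if positional:
        fields.setdefault("video", positional[0])
    if len(positional) > 1:
        fields.setdefault("prompt", positional[1])
    return fields
```

`shlex.split` handles the quoted prompt form (`prompt="mouse"`) the same way a shell would, so quotes are stripped before the value is stored.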
This slash command is the right choice when you want a quick GUI shortcut for
the standard open-video + prompt + track flow. When you choose SAM3 from the
AI Models dropdown, the same slash command uses the SAM3 agent text-prompt
pipeline instead of the legacy GroundingDINO+SAM workflow, and it reuses the
current Annolid bot provider/model for the agent VLM call. If that agent path
cannot complete, Annolid falls back to the standard GUI video workflow so the
video still opens and tracking can continue with the selected GUI tracking
model.
## Prompt transaction model

Annolid enforces a single prompt type per SAM3 request at a common boundary in `Sam3SessionManager`.

Transaction rules:

- One request contains exactly one prompt kind: `text` or `boxes` or `points`.
- Mixed inputs are split into an ordered transaction: `text -> boxes -> points`.
- Point prompts are always tracker prompts:
  - point transactions use `obj_id` (required by SAM3.1 point refinement)
  - point labels are normalized to binary foreground/background (`0`/`1`)
- Box labels are normalized to binary (`0`/`1`) for SAM3 geometric prompt compatibility.
Why this exists:
- SAM3.1 point prompting rejects mixed text/box payloads in the same request.
- Explicit transaction sequencing removes mixed-prompt edge failures and makes behavior deterministic across normal and windowed runs.
Where implemented:
- `annolid/segmentation/SAM/sam3/session.py`:
  - `_build_prompt_transaction_steps(...)`
  - `_execute_prompt_transaction(...)`
  - `add_prompt(...)`
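The transaction rules above can be sketched as a small step builder. This is a simplified model of what `_build_prompt_transaction_steps(...)` does, with illustrative payload shapes, not the real implementation:

```python
# Fixed ordering from the transaction rules: text -> boxes -> points.
PROMPT_ORDER = ("text", "boxes", "points")

def build_prompt_transaction_steps(request):
    """Split a mixed prompt request into single-kind steps.

    Each step carries exactly one prompt kind, emitted in the fixed
    order. Point steps are tracker prompts and require an obj_id.
    """
    steps = []
    for kind in PROMPT_ORDER:
        payload = request.get(kind)
        if not payload:
            continue
        step = {"kind": kind, "payload": payload}
        if kind == "points":
            # SAM3.1 point refinement rejects point prompts without obj_id.
            if "obj_id" not in request:
                raise ValueError("point prompts require obj_id")
            step["obj_id"] = request["obj_id"]
        steps.append(step)
    return steps
```

Sequencing the kinds explicitly is what makes mixed-prompt behavior deterministic across normal and windowed runs.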
## Multi-object prompt identity
For canvas-driven prompting, Annolid maps prompts to stable per-instance object ids.
Identity priority:

1. `group_id` (if present and valid)
2. existing label-to-id mapping from loaded annotations/session
3. deterministic new id allocation
Effects:
- point/polygon/box prompts can refine/add-back the correct object instead of collapsing to a single default object id.
- object identity remains stable across repeated prompt edits in the same run.
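The identity priority can be sketched as a small resolver. This is illustrative only (names and the counter-based allocation are assumptions, not Annolid's actual code):

```python
def resolve_object_id(shape, label_to_id, next_id_counter):
    """Resolve a stable object id for a prompt shape.

    Priority: a valid group_id wins, then an existing label-to-id
    mapping, then a deterministic newly allocated id (which is also
    recorded for the label so later prompts stay on the same object).
    """
    group_id = shape.get("group_id")
    if isinstance(group_id, int) and group_id >= 0:
        return group_id
    label = shape.get("label")
    if label in label_to_id:
        return label_to_id[label]
    new_id = next(next_id_counter)
    if label is not None:
        label_to_id[label] = new_id
    return new_id
```

Recording the allocation in the mapping is what keeps repeated prompt edits on the same instance instead of collapsing to one default id.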
## Windowed inference behavior
For text-only runs, Annolid uses a windowed strategy to improve long-video stability.
Key properties:
- Reads frames from the source video timeline when input is an `.mp4`.
- Streams overlapping windows sequentially from the video instead of seeking back to each window start.
- Reuses the temporary window frame directory and only trims stale tail files between windows.
- Chooses larger default windows automatically for long CPU/CUDA runs when the user did not override `window_size` or `stride`.
- Uses overlapping windows by default (`stride = window_size - 1`) for boundary robustness.
- Carries visual prompt boxes from nearest neighbor mask frames across windows.
- Reacquires missed frames with visual+text prompts after the primary pass.
- Reacquires partially lost instances as well, and merges recovered masks back into the existing frame instead of replacing already tracked instances.
- Finalizes frame coverage by ensuring expected frame JSON outputs exist and are valid.
This is implemented in `Sam3SessionManager` in `session.py`.
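The overlap default above (`stride = window_size - 1`) can be sketched as a window generator; `iter_windows` is an illustrative simplification, not the session's actual scheduler:

```python
def iter_windows(total_frames, window_size, stride=None):
    """Yield (start, end) frame windows; end is exclusive.

    By default stride = window_size - 1, so consecutive windows share
    one boundary frame, which is what gives boundary robustness.
    """
    if stride is None:
        stride = max(1, window_size - 1)
    start = 0
    while start < total_frames:
        end = min(start + window_size, total_frames)
        yield (start, end)
        if end >= total_frames:
            break
        start += stride
```

With 10 frames and a window of 5, this produces `(0, 5)`, `(4, 9)`, `(8, 10)`: each window begins on the last frame of the previous one, so detections near a split are seen by two windows.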
## Device policy

Annolid applies defensive device handling in SAM3 paths.

- Chooses the runtime device from user/default config.
- Falls back from unstable MPS paths to CPU when needed.
- Aligns index tensors and data tensors to the same device before `torch.index_select`/`torch.isin` operations.
- Uses strict JSON-safe serialization for frame outputs to avoid malformed files.
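The strict JSON-safe write pattern mentioned above can be sketched with the standard library; `write_frame_json` is an illustration of the pattern, not Annolid's exact code:

```python
import json
import os
import tempfile

def write_frame_json(path, payload):
    """Write per-frame JSON safely: strict serialization + atomic replace.

    allow_nan=False rejects NaN/Infinity values, which would otherwise
    serialize to tokens that strict JSON parsers cannot read.
    os.replace swaps the file in atomically, so readers never see a
    half-written frame file.
    """
    text = json.dumps(payload, allow_nan=False, indent=2)
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because serialization happens before the temp file is created, a payload containing NaN fails fast and leaves any existing frame file untouched.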
## Output files

Given video `.../mouse.mp4`, outputs are written under:

`.../mouse/`

Primary artifacts:

- Per-frame LabelMe JSON: `000000000.json`, `000000001.json`, ...
- Annotation store: `mouse_annotations.ndjson`
- CSV exports (post-processing): `mouse_tracking.csv`, `mouse_tracked.csv`

Notes:

- Frames with no masks are still materialized as valid JSON with an empty `shapes` list.
- CSV files contain object rows. A frame with zero detections will not necessarily have a row in `*_tracked.csv`.
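A quick coverage check over these per-frame files can be sketched as follows, assuming the 9-digit zero-padded naming shown above; `check_frame_coverage` is an illustrative helper, not an Annolid API:

```python
import json
import os

def check_frame_coverage(output_dir, total_frames):
    """Return frame indices whose JSON is missing, unparsable, or
    lacks a 'shapes' list.

    Mirrors the idea that every expected frame file must exist and be
    valid JSON, even when it has zero masks.
    """
    bad = []
    for index in range(total_frames):
        path = os.path.join(output_dir, f"{index:09d}.json")
        try:
            with open(path) as handle:
                payload = json.load(handle)
            if "shapes" not in payload:
                bad.append(index)
        except (OSError, ValueError):
            bad.append(index)
    return bad
```

An empty return list means the frame range is fully covered by valid files; any indices it reports are the ones a coverage-finalization pass would need to repair.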
## Runtime knobs

SAM3-related runtime settings are read from Annolid config/GUI state (advanced parameters).

Common knobs:

- `checkpoint_path`
- `max_frame_num_to_track`
- `device`
- `score_threshold_detection`
- `new_det_thresh`
- `max_num_objects`
- `multiplex_count`
- `compile_model`
- `offload_video_to_cpu`
- `sliding_window_size`
- `sliding_window_stride`
- `use_sliding_window_for_text_prompt`
## Interactive session controls (GUI)

Annolid exposes notebook-like SAM3 session controls in the GUI:

- Reset SAM3 Session
- Close SAM3 Session
- Remove SAM3 Object… (by object id)

These actions are available from:

- AI & Models menu
- canvas right-click context menu

Safety behavior:

- controls are blocked while prediction is actively running (stop prediction first)
- remove-object runs at the current frame and refreshes loaded prediction shapes
- the remove-object dialog prefills the object id from the selected shape when possible (`group_id` first, then label mapping)
If `sliding_window_size` and `sliding_window_stride` are not set explicitly, Annolid derives them from runtime context:
- short CPU runs: smaller windows
- long CPU runs: moderate windows to reduce session churn
- CUDA runs: larger windows with moderate overlap
- explicit user values still take priority and are normalized to keep at least 1-frame overlap
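The derivation and normalization can be sketched as below. The tiers and specific numbers are illustrative assumptions (the real heuristics live in the session code); only the priority of explicit values and the at-least-1-frame-overlap rule come from the description above:

```python
def derive_sliding_window(user_size, user_stride, device, total_frames):
    """Pick sliding-window defaults from runtime context.

    Explicit user values win, but are normalized so consecutive
    windows always overlap by at least one frame.
    """
    if user_size is not None:
        size = max(2, user_size)
        stride = user_stride if user_stride is not None else size - 1
        # stride >= size would remove the overlap entirely; clamp it.
        stride = min(max(1, stride), size - 1)
        return size, stride
    # Illustrative tiers only; the real numbers are chosen by Annolid.
    if device == "cuda":
        size = 30  # larger windows with moderate overlap
    elif total_frames > 1000:
        size = 15  # long CPU run: moderate windows, less session churn
    else:
        size = 8   # short CPU run: smaller windows
    return size, size - 1
```

Note how an explicit `stride=12` with `window_size=10` is clamped back to 9, preserving the overlap guarantee.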
## Troubleshooting

### 1) Missing dependency error

Symptoms:

- startup failure mentioning `iopath`, `ftfy`, or SAM3 extras

Fix:

```shell
pip install ".[sam3]"
```
### 2) Device mismatch errors (cuda vs cpu)

Symptoms:

```
RuntimeError: Expected all tensors to be on the same device ... index_select
```
Status:
- Patched in SAM3 postprocess paths to align indices/tensors by device.
If still seen:
- collect traceback + device setting + model config
- verify same Annolid revision is running in target environment
### 3) Malformed per-frame JSON (`Expecting value ...`)
Symptoms:
- frame JSON fails to parse in GUI or tracking reports
Status:
- Frame writes now use strict JSON-safe serialization and atomic replace.
- Coverage finalization repairs missing/corrupt frame JSON files.
### 4) Window-boundary frame drop
Symptoms:
- sparse or unstable detections around boundaries (for example, near frames 15/16 with window size 15)
Mitigations in the current integration:

- forced overlap by default (`stride < window_size`)
- nearest-neighbor carry prompts
- post-pass reacquisition on missed frames

If results are still sparse:

- reduce `sliding_window_size`
- keep overlap (do not set `stride >= window_size`)
- lower detection thresholds conservatively
### 5) Annolid Bot run succeeds but the summary file is missing

Symptoms:

- the tool returns `ok=true`, but you cannot find the saved JSON summary

Check:

- whether `output_dir` was set explicitly
- whether the selected workspace is writable
- whether `dry_run=true` was used

Remember:

- `dry_run=true` does not execute tracking and does not create the output folder
- the summary path reported in the JSON payload is the canonical artifact location
## Debug checklist

When debugging a run, capture:

- Model + runtime options (window size, stride, thresholds, device)
- Session logs from `sam3_manager` and `session`
- Whether per-frame JSONs are valid JSON for the expected frame range
- `max_frame` from the GUI prediction summary vs the video frame count
- Boundary frame behavior (`N-1`, `N`, `N+1` around each window split)