
How to run the NVIDIA NanoOWL tutorial

Keywords: NVIDIA Jetson, NanoOWL, object detection

Description

This wiki page shows how to run the "tree prediction with a live camera" example from NVIDIA NanoOWL. NanoOWL is a project that optimizes OWL-ViT to run in real time on NVIDIA Jetson Orin platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.

Set up

Minimum requirements:

  • One of the following Jetson devices:
    • Jetson Orin NX (16 GB)
    • Jetson Orin Nano (8 GB)
    • Jetson AGX Orin (32 GB or 64 GB)
  • Running one of the following JetPack versions:
    • JetPack 5 (L4T r35.x)
    • JetPack 6 (L4T r36.x)

This tutorial was tested with the following setup:

  • x86_64-based host machine running Ubuntu 24.04
  • Jetson Orin NX (16 GB) + CTI Boson carrier board running JetPack 6.2.1 (L4T 36.4.4)
  • One Framos IMX464 camera
  • Ethernet cable
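
To confirm which JetPack/L4T release your board is running, you can read the L4T release file (a standard location on Jetson devices; this check is optional):

cat /etc/nv_tegra_release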

1. Connect to the board via ssh:

 
ssh <user>@<board_ip>

You should now have a shell session on the board.

Tutorial

1. Clone and set up jetson-containers:

 
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

2. Run the following to pull or build a compatible container image:

 
jetson-containers run --workdir /opt/nanoowl $(autotag nanoowl)

After running the above command, you should be in the working directory (/opt/nanoowl) inside the container.
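
The demo in step 5 expects a prebuilt TensorRT image encoder engine at data/owl_image_encoder_patch32.engine. If that file is missing in your container, the NanoOWL repository documents building it with a command along these lines (check the NanoOWL README for the exact invocation for your version):

python3 -m nanoowl.build_image_encoder_engine data/owl_image_encoder_patch32.engine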

3. Verify you have a camera connected:

 
ls /dev/video*
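
For more detail than the plain device list, v4l2-ctl can enumerate cameras by name (it ships with the v4l-utils package, which may need to be installed first):

v4l2-ctl --list-devices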

4. Install the missing aiohttp module inside the container:

 
pip install aiohttp

Note: If this takes too long or fails, try the following command:

 
pip install aiohttp --index-url https://pypi.org/simple --prefer-binary

5. Run the tree_demo.py example:

 
cd examples/tree_demo
python3 tree_demo.py --camera 0 --resolution 640x480 ../../data/owl_image_encoder_patch32.engine

You should see the following after the application starts:

 
======== Running on http://0.0.0.0:7860 ========
(Press CTRL+C to quit)

6. On your host PC, open a browser and navigate to:

 
http://<board_ip>:7860
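
If the page does not load, you can first verify that the server is reachable from the host (optional; assumes curl is installed on the host):

curl -I http://<board_ip>:7860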

7. Now you are ready to test some prompts! Use the following as examples (the prompt syntax is explained after this list):

  • [a face]
  • [a face, a ball]
  • [a face [a nose, an eye]]
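
In these prompts, square brackets define a detection level: classes separated by commas are detected side by side, and nesting a bracketed list inside a class detects sub-regions within each detected instance (here, a nose and an eye inside each face). The demo parses the prompt with Tree.from_prompt, the same call used in the server code shown later on this page. A minimal sketch of that parsing step, assuming you are inside the NanoOWL container:

from nanoowl.tree import Tree

# Parse a nested prompt: detect faces, then a nose and an eye within each face
tree = Tree.from_prompt("[a face [a nose, an eye]]")
print(tree)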

Adaptations

  • If you are using a camera that does not support the GStreamer v4l2src element (for example, a CSI-connected camera), you may encounter an error similar to the following:
 
[ WARN:0@15.685] global cap_gstreamer.cpp:2829 handleMessage OpenCV | GStreamer warning: Embedded video playback halted; module v4l2src0 reported: Internal data stream error.

To enable frame capture from such a camera, modify the detection_loop method as follows:

 
        # Delete or comment the three lines of code below
        # camera = cv2.VideoCapture(CAMERA_DEVICE)
        # camera.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        # camera.set(cv2.CAP_PROP_FRAME_HEIGHT, height)

        # Add the following GStreamer pipeline to use nvarguscamerasrc
        gst_pipeline = (
                 f"nvarguscamerasrc ! "
                 f"video/x-raw(memory:NVMM), width={width}, height={height}, format=NV12, framerate=30/1 ! "
                 f"nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
                 f"appsink"
        )

        camera = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
        camera.set(cv2.CAP_PROP_BUFFERSIZE, 1)
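
Before editing the Python code, it can help to confirm that the capture pipeline itself works from a shell on the board (a quick sanity check; fakesink discards the frames, so the pipeline simply running without errors indicates success):

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=640, height=480, format=NV12, framerate=30/1' ! nvvidconv ! 'video/x-raw, format=BGRx' ! videoconvert ! fakesink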

Performance improvements

  • Observation: In the original implementation, both frame capture and model prediction were executed inside the same synchronous function (_read_and_encode_image). This caused the entire video pipeline to block while waiting for the model to finish inference, resulting in several seconds of latency in the browser.
  • Improvement: To resolve this, the prediction logic was moved into a separate asynchronous background task (prediction_loop). This design fully decouples capture from prediction, ensuring that the video stream remains smooth and low-latency, even when inference takes several seconds.

The complete implementation, with the capture loop separated from the prediction loop, is shown below.

 
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import asyncio
import argparse
from aiohttp import web, WSCloseCode
import logging
import weakref
import cv2
import time
import PIL.Image
import matplotlib.pyplot as plt
from typing import List
from nanoowl.tree import Tree
from nanoowl.tree_predictor import (
    TreePredictor
)
from nanoowl.tree_drawing import draw_tree_output
from nanoowl.owl_predictor import OwlPredictor


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("image_encode_engine", type=str)
    parser.add_argument("--image_quality", type=int, default=50)
    parser.add_argument("--port", type=int, default=7860)
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--camera", type=int, default=0)
    parser.add_argument("--resolution", type=str, default="640x480", help="Camera resolution as WIDTHxHEIGHT")
    args = parser.parse_args()
    width, height = map(int, args.resolution.split("x"))

    CAMERA_DEVICE = args.camera
    IMAGE_QUALITY = args.image_quality

    predictor = TreePredictor(
        owl_predictor=OwlPredictor(
            image_encoder_engine=args.image_encode_engine
        )
    )

    prompt_data = None
    
    # Shared state between capture and prediction
    shared_state = {
        "latest_frame": None,
        "latest_detections": None,
        "frame_count": 0
    }

    def get_colors(count: int):
        cmap = plt.cm.get_cmap("rainbow", count)
        colors = []
        for i in range(count):
            color = cmap(i)
            color = [int(255 * value) for value in color]
            colors.append(tuple(color))
        return colors


    def cv2_to_pil(image):
        # Measure the BGR-to-RGB conversion time
        t0 = time.perf_counter_ns()
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        t1 = time.perf_counter_ns()
        dt = (t1 - t0) / 1e9
        logging.info(f"CV2 to PIL time: {dt:.3f}s")
        return PIL.Image.fromarray(image)


    async def handle_index_get(request: web.Request):
        logging.info("handle_index_get")
        return web.FileResponse("./index.html")


    async def websocket_handler(request):

        global prompt_data

        ws = web.WebSocketResponse()

        await ws.prepare(request)

        logging.info("Websocket connected.")

        request.app['websockets'].add(ws)

        try:
            async for msg in ws:
                logging.info(f"Received message from websocket.")
                if "prompt" in msg.data:
                    header, prompt = msg.data.split(":", 1)
                    logging.info("Received prompt: " + prompt)
                    try:
                        tree = Tree.from_prompt(prompt)
                        clip_encodings = predictor.encode_clip_text(tree)
                        owl_encodings = predictor.encode_owl_text(tree)
                        prompt_data = {
                            "tree": tree,
                            "clip_encodings": clip_encodings,
                            "owl_encodings": owl_encodings
                        }
                        logging.info("Set prompt: " + prompt)
                    except Exception as e:
                        print(e)
        finally:
            request.app['websockets'].discard(ws)

        return ws


    async def on_shutdown(app: web.Application):
        for ws in set(app['websockets']):
            await ws.close(code=WSCloseCode.GOING_AWAY,
                        message='Server shutdown')


    async def detection_loop(app: web.Application):

        loop = asyncio.get_running_loop()

        logging.info("Opening camera.")

        # camera = cv2.VideoCapture(CAMERA_DEVICE)
        # camera.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        # camera.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
        gst_pipeline = (
                 f"nvarguscamerasrc ! "
                 f"video/x-raw(memory:NVMM), width={width}, height={height}, format=NV12, framerate=30/1 ! "
                 f"nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
                 f"appsink"
        )

        camera = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
        camera.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        logging.info("Loading predictor.")

        def _read_and_encode_image():
            re, image = camera.read()
            if not re:
                return re, None

            shared_state["latest_frame"] = image.copy()  # keep a copy for prediction
            shared_state["frame_count"] += 1

            # draw predictions if available (non-blocking)
            if prompt_data is not None and shared_state["latest_detections"] is not None:
                image = draw_tree_output(image, shared_state["latest_detections"], prompt_data["tree"])

            image_jpeg = bytes(
                cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, IMAGE_QUALITY])[1]
            )
            return re, image_jpeg

        while True:

            re, image = await loop.run_in_executor(None, _read_and_encode_image)
            
            if not re:
                break
            
            for ws in app["websockets"]:
                await ws.send_bytes(image)

        camera.release()

    async def prediction_loop():
        loop = asyncio.get_running_loop()
        while True:
            # Yield to the event loop each iteration
            await asyncio.sleep(0)

            if prompt_data is None or shared_state["latest_frame"] is None:
                # No prompt or frame available yet; sleep briefly to avoid busy-waiting
                await asyncio.sleep(0.01)
                continue

            # Copy latest frame for prediction
            frame_to_predict = shared_state["latest_frame"].copy()
            image_pil = cv2_to_pil(frame_to_predict)

            # Run predictor in thread pool (non-blocking)
            detections = await loop.run_in_executor(
                None,
                lambda: predictor.predict(
                    image_pil,
                    tree=prompt_data['tree'],
                    clip_text_encodings=prompt_data['clip_encodings'],
                    owl_text_encodings=prompt_data['owl_encodings']
                )
            )

            shared_state["latest_detections"] = detections

    async def run_detection_loop(app):
        try:
            task_video = asyncio.create_task(detection_loop(app))
            task_pred = asyncio.create_task(prediction_loop())
    
            yield 
            await asyncio.gather(task_video, task_pred)
        except asyncio.CancelledError:
            pass
        finally:
            task_video.cancel()
            task_pred.cancel()
            await asyncio.gather(task_video, task_pred, return_exceptions=True)


    logging.basicConfig(level=logging.INFO)
    app = web.Application()
    app['websockets'] = weakref.WeakSet()
    app.router.add_get("/", handle_index_get)
    app.router.add_route("GET", "/ws", websocket_handler)
    app.on_shutdown.append(on_shutdown)
    app.cleanup_ctx.append(run_detection_loop)
    web.run_app(app, host=args.host, port=args.port)

Need Further Support?

📞 Book Consultation Call: https://proventusnova.com/contact-us/

📩 Contact Via Email: support@proventusnova.com

🌐 Visit Our Website: https://proventusnova.com