
How to run the NVIDIA NanoOWL tutorial

Keywords: NVIDIA Jetson, NanoOWL, object detection

Description

This wiki page shows how to run the "tree prediction with a live camera" example from NVIDIA NanoOWL. NanoOWL is a project that optimizes OWL-ViT to run in real time on NVIDIA Jetson Orin platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.

Set up

Minimum requirements:

  • One of the following Jetson devices:
    • Jetson Orin NX (16 GB)
    • Jetson Orin Nano (8 GB)
    • Jetson AGX Orin (32 GB or 64 GB)
  • Running one of the following JetPack versions:
    • JetPack 5 (L4T r35.x)
    • JetPack 6 (L4T r36.x)

This tutorial was tested with the following setup:

  • x86_64-based host machine running Ubuntu 24.04
  • Jetson Orin NX (16 GB) + CTI Boson carrier board running JetPack 6.2.1 (L4T 36.4.4)
  • One Framos IMX464 camera
  • Ethernet cable
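
To confirm which JetPack/L4T release your board is running, you can read the L4T release file (a standard location on Jetson devices; this check is optional):

cat /etc/nv_tegra_release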

1. Connect to the board via ssh:

 
ssh <user>@<board_ip>

You should now have a shell session on the board.

Tutorial

1. Clone and set up jetson-containers:

 
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

2. Run the following to pull or build a compatible container image:

 
jetson-containers run --workdir /opt/nanoowl $(autotag nanoowl)

After running the above command, you should be in the working directory (/opt/nanoowl) inside the container.
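
The demo in step 5 expects a prebuilt TensorRT image encoder engine at data/owl_image_encoder_patch32.engine. If that file is missing in your container, the NanoOWL repository documents building it with a command along these lines (check the NanoOWL README for the exact invocation for your version):

python3 -m nanoowl.build_image_encoder_engine data/owl_image_encoder_patch32.engine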

3. Verify you have a camera connected:

 
ls /dev/video*
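
For more detail than the plain device list, v4l2-ctl can enumerate cameras by name (it ships with the v4l-utils package, which may need to be installed first):

v4l2-ctl --list-devices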

4. Install the missing aiohttp module inside the container:

 
pip install aiohttp

Note: If this takes too long or fails, try the following command:

 
pip install aiohttp --index-url https://pypi.org/simple --prefer-binary

5. Run the tree_demo.py example:

 
cd examples/tree_demo
python3 tree_demo.py --camera 0 --resolution 640x480 ../../data/owl_image_encoder_patch32.engine

You should see the following after the application starts:

 
======== Running on http://0.0.0.0:7860 ========
(Press CTRL+C to quit)

6. On your host PC, open a browser and navigate to:

 
http://<board_ip>:7860
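
If the page does not load, you can first verify that the server is reachable from the host (optional; assumes curl is installed on the host):

curl -I http://<board_ip>:7860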

7. Now you are ready to test some prompts! Use the following as examples (the prompt syntax is explained after this list):

  • [a face]
  • [a face, a ball]
  • [a face [a nose, an eye]]
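
In these prompts, square brackets define a detection level: classes separated by commas are detected side by side, and nesting a bracketed list inside a class detects sub-regions within each detected instance (here, a nose and an eye inside each face). The demo parses the prompt with Tree.from_prompt, the same call used in the server code shown later on this page. A minimal sketch of that parsing step, assuming you are inside the NanoOWL container:

from nanoowl.tree import Tree

# Parse a nested prompt: detect faces, then a nose and an eye within each face
tree = Tree.from_prompt("[a face [a nose, an eye]]")
print(tree)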

Adaptations

  • If you are using a camera that does not support the GStreamer v4l2src element (for example, a CSI-connected camera), you may encounter an error similar to the following:
 
[ WARN:0@15.685] global cap_gstreamer.cpp:2829 handleMessage OpenCV | GStreamer warning: Embedded video playback halted; module v4l2src0 reported: Internal data stream error.

To enable frame capture from such a camera, modify the detection_loop method as follows:

 
        # Delete or comment the three lines of code below
        # camera = cv2.VideoCapture(CAMERA_DEVICE)
        # camera.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        # camera.set(cv2.CAP_PROP_FRAME_HEIGHT, height)

        # Add the following GStreamer pipeline to use nvarguscamerasrc
        gst_pipeline = (
                 f"nvarguscamerasrc ! "
                 f"video/x-raw(memory:NVMM), width={width}, height={height}, format=NV12, framerate=30/1 ! "
                 f"nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
                 f"appsink"
        )

        camera = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
        camera.set(cv2.CAP_PROP_BUFFERSIZE, 1)
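
Before editing the Python code, it can help to confirm that the capture pipeline itself works from a shell on the board (a quick sanity check; fakesink discards the frames, so the pipeline simply running without errors indicates success):

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=640, height=480, format=NV12, framerate=30/1' ! nvvidconv ! 'video/x-raw, format=BGRx' ! videoconvert ! fakesink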

Performance improvements

  • Observation: In the original implementation, both frame capture and model prediction were executed inside the same synchronous function (_read_and_encode_image). This caused the entire video pipeline to block while waiting for the model to finish inference, resulting in several seconds of latency in the browser.
  • Improvement: To resolve this, the prediction logic was moved into a separate asynchronous background task (prediction_loop). This design fully decouples capture from prediction, ensuring that the video stream remains smooth and low-latency, even when inference takes several seconds.

The complete implementation, with the capture loop separated from the prediction loop, is shown below.

 
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import asyncio
import argparse
from aiohttp import web, WSCloseCode
import logging
import weakref
import cv2
import time
import PIL.Image
import matplotlib.pyplot as plt
from typing import List
from nanoowl.tree import Tree
from nanoowl.tree_predictor import (
    TreePredictor
)
from nanoowl.tree_drawing import draw_tree_output
from nanoowl.owl_predictor import OwlPredictor


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("image_encode_engine", type=str)
    parser.add_argument("--image_quality", type=int, default=50)
    parser.add_argument("--port", type=int, default=7860)
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--camera", type=int, default=0)
    parser.add_argument("--resolution", type=str, default="640x480", help="Camera resolution as WIDTHxHEIGHT")
    args = parser.parse_args()
    width, height = map(int, args.resolution.split("x"))

    CAMERA_DEVICE = args.camera
    IMAGE_QUALITY = args.image_quality

    predictor = TreePredictor(
        owl_predictor=OwlPredictor(
            image_encoder_engine=args.image_encode_engine
        )
    )

    prompt_data = None
    
    # Shared state between capture and prediction
    shared_state = {
        "latest_frame": None,
        "latest_detections": None,
        "frame_count": 0
    }

    def get_colors(count: int):
        cmap = plt.cm.get_cmap("rainbow", count)
        colors = []
        for i in range(count):
            color = cmap(i)
            color = [int(255 * value) for value in color]
            colors.append(tuple(color))
        return colors


    def cv2_to_pil(image):
        # Measure the BGR-to-RGB conversion time
        t0 = time.perf_counter_ns()
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        t1 = time.perf_counter_ns()
        dt = (t1 - t0) / 1e9
        logging.info(f"CV2 to PIL time: {dt:.3f}s")
        return PIL.Image.fromarray(image)


    async def handle_index_get(request: web.Request):
        logging.info("handle_index_get")
        return web.FileResponse("./index.html")


    async def websocket_handler(request):

        global prompt_data

        ws = web.WebSocketResponse()

        await ws.prepare(request)

        logging.info("Websocket connected.")

        request.app['websockets'].add(ws)

        try:
            async for msg in ws:
                logging.info(f"Received message from websocket.")
                if "prompt" in msg.data:
                    header, prompt = msg.data.split(":", 1)
                    logging.info("Received prompt: " + prompt)
                    try:
                        tree = Tree.from_prompt(prompt)
                        clip_encodings = predictor.encode_clip_text(tree)
                        owl_encodings = predictor.encode_owl_text(tree)
                        prompt_data = {
                            "tree": tree,
                            "clip_encodings": clip_encodings,
                            "owl_encodings": owl_encodings
                        }
                        logging.info("Set prompt: " + prompt)
                    except Exception as e:
                        print(e)
        finally:
            request.app['websockets'].discard(ws)

        return ws


    async def on_shutdown(app: web.Application):
        for ws in set(app['websockets']):
            await ws.close(code=WSCloseCode.GOING_AWAY,
                        message='Server shutdown')


    async def detection_loop(app: web.Application):

        loop = asyncio.get_running_loop()

        logging.info("Opening camera.")

        # camera = cv2.VideoCapture(CAMERA_DEVICE)
        # camera.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        # camera.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
        gst_pipeline = (
                 f"nvarguscamerasrc ! "
                 f"video/x-raw(memory:NVMM), width={width}, height={height}, format=NV12, framerate=30/1 ! "
                 f"nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
                 f"appsink"
        )

        camera = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
        camera.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        logging.info("Loading predictor.")

        def _read_and_encode_image():
            re, image = camera.read()
            if not re:
                return re, None

            shared_state["latest_frame"] = image.copy()  # keep a copy for prediction
            shared_state["frame_count"] += 1

            # draw predictions if available (non-blocking)
            if prompt_data is not None and shared_state["latest_detections"] is not None:
                image = draw_tree_output(image, shared_state["latest_detections"], prompt_data["tree"])

            image_jpeg = bytes(
                cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, IMAGE_QUALITY])[1]
            )
            return re, image_jpeg

        while True:

            re, image = await loop.run_in_executor(None, _read_and_encode_image)
            
            if not re:
                break
            
            for ws in app["websockets"]:
                await ws.send_bytes(image)

        camera.release()

    async def prediction_loop():
        loop = asyncio.get_running_loop()
        while True:
            # Yield to the event loop each iteration
            await asyncio.sleep(0)

            if prompt_data is None or shared_state["latest_frame"] is None:
                # No prompt or frame available yet; sleep briefly to avoid busy-waiting
                await asyncio.sleep(0.01)
                continue

            # Copy latest frame for prediction
            frame_to_predict = shared_state["latest_frame"].copy()
            image_pil = cv2_to_pil(frame_to_predict)

            # Run predictor in thread pool (non-blocking)
            detections = await loop.run_in_executor(
                None,
                lambda: predictor.predict(
                    image_pil,
                    tree=prompt_data['tree'],
                    clip_text_encodings=prompt_data['clip_encodings'],
                    owl_text_encodings=prompt_data['owl_encodings']
                )
            )

            shared_state["latest_detections"] = detections

    async def run_detection_loop(app):
        try:
            task_video = asyncio.create_task(detection_loop(app))
            task_pred = asyncio.create_task(prediction_loop())
    
            yield 
            await asyncio.gather(task_video, task_pred)
        except asyncio.CancelledError:
            pass
        finally:
            task_video.cancel()
            task_pred.cancel()
            await asyncio.gather(task_video, task_pred, return_exceptions=True)


    logging.basicConfig(level=logging.INFO)
    app = web.Application()
    app['websockets'] = weakref.WeakSet()
    app.router.add_get("/", handle_index_get)
    app.router.add_route("GET", "/ws", websocket_handler)
    app.on_shutdown.append(on_shutdown)
    app.cleanup_ctx.append(run_detection_loop)
    web.run_app(app, host=args.host, port=args.port)

Need Further Support?

📞 Book Consultation Call: https://proventusnova.com/contact-us/

📩 Contact Via Email: support@proventusnova.com

🌐 Visit Our Website: https://proventusnova.com