the model "depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf" doesnt work as expected, produces a full black depthmap

#1
by Abbasid - opened

Hi, I'm trying to use the model as described here, but I keep getting blank, pure black depth maps:

[image: all-black depth map output]

When I go back to the original model "depth-anything/Depth-Anything-V2-Small-hf", I get an accurate depth map:

[image: correct depth map from the original model]

I tried running inference both with the high-level pipeline API and the manual approach, and the result is the same.
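
For reference, the pipeline version of what I run looks roughly like this (a simplified sketch; the test image is just an example):

from transformers import pipeline
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# high-level depth-estimation pipeline
pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf")
result = pipe(image)
result["depth"]  # comes out as an all-black PIL image for me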

what could be the issue?

Hi @Abbasid, it looks like the "metric depth" feature for DepthAnything is not in the 4.44.0 release yet, but you can use it if you update transformers to the latest main as follows:

# update transformers to latest main
!pip install git+https://github.com/huggingface/transformers

from transformers import AutoImageProcessor, AutoModelForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# load the metric (outdoor) checkpoint
image_processor = AutoImageProcessor.from_pretrained("depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf")
model = AutoModelForDepthEstimation.from_pretrained("depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf")

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
) 

# visualize the output (rescaled to 0-255 for display; the raw predicted values stay in the output array)
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)

depth

I believe this is the issue you are facing, since it works for me:
[image: depth map produced by the snippet above]
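
If you want to double-check that your environment picked up the new build, you can print the installed version after restarting the runtime; it should report a dev version newer than 4.44.0 (a quick sanity check, not part of the snippet above):

import transformers
print(transformers.__version__)  # e.g. something like 4.45.0.dev0 rather than 4.44.0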

Hello @bthia97
Thanks, I managed to get it working with the help of
!pip install git+https://github.com/huggingface/transformers
Another question: these values are supposed to represent metric depth, right? So if I get the value 60 at a pixel,
does that mean the object is 60 m away? The values I get for objects of known sizes are way off,
or is something wrong with my understanding?
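
For example, I'm reading per-pixel values roughly like this (a rough sketch continuing from the snippet above; the pixel coordinates are just an example):

# raw model output, before the 0-255 rescaling used for visualization
depth_map = prediction.squeeze().cpu().numpy()
print(depth_map[240, 320])  # value at an example pixel - is this supposed to be meters?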

Also, when we run inference with the pipeline, what's the difference between depth and predicted_depth?
depth is a PIL image with the depth...
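
For reference, this is roughly what I see in the pipeline output (a minimal sketch, reusing the pipe and image from the snippet earlier in my question):

out = pipe(image)
print(type(out["depth"]))            # PIL.Image.Image, rescaled for display
print(type(out["predicted_depth"]))  # torch.Tensor with the raw model output
print(out["predicted_depth"].shape)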
