The documentation says this model supports visual grounding (object detection and segmentation). What is the best way to use that capability with this model, given that (as I understand it) Llama only outputs text tokens?
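For context, here is the kind of thing I would expect: a minimal sketch assuming the model emits boxes as inline text tokens in a Qwen-VL-style `<box>` format with coordinates normalized to a 0-1000 grid. That tag format is my assumption, not this model's confirmed output convention.

```python
import re

# Assumed convention: some grounded VLMs encode detections as plain text,
# e.g. "<box>(x1,y1),(x2,y2)</box>" with coordinates on a 0-1000 grid.
# This format is hypothetical for this model.
BOX_PATTERN = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_boxes(text: str, img_w: int, img_h: int):
    """Extract pixel-space bounding boxes from grounded text output."""
    boxes = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(g) for g in m.groups())
        # Rescale from the normalized 0-1000 grid to image pixels.
        boxes.append((x1 * img_w / 1000, y1 * img_h / 1000,
                      x2 * img_w / 1000, y2 * img_h / 1000))
    return boxes

# e.g. model output: "The dog <box>(120,340),(560,910)</box> is on the left."
print(parse_boxes("The dog <box>(120,340),(560,910)</box>", 1024, 768))
```

Is this roughly how it works here, i.e. grounding is encoded in the text and parsed afterwards, or does the model expose detections some other way?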
Same question here.