nbaldwin committed on
Commit
ae51174
1 Parent(s): c296fdd

readme + demo

Files changed (6):
  1. README.md +153 -3
  2. VisionAtomicFlow.py +116 -2
  3. VisionAtomicFlow.yaml +11 -7
  4. __init__.py +1 -1
  5. demo.yaml +20 -0
  6. run.py +91 -0
README.md CHANGED
@@ -1,3 +1,153 @@
- ---
- license: mit
- ---
+ # Table of Contents
+ 
+ * [VisionAtomicFlow](#VisionAtomicFlow)
+   * [VisionAtomicFlow](#VisionAtomicFlow.VisionAtomicFlow)
+     * [get\_image](#VisionAtomicFlow.VisionAtomicFlow.get_image)
+     * [get\_video](#VisionAtomicFlow.VisionAtomicFlow.get_video)
+     * [get\_user\_message](#VisionAtomicFlow.VisionAtomicFlow.get_user_message)
+ 
+ <a id="VisionAtomicFlow"></a>
+ 
+ # VisionAtomicFlow
+ 
+ <a id="VisionAtomicFlow.VisionAtomicFlow"></a>
+ 
+ ## VisionAtomicFlow Objects
+ 
+ ```python
+ class VisionAtomicFlow(ChatAtomicFlow)
+ ```
+ 
+ This class implements the atomic flow for the VisionFlowModule. It is a flow that, given a textual input and a set of images and/or videos, generates a textual output.
+ It uses the litellm library as a backend. See https://docs.litellm.ai/docs/providers for supported models and APIs.
+ 
+ *Configuration Parameters*:
+ 
+ - `name` (str): The name of the flow. Default: "VisionAtomicFlow"
+ - `description` (str): A description of the flow, used to generate its help message.
+   Default: "A flow that, given a textual input, and a set of images and/or videos, generates a textual output."
+ - `enable_cache` (bool): If True, the flow will use the cache. Default: True
+ - `n_api_retries` (int): The number of times to retry the API call in case of failure. Default: 6
+ - `wait_time_between_api_retries` (int): The time to wait between API retries, in seconds. Default: 20
+ - `system_name` (str): The name of the system. Default: "system"
+ - `user_name` (str): The name of the user. Default: "user"
+ - `assistant_name` (str): The name of the assistant. Default: "assistant"
+ - `backend` (Dict[str, Any]): The configuration of the backend, which is used to fetch API keys. Default: LiteLLMBackend with the
+   default parameters of ChatAtomicFlow (see the flow card of ChatAtomicFlowModule), except for the following parameters, whose default values are overwritten:
+   - `api_infos` (List[Dict[str, Any]]): The list of API infos. No default value; this parameter is required.
+   - `model_name` (Union[Dict[str,str],str]): The name of the model to use.
+     When using multiple API providers, model_name can be a dictionary of the form
+     {"provider_name": "model_name"}.
+     Default: "gpt-4-vision-preview" (the name must match a litellm model name; see https://docs.litellm.ai/docs/providers).
+   - `n` (int): The number of answers to generate. Default: 1
+   - `max_tokens` (int): The maximum number of tokens to generate. Default: 2000
+   - `temperature` (float): The sampling temperature to use. Default: 0.3
+   - `top_p` (float): An alternative to sampling with temperature; the model considers only the tokens within the top_p probability mass. Default: 0.2
+   - `frequency_penalty` (float): The higher this value, the less likely the model is to repeat itself. Default: 0.0
+   - `presence_penalty` (float): The higher this value, the more likely the model is to talk about new topics. Default: 0.0
+ - `system_message_prompt_template` (Dict[str,Any]): The template used to generate the system message.
+   By default it is of type flows.prompt_template.JinjaPrompt.
+   None of the parameters of the prompt are defined by default, so they need to be defined if one wants to use the system prompt.
+   Default parameters are defined in flows.prompt_template.jinja2_prompts.JinjaPrompt.
+ - `init_human_message_prompt_template` (Dict[str,Any]): The prompt template of the human/user message used to initialize the conversation
+   (the first time in). It is used to generate the human message and is passed as the user message to the LLM.
+   By default it is of type flows.prompt_template.JinjaPrompt. None of the parameters of the prompt are defined by default, so they need to be defined
+   if one wants to use the init_human_message_prompt_template. Default parameters are defined in flows.prompt_template.jinja2_prompts.JinjaPrompt.
+ - `previous_messages` (Dict[str,Any]): Defines which previous messages to include in the input of the LLM. Note that if `first_k` and `last_k` are both None,
+   all the messages of the flow's history are added to the input of the LLM. Default:
+   - `first_k` (int): If defined, adds the first_k earliest messages of the flow's chat history to the input of the LLM. Default: None
+   - `last_k` (int): If defined, adds the last_k latest messages of the flow's chat history to the input of the LLM. Default: None
+ - Other parameters are inherited from the default configuration of ChatAtomicFlow (see the flow card of ChatAtomicFlowModule).
+ 
+ *Input Interface Initialized (expected input the first time in the flow)*:
+ 
+ - `query` (str): The textual query to run the model on.
+ - `data` (Dict[str, Any]): The data (images or video) to run the model on. It can contain the following keys:
+   - `images` (List[Dict[str, Any]]): A list of images to run the model on. Each image is a dictionary with the following keys:
+     - `type` (str): The type of the image. It can be "local_path" or "url".
+     - `image` (str): The image. If type is "local_path", it is a local path to the image; if type is "url", it is a URL to the image.
+   - `video` (Dict[str, Any]): A video to run the model on. It is a dictionary with the following keys:
+     - `video_path` (str): The path to the video.
+     - `resize` (int): The resize to apply to the frames of the video.
+     - `frame_step_size` (int): The step size between the frames of the video (to send to the model).
+     - `start_frame` (int): The first frame of the video (to send to the model).
+     - `end_frame` (int): The last frame of the video (to send to the model).
+ 
+ *Input Interface (expected input after the first time in the flow)*:
+ 
+ - `query` (str): The textual query to run the model on.
+ - `data` (Dict[str, Any]): The data (images or video) to run the model on. It can contain the following keys:
+   - `images` (List[Dict[str, Any]]): A list of images to run the model on. Each image is a dictionary with the following keys:
+     - `type` (str): The type of the image. It can be "local_path" or "url".
+     - `image` (str): The image. If type is "local_path", it is a local path to the image; if type is "url", it is a URL to the image.
+   - `video` (Dict[str, Any]): A video to run the model on. It is a dictionary with the following keys:
+     - `video_path` (str): The path to the video.
+     - `resize` (int): The resize to apply to the frames of the video.
+     - `frame_step_size` (int): The step size between the frames of the video (to send to the model).
+     - `start_frame` (int): The first frame of the video (to send to the model).
+     - `end_frame` (int): The last frame of the video (to send to the model).
+ 
+ *Output Interface*:
+ 
+ - `api_output` (str): The API output of the flow for the given query and data.
+ 
+ <a id="VisionAtomicFlow.VisionAtomicFlow.get_image"></a>
+ 
+ #### get\_image
+ 
+ ```python
+ @staticmethod
+ def get_image(image)
+ ```
+ 
+ This method returns an image in the appropriate format for the API.
+ 
+ **Arguments**:
+ 
+ - `image` (`Dict[str, Any]`): The image dictionary.
+ 
+ **Returns**:
+ 
+ `Dict[str, Any]`: The image URL.
+ 
+ <a id="VisionAtomicFlow.VisionAtomicFlow.get_video"></a>
+ 
+ #### get\_video
+ 
+ ```python
+ @staticmethod
+ def get_video(video)
+ ```
+ 
+ This method returns the video in the appropriate format for the API.
+ 
+ **Arguments**:
+ 
+ - `video` (`Dict[str, Any]`): The video dictionary.
+ 
+ **Returns**:
+ 
+ `Dict[str, Any]`: The video URL.
+ 
+ <a id="VisionAtomicFlow.VisionAtomicFlow.get_user_message"></a>
+ 
+ #### get\_user\_message
+ 
+ ```python
+ @staticmethod
+ def get_user_message(prompt_template, input_data: Dict[str, Any])
+ ```
+ 
+ This method constructs the user message to be passed to the API.
+ 
+ **Arguments**:
+ 
+ - `prompt_template` (`PromptTemplate`): The prompt template to use.
+ - `input_data` (`Dict[str, Any]`): The input data.
+ 
+ **Returns**:
+ 
+ `Dict[str, Any]`: The constructed user message (images, videos, and text).
VisionAtomicFlow.py CHANGED
@@ -1,14 +1,96 @@
  
  from typing import Dict, Any
- from flow_modules.aiflows.OpenAIChatFlowModule import OpenAIChatAtomicFlow
+ from flow_modules.aiflows.ChatFlowModule import ChatAtomicFlow
  from flows.utils.general_helpers import encode_image,encode_from_buffer
  import cv2
  
  
- class VisionAtomicFlow(OpenAIChatAtomicFlow):
- 
+ class VisionAtomicFlow(ChatAtomicFlow):
+     """ This class implements the atomic flow for the VisionFlowModule. It is a flow that, given a textual input and a set of images and/or videos, generates a textual output.
+     It uses the litellm library as a backend. See https://docs.litellm.ai/docs/providers for supported models and APIs.
+ 
+     *Configuration Parameters*:
+ 
+     - `name` (str): The name of the flow. Default: "VisionAtomicFlow"
+     - `description` (str): A description of the flow, used to generate its help message.
+       Default: "A flow that, given a textual input, and a set of images and/or videos, generates a textual output."
+     - `enable_cache` (bool): If True, the flow will use the cache. Default: True
+     - `n_api_retries` (int): The number of times to retry the API call in case of failure. Default: 6
+     - `wait_time_between_api_retries` (int): The time to wait between API retries, in seconds. Default: 20
+     - `system_name` (str): The name of the system. Default: "system"
+     - `user_name` (str): The name of the user. Default: "user"
+     - `assistant_name` (str): The name of the assistant. Default: "assistant"
+     - `backend` (Dict[str, Any]): The configuration of the backend, which is used to fetch API keys. Default: LiteLLMBackend with the
+       default parameters of ChatAtomicFlow (see the flow card of ChatAtomicFlowModule), except for the following parameters, whose default values are overwritten:
+       - `api_infos` (List[Dict[str, Any]]): The list of API infos. No default value; this parameter is required.
+       - `model_name` (Union[Dict[str,str],str]): The name of the model to use.
+         When using multiple API providers, model_name can be a dictionary of the form
+         {"provider_name": "model_name"}.
+         Default: "gpt-4-vision-preview" (the name must match a litellm model name; see https://docs.litellm.ai/docs/providers).
+       - `n` (int): The number of answers to generate. Default: 1
+       - `max_tokens` (int): The maximum number of tokens to generate. Default: 2000
+       - `temperature` (float): The sampling temperature to use. Default: 0.3
+       - `top_p` (float): An alternative to sampling with temperature; the model considers only the tokens within the top_p probability mass. Default: 0.2
+       - `frequency_penalty` (float): The higher this value, the less likely the model is to repeat itself. Default: 0.0
+       - `presence_penalty` (float): The higher this value, the more likely the model is to talk about new topics. Default: 0.0
+     - `system_message_prompt_template` (Dict[str,Any]): The template used to generate the system message.
+       By default it is of type flows.prompt_template.JinjaPrompt.
+       None of the parameters of the prompt are defined by default, so they need to be defined if one wants to use the system prompt.
+       Default parameters are defined in flows.prompt_template.jinja2_prompts.JinjaPrompt.
+     - `init_human_message_prompt_template` (Dict[str,Any]): The prompt template of the human/user message used to initialize the conversation
+       (the first time in). It is used to generate the human message and is passed as the user message to the LLM.
+       By default it is of type flows.prompt_template.JinjaPrompt. None of the parameters of the prompt are defined by default, so they need to be defined
+       if one wants to use the init_human_message_prompt_template. Default parameters are defined in flows.prompt_template.jinja2_prompts.JinjaPrompt.
+     - `previous_messages` (Dict[str,Any]): Defines which previous messages to include in the input of the LLM. Note that if `first_k` and `last_k` are both None,
+       all the messages of the flow's history are added to the input of the LLM. Default:
+       - `first_k` (int): If defined, adds the first_k earliest messages of the flow's chat history to the input of the LLM. Default: None
+       - `last_k` (int): If defined, adds the last_k latest messages of the flow's chat history to the input of the LLM. Default: None
+     - Other parameters are inherited from the default configuration of ChatAtomicFlow (see the flow card of ChatAtomicFlowModule).
+ 
+     *Input Interface Initialized (expected input the first time in the flow)*:
+ 
+     - `query` (str): The textual query to run the model on.
+     - `data` (Dict[str, Any]): The data (images or video) to run the model on. It can contain the following keys:
+       - `images` (List[Dict[str, Any]]): A list of images to run the model on. Each image is a dictionary with the following keys:
+         - `type` (str): The type of the image. It can be "local_path" or "url".
+         - `image` (str): The image. If type is "local_path", it is a local path to the image; if type is "url", it is a URL to the image.
+       - `video` (Dict[str, Any]): A video to run the model on. It is a dictionary with the following keys:
+         - `video_path` (str): The path to the video.
+         - `resize` (int): The resize to apply to the frames of the video.
+         - `frame_step_size` (int): The step size between the frames of the video (to send to the model).
+         - `start_frame` (int): The first frame of the video (to send to the model).
+         - `end_frame` (int): The last frame of the video (to send to the model).
+ 
+     *Input Interface (expected input after the first time in the flow)*:
+ 
+     - `query` (str): The textual query to run the model on.
+     - `data` (Dict[str, Any]): The data (images or video) to run the model on. It can contain the following keys:
+       - `images` (List[Dict[str, Any]]): A list of images to run the model on. Each image is a dictionary with the following keys:
+         - `type` (str): The type of the image. It can be "local_path" or "url".
+         - `image` (str): The image. If type is "local_path", it is a local path to the image; if type is "url", it is a URL to the image.
+       - `video` (Dict[str, Any]): A video to run the model on. It is a dictionary with the following keys:
+         - `video_path` (str): The path to the video.
+         - `resize` (int): The resize to apply to the frames of the video.
+         - `frame_step_size` (int): The step size between the frames of the video (to send to the model).
+         - `start_frame` (int): The first frame of the video (to send to the model).
+         - `end_frame` (int): The last frame of the video (to send to the model).
+ 
+     *Output Interface*:
+ 
+     - `api_output` (str): The API output of the flow for the given query and data.
+ 
+     """
      @staticmethod
      def get_image(image):
+         """ This method returns an image in the appropriate format for the API.
+ 
+         :param image: The image dictionary.
+         :type image: Dict[str, Any]
+         :return: The image URL.
+         :rtype: Dict[str, Any]
+         """
          extension_dict = {
              "jpg": "jpeg",
              "jpeg": "jpeg",
@@ -34,6 +116,13 @@ class VisionAtomicFlow(OpenAIChatAtomicFlow):
  
      @staticmethod
      def get_video(video):
+         """ This method returns the video in the appropriate format for the API.
+ 
+         :param video: The video dictionary.
+         :type video: Dict[str, Any]
+         :return: The video URL.
+         :rtype: Dict[str, Any]
+         """
          video_path = video["video_path"]
          resize = video.get("resize",768)
          frame_step_size = video.get("frame_step_size",10)
@@ -52,6 +141,15 @@ class VisionAtomicFlow(OpenAIChatAtomicFlow):
  
      @staticmethod
      def get_user_message(prompt_template, input_data: Dict[str, Any]):
+         """ This method constructs the user message to be passed to the API.
+ 
+         :param prompt_template: The prompt template to use.
+         :type prompt_template: PromptTemplate
+         :param input_data: The input data.
+         :type input_data: Dict[str, Any]
+         :return: The constructed user message (images, videos, and text).
+         :rtype: Dict[str, Any]
+         """
          content = VisionAtomicFlow._get_message(prompt_template=prompt_template,input_data=input_data)
          media_data = input_data["data"]
          if "video" in media_data:
@@ -63,6 +161,15 @@ class VisionAtomicFlow(OpenAIChatAtomicFlow):
  
      @staticmethod
      def _get_message(prompt_template, input_data: Dict[str, Any]):
+         """ This method constructs the textual message to be passed to the API.
+ 
+         :param prompt_template: The prompt template to use.
+         :type prompt_template: PromptTemplate
+         :param input_data: The input data.
+         :type input_data: Dict[str, Any]
+         :return: The constructed textual message.
+         :rtype: Dict[str, Any]
+         """
          template_kwargs = {}
          for input_variable in prompt_template.input_variables:
              template_kwargs[input_variable] = input_data[input_variable]
@@ -70,6 +177,13 @@ class VisionAtomicFlow(OpenAIChatAtomicFlow):
          return [{"type": "text", "text": msg_content}]
  
      def _process_input(self, input_data: Dict[str, Any]):
+         """ This method processes the input data (prepares the messages to send to the API).
+ 
+         :param input_data: The input data.
+         :type input_data: Dict[str, Any]
+         :return: The processed input data.
+         :rtype: Dict[str, Any]
+         """
          if self._is_conversation_initialized():
              # Construct the message using the human message prompt template
              user_message_content = self.get_user_message(self.human_message_prompt_template, input_data)
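
The diff elides most of `get_video`'s body. As a rough illustration of the technique it relies on, inferred from the visible imports (`cv2`, `encode_from_buffer`) and parameter defaults, the sketch below samples and base64-encodes video frames with OpenCV. The function name and the exact output format are assumptions, not the module's actual code.

```python
import base64
from typing import Any, Dict, List

import cv2  # OpenCV, also imported by VisionAtomicFlow.py


def sample_video_frames(video: Dict[str, Any]) -> List[str]:
    """Hypothetical stand-in for the frame sampling inside get_video.

    Keeps every `frame_step_size`-th frame between `start_frame` and
    `end_frame`, resizes the longer side to `resize` pixels, and returns
    base64-encoded JPEG strings (an assumed wire format).
    """
    cap = cv2.VideoCapture(video["video_path"])
    resize = video.get("resize", 768)        # default seen in the diff
    step = video.get("frame_step_size", 10)  # default seen in the diff
    start = video.get("start_frame", 0)
    end = video.get("end_frame")             # None = read until the end

    frames, index = [], 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok or (end is not None and index > end):
            break
        if index >= start and (index - start) % step == 0:
            h, w = frame.shape[:2]
            scale = resize / max(h, w)
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    cap.release()
    return frames
```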
VisionAtomicFlow.yaml CHANGED
@@ -1,4 +1,6 @@
- # This is an abstract flow, therefore some required fields are not defined (and must be defined by the concrete flow)
+ name: "VisionAtomicFlow"
+ description: "A flow that, given a textual input, and a set of images and/or videos, generates a textual output."
+ 
  enable_cache: True
  
  n_api_retries: 6
@@ -30,20 +32,22 @@ human_message_prompt_template:
    template: "{{query}}"
    input_variables:
      - "query"
+ 
  input_interface_initialized:
    - "query"
    - "data"
  
- query_message_prompt_template:
-   _target_: flows.prompt_template.JinjaPrompt
- 
- 
  previous_messages:
    first_k: null # Note that the first message is the system prompt
    last_k: null
  
- demonstrations: null
- demonstrations_response_template: null
+ input_interface:
+   - "query"
+   - "data"
+ 
+ input_interface_non_initialized:
+   - "question"
+   - "data"
  
  output_interface:
    - "api_output"
__init__.py CHANGED
@@ -1,6 +1,6 @@
  # ~~~ Specify the dependencies ~~
  dependencies = [
-     {"url": "aiflows/OpenAIChatFlowModule", "revision": "eeec09b71e967ce426553e2300c5689f6ea6a662"}
+     {"url": "aiflows/ChatFlowModule", "revision": "a749ad10ed39776ba6721c37d0dc22af49ca0f17"}
  ]
  from flows import flow_verse
  flow_verse.sync_dependencies(dependencies)
demo.yaml ADDED
@@ -0,0 +1,20 @@
+ flow:
+   _target_: aiflows.VisionFlowModule.VisionAtomicFlow.instantiate_from_default_config
+   name: "Demo Vision Flow"
+   description: "A flow that, given a textual input, and a set of images and/or videos, generates a textual output."
+   backend:
+     api_infos: ???
+ 
+   system_message_prompt_template:
+     template: |2-
+       You are a helpful chatbot that truthfully answers questions.
+     input_variables: []
+     partial_variables: {}
+ 
+   init_human_message_prompt_template:
+     template: |2-
+       {{query}}
+     input_variables: ["query"]
+     partial_variables: {}
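
In `demo.yaml`, `api_infos: ???` is OmegaConf's marker for a mandatory value: instantiation fails unless the field is filled in. `run.py` below does exactly that at runtime; a minimal sketch of the pattern:

```python
import os

from flows.backends.api_info import ApiInfo
from flows.utils.general_helpers import read_yaml_file

# Load demo.yaml and fill in the mandatory ??? field before instantiating the flow.
cfg = read_yaml_file("demo.yaml")
cfg["flow"]["backend"]["api_infos"] = [
    ApiInfo(backend_used="openai", api_key=os.getenv("OPENAI_API_KEY"))
]
```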
run.py ADDED
@@ -0,0 +1,91 @@
+ import os
+ 
+ import hydra
+ 
+ from flows.flow_launchers import FlowLauncher
+ from flows.backends.api_info import ApiInfo
+ from flows.utils.general_helpers import read_yaml_file
+ 
+ from flows import logging
+ from flows.flow_cache import CACHING_PARAMETERS, clear_cache
+ 
+ CACHING_PARAMETERS.do_caching = False  # Set to True in order to enable caching
+ # clear_cache()  # Uncomment this line to clear the cache
+ 
+ logging.set_verbosity_debug()  # Comment out this line to disable verbose logs
+ 
+ from flows import flow_verse
+ 
+ dependencies = [
+     {"url": "aiflows/VisionFlowModule", "revision": os.getcwd()},
+ ]
+ flow_verse.sync_dependencies(dependencies)
+ 
+ if __name__ == "__main__":
+     # ~~~ Set the API information ~~~
+     # OpenAI backend
+     api_information = [ApiInfo(backend_used="openai",
+                                api_key=os.getenv("OPENAI_API_KEY"))]
+ 
+     # # Azure backend
+     # api_information = [ApiInfo(backend_used="azure",
+     #                            api_base=os.getenv("AZURE_API_BASE"),
+     #                            api_key=os.getenv("AZURE_OPENAI_KEY"),
+     #                            api_version=os.getenv("AZURE_API_VERSION"))]
+ 
+     root_dir = "."
+     cfg_path = os.path.join(root_dir, "demo.yaml")
+     cfg = read_yaml_file(cfg_path)
+ 
+     cfg["flow"]["backend"]["api_infos"] = api_information
+ 
+     # ~~~ Instantiate the Flow ~~~
+     flow_with_interfaces = {
+         "flow": hydra.utils.instantiate(cfg["flow"], _recursive_=False, _convert_="partial"),
+         "input_interface": (
+             None
+             if cfg.get("input_interface", None) is None
+             else hydra.utils.instantiate(cfg["input_interface"], _recursive_=False)
+         ),
+         "output_interface": (
+             None
+             if cfg.get("output_interface", None) is None
+             else hydra.utils.instantiate(cfg["output_interface"], _recursive_=False)
+         ),
+     }
+ 
+     url_image = {"type": "url",
+                  "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
+ 
+     local_image = {"type": "local_path", "image": "PATH TO YOUR LOCAL IMAGE"}
+ 
+     video = {"video_path": "PATH TO YOUR LOCAL VIDEO", "resize": 768, "frame_step_size": 30, "start_frame": 0, "end_frame": None}
+ 
+     # ~~~ Get the data ~~~
+ 
+     ## FOR SINGLE IMAGE
+     data = {"id": 0, "query": "What’s in this image?", "data": {"images": [url_image]}}  # This can be a list of samples
+ 
+     ## FOR MULTIPLE IMAGES
+     # data = {"id": 0, "question": "What are in these images? Is there any difference between them?", "data": {"images": [url_image, local_image]}}  # This can be a list of samples
+ 
+     ## FOR VIDEO
+     # data = {"id": 0,
+     #         "question": "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
+     #         "data": {"video": video}}  # This can be a list of samples
+ 
+     # ~~~ Run inference ~~~
+     path_to_output_file = None
+     # path_to_output_file = "output.jsonl"  # Uncomment this line to save the output to disk
+ 
+     _, outputs = FlowLauncher.launch(
+         flow_with_interfaces=flow_with_interfaces,
+         data=data,
+         path_to_output_file=path_to_output_file,
+     )
+ 
+     # ~~~ Print the output ~~~
+     flow_output_data = outputs[0]
+     print(flow_output_data)
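
With `OPENAI_API_KEY` exported (or the Azure variables, if you switch to the commented-out backend), the demo runs as a plain script via `python run.py`: it syncs the flow dependencies, instantiates the flow from `demo.yaml`, and prints the first flow output for the selected sample. Uncomment the alternative `data` blocks to try multiple images or a video instead.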