Introduction

Setting up auto-labeling is a long and tedious process. Large parts of it can be automated if your model is connected via MLflow, and this project helps you set such a system up in a few minutes.

Here's what it does:

  • Connects your MLflow-registered models directly with Label Studio for auto-labeling.
  • No need for complex backend setup or manual server configs.
  • Comes with pre-configured models for common tasks like polygon segmentation.
  • You can customize it with your own hooks.
  • Works out-of-the-box with DagsHub's free hosted MLflow and Label Studio.

Demo video

Users have two points of injection: a pre_hook and a post_hook. The pre_hook takes as input a local filepath to the downloaded datapoint for annotation; its output is forwarded to the MLflow model for prediction, and the model's prediction is finally forwarded to the post_hook for conversion to the LS format.

The pre_hook is optional, and defaults to the identity function lambda x: x.
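
Concretely, both hooks are plain Python callables. A minimal sketch of their signatures is shown below; the pre_hook mirrors the identity default described above, while the post_hook body is left as a placeholder (see Building Post Hooks for the structure it must return):

def pre_hook(filepath):
    # Receives a local filepath to the downloaded datapoint and returns
    # whatever your MLflow model expects as input. The default is the
    # identity function, so the path is forwarded unchanged.
    return filepath

def post_hook(prediction):
    # Receives the MLflow model's prediction and must convert it into the
    # Label Studio (LS) predictions format, described under
    # Building Post Hooks below.
    ...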

Steps for Setup

  1. Pull the git submodules: git submodule update --init.
  2. From the project root, build a Docker image tagged configurable-ls-backend: docker build . -t configurable-ls-backend
  3. From here, you can either run a single Docker container, or the container orchestrator (multiple containers with multiple backends).
     a. Docker container: docker run -p <port-of-choice>:9090 configurable-ls-backend
     b. Orchestrator: flask --app orchestrator run

The backend is now ready. Now we move to the client.

  1. pip install dagshub

Once this is working, you're ready to use any MLflow model as an LS backend. The last thing left to supply is the hooks: one that processes filepaths into the desired model input, and one that takes the predictions from an MLflow model and converts them into the Label Studio format. Refer to the following section for details on building a post hook.

  2. Since datapoints (which are sent for annotation) are each associated with a datasource, you must first initialize a datasource before you can add an annotation model to the desired annotation project.
In [1]: from dagshub.data_engine import datasources
In [2]: from hooks.polygon_segmentation import post_hook
In [3]: ds = datasources.get_datasource('username/repo', 'datasource_name')
  3. To add an annotation model, specify the repo it's registered under, the model name, and the post hook. This returns an endpoint URL that you can forward and add to Label Studio. Optionally, you can also provide an ngrok token and a project name, and the client will forward and add the endpoint for you.
In [4]: ds.add_annotation_model('username/repo', 'model_name', post_hook)

For more information about additional options you can supply, refer to help(ds.add_annotation_model).

Building Post Hooks

The key task that remains is setting up a post_hook. This can be tricky, because failure is not always explicit. Refer to the tips for debugging below to ease that process.

The key idea is that the model is expected to return a list of predictions, one for each annotation task (different image, different prediction).

A prediction consists of a dictionary containing result, score, and model_version keys.

The result key contains a list of results (e.g. multiple instances of an object in a single image). Each result contains an id that must be generated randomly, information about the target, the type of the prediction, and the value of the prediction itself. While the values passed vary between tasks, the overall key structure is retained, and following it is crucial for everything to render correctly.

An example of a predictions JSON is as follows (points trimmed for convenience):

  "predictions": [
    {
      "id": 30,
      "model_version": "0.0.1",
      "created_ago": "23 hours, 41 minutes",
      "result": [
        {
          "id": "f346",
          "type": "polygonlabels",
          "value": {
            "score": 0.8430982828140259,
            "closed": true,
            "points": [ ... ],
            "polygonlabels": [
              "giraffe"
            ]
          },
          "to_name": "image",
          "readonly": false,
          "from_name": "label",
          "image_rotation": 0,
          "original_width": 426,
          "original_height": 640
        }
      ],
      "score": 0.8430982828140259,
      "cluster": null,
      "neighbors": null,
      "mislabeling": 0,
      "created_at": "2024-07-16T12:56:49.517014Z",
      "updated_at": "2024-07-16T12:56:49.517042Z",
      "task": 7,
      "project": 3
    }
  ]
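
As a sketch, a post_hook for polygon segmentation could assemble this structure along the following lines. This is not the repo's hooks.polygon_segmentation implementation; the shape of the raw model output assumed here (a list of instances, each a dict with points, label, and score keys) is made up for illustration, and the hook is assumed to be called once per annotation task. Only the result, score, and model_version keys are returned, since the remaining fields in the JSON above (id, task, project, timestamps) appear to be filled in by Label Studio itself.

import uuid

def post_hook(model_output):
    # Assumed model output: a list of detected instances, each a dict with
    # "points", "label", and "score" keys. Adapt this to your own model.
    results = []
    scores = []
    for instance in model_output:
        results.append({
            "id": uuid.uuid4().hex[:4],        # randomly generated id
            "type": "polygonlabels",
            "from_name": "label",
            "to_name": "image",
            "original_width": 426,             # replace with the real image size
            "original_height": 640,
            "image_rotation": 0,
            "readonly": False,
            "value": {
                "closed": True,
                "points": instance["points"],
                "polygonlabels": [instance["label"]],
                "score": instance["score"],
            },
        })
        scores.append(instance["score"])

    # One prediction per task, wrapping all instance results.
    return [{
        "model_version": "0.0.1",
        "score": max(scores) if scores else 0.0,  # aggregate of instance scores
        "result": results,
    }]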

Tips for Debugging & Hassle-Free Development

  1. Use label-studio-ml start . locally so you don't have to rebuild your Docker container after every change.
  2. A local instance of Label Studio is helpful for understanding cases where the prediction did not render correctly. My recommendation is to inject IPython.embed() strategically within predict_tasks from label_studio.ml.models to identify whether there's a discrepancy between what you expect and what you see. For this to work within model.py, change tasks (on L30) to a list containing a path that you know contains valid targets.
     a. If you opt for this, use a separate virtual environment for label-studio.
  3. Remember that for cloudpickle to work, the Docker container needs to be set up with the same cloudpickle version as the client used to send the command to the container, so you may have to rebuild the container to match. You can test whether this is working correctly by running print(inspect.getsource(self.post_hook)) in model.py.
  4. Include all the dependencies your hooks need in the registered MLflow model.
  5. Ensure that the MLflow model you are running works standalone, without unregistered dependencies. Incorrectly registered models may be missing code files, lack a signature, or be missing dependencies.
  6. Once a Docker container is initialized, running configure multiple times resets it completely; the container does not need to be restarted.
  7. For unknown JSON formats, you can use the task source (</>) button in Label Studio's task view to reveal the source JSON, which you can use as a reference for building a functional prediction.
About

A repo of models for Auto-labeling with Label Studio ML backend. Requires working with https://github.com/DagsHub/ls-configurable-model
