Auto-Labeling using ML Models

If you have a project that ingests a lot of data requiring annotation (e.g. any Active Learning project), manually labeling everything gets tedious quickly. One way to ease that strain is to use an ML model to predict the labels, so you only have to cross-check that everything is correct!

DagsHub supports this functionality through LabelStudio's ML Backend - you can add backends to annotation projects, and auto-label away.

Unfortunately, setting up LabelStudio Backends is tedious and sometimes hard to debug. However, if you have a model registered on DagsHub via MLflow, we can simplify a large chunk of that process using a combination of our client library and a configurable MLflow-first backend.

Besides information about the model's location, name, and version, two functions can be supplied to create an auto-labeler, a pre_hook and a post_hook:

  1. The pre_hook takes as input a local filepath to the downloaded datapoint for annotation; its output is forwarded to the MLflow model for prediction.
  2. The output of the MLflow model is then forwarded to the post_hook for conversion to the LS format, as sketched below.
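
For illustration, here is a hedged sketch of what a pair of hooks might look like. The image-loading pre_hook and the surrounding flow are assumptions for this example, not the backend's actual internals:

from PIL import Image  # assumed dependency, for illustration only

def pre_hook(filepath):
    # Hypothetical: load the downloaded datapoint into whatever input
    # the MLflow model expects (here, a PIL image).
    return Image.open(filepath)

def post_hook(prediction):
    # Hypothetical: convert the raw MLflow output into LabelStudio's
    # prediction format (see 'Building Post Hooks' below).
    return prediction  # identity, which is also the default

# Inside the backend, the flow is roughly:
#   post_hook(mlflow_model.predict(pre_hook(filepath)))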

Info

The pre_hook and post_hook arguments are optional, and default to the identity function (lambda x: x).

Tip

Both hooks are sent using cloudpickle. For this to work, the Python version of the docker container should match that of the client. If you get a SIGSEGV during inference or cloudpickle errors, a discrepancy between versions is likely the root cause.

Setting up your Auto-labeling Backend

  1. Clone and enter the repository:
    git clone https://github.com/DagsHub/ls-configurable-model ; cd ls-configurable-model
    
  2. Pull git submodules:
    git submodule update --init
    
  3. From the project root, build a docker image with the tag configurable-ls-backend:
    docker build . -t configurable-ls-backend
    
  4. From here, you can either run a single docker container, or the orchestrator (multiple containers serving multiple backends).

    1. Docker container:

      docker run -p 9090:9090 configurable-ls-backend
      

      Info

      Port 9090 is the default. You can map it to a different port if necessary.

    2. Orchestrator:

      flask --app orchestrator run
      

  5. (Optional) If you have any configurable environment variables, copy those to the docker container:

    docker cp path/to/.env <container-id>:/app/
    
    These will automatically be loaded into the runtime environment before the MLflow model is called.

  6. The backend is now ready. Now we move to the client, which we can install using:

    pip install dagshub
    

    Once this is working, you're ready to use any MLflow model as a LS backend. The last thing left to supply is our hooks: one that processes filepaths into the desired model input, and one that takes the predictions from the MLflow model and converts them into the LabelStudio format. Refer to the following section for details on building a post hook.

  7. Since datapoints (which are sent for annotation) are each associated with a datasource, you must first initialize a datasource before you can add an annotation model to a desired annotation project.

    from dagshub.data_engine import datasources
    ds = datasources.get_datasource('username/repo', 'datasource_name')
    

  8. To add an annotation model, specify the repo it's registered under, the model name, as well as the post hook. This returns an endpoint URL you can forward and add to LabelStudio. Optionally, you can also provide an ngrok token and a project name, and the client will forward and add the endpoint for you.

    ds.add_annotation_model('username/repo', 'model_name', post_hook=<post_hook_function>)
    

Example

!docker run -p 9999:9090 configurable-ls-backend

from dagshub.data_engine import datasources
from preconfigured_models.image_to_image.polygon_segmentation.yolov8 import post_hook  # yolov8.py -> https://github.com/DagsHub/ls-configurable-model/blob/main/preconfigured_models/image_to_image/polygon_segmentation/yolov8.py

ds = datasources.get_datasource('jinensetpal/COCO_1K', 'COCO_1K')
ds.add_annotation_model('jinensetpal/COCO_1K', 'yolov8-seg',
                        version='4',  # model version in the MLflow model registry
                        post_hook=post_hook,
                        pre_hook=lambda x: x,
                        port=9999,
                        project_name='<label_project_name>',  # in this example, must be a pre-existing label project
                        ngrok_authtoken='<ngrok_token>')

For more information about optional arguments, refer to the docstring: help(ds.add_annotation_model).

Info

If you plan to run your annotator locally, you can skip steps 1-5 (creating a LS backend), and call query_result.annotate_with_mlflow_model to directly upload LS annotations to the Data Engine.
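
A minimal sketch of that flow, assuming annotate_with_mlflow_model mirrors add_annotation_model's repo/model/hook arguments (check its docstring for the exact signature):

from dagshub.data_engine import datasources

ds = datasources.get_datasource('username/repo', 'datasource_name')
query_result = ds.all()  # fetch every datapoint in the datasource

# Hypothetical call, assumed to mirror add_annotation_model's arguments;
# the identity post_hook is shown for brevity, use a real one in practice.
query_result.annotate_with_mlflow_model('username/repo', 'model_name',
                                        post_hook=lambda x: x)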

Building Post Hooks

The key task that remains is setting up a post_hook. This can be tricky, because failure is not always explicit. Refer to the following sections for tips on debugging to ease that process.

The key idea is that the backend expects a list of predictions for each annotation task (different image, different prediction).

A prediction consists of a dictionary containing result, score, and model_version keys.

The result key contains a list of results (e.g. multiple instances of an object in a single image), each of which contains an id that must be generated randomly, information about the target, the type of the prediction, as well as the value of the prediction itself. While the values passed vary between tasks, the overall key structure is retained, and following it is crucial to having everything render correctly.

An example prediction JSON for polygon segmentation is as follows (points trimmed for brevity):

  "predictions": [
    {
      "id": 30,
      "model_version": "0.0.1",
      "created_ago": "23 hours, 41 minutes",
      "result": [
        {
          "id": "f346",
          "type": "polygonlabels",
          "value": {
            "score": 0.8430982828140259,
            "closed": true,
            "points": [
                [
                    60.15625,
                    14.553991317749023
                ],
                [
                    60.15625,
                    16.19718360900879
                ],
                ...,
            ],
            "polygonlabels": [
              "giraffe"
            ]
          },
          "to_name": "image",
          "readonly": false,
          "from_name": "label",
          "image_rotation": 0,
          "original_width": 426,
          "original_height": 640
        }
      ],
      "score": 0.8430982828140259,
      "cluster": null,
      "neighbors": null,
      "mislabeling": 0,
      "created_at": "2024-07-16T12:56:49.517014Z",
      "updated_at": "2024-07-16T12:56:49.517042Z",
      "task": 7,
      "project": 3
    }
  ]
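
To make this concrete, below is a minimal, hypothetical post hook for polygon segmentation. The assumed model output format (a list of dicts with points, label, and score keys) and the hard-coded image dimensions are illustrative assumptions, not the output of any specific registered model:

import uuid

def post_hook(prediction):
    # 'prediction' is assumed to look like:
    # [{'points': [[x, y], ...], 'label': 'giraffe', 'score': 0.84}, ...]
    # with points given as percentages of the image dimensions.
    results = [{
        'id': uuid.uuid4().hex[:4],  # randomly generated id, required by LS
        'type': 'polygonlabels',
        'value': {'score': instance['score'],
                  'closed': True,
                  'points': instance['points'],
                  'polygonlabels': [instance['label']]},
        'to_name': 'image',
        'from_name': 'label',
        'image_rotation': 0,
        'original_width': 426,   # substitute the real image dimensions
        'original_height': 640,
    } for instance in prediction]

    return [{'result': results,
             'score': max((r['value']['score'] for r in results), default=0.0),
             'model_version': '0.0.1'}]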

Using Pre-Configured Models

To get started quickly with auto-labeling, you can also use our library of pre-configured models, which can be set up with essentially one line of Python.

It automatically sets up a LabelStudio project, configures the backend, and adds it to the project, ready for auto-labeling!

First, the general setup remains the same; follow steps 1-6. You should have a datasource to which you'd like to add the auto-labeling backend.

Next, find the task you'd like to run and follow the code snippet:

Polygon Segmentation

from preconfigured_models.image_to_image.polygon_segmentation import get_config
ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and its config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>') # optional

Automatic Speech Recognition

from preconfigured_models.audio_to_text.automatic_speech_recognition import get_config
ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and its config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>') # optional

Optical Character Recognition

from preconfigured_models.image_to_text.ocr import get_config
ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and its config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional

Once the process runs, you should be good to go!

Tips for Debugging & Hassle-Free Development

  1. Use label-studio-ml start . locally to avoid rebuilding your docker container after every change.
  2. A local instance of Label Studio is helpful for understanding cases where predictions do not render correctly. We recommend injecting IPython.embed() strategically within predict_tasks from label_studio.ml.models to identify any discrepancy between what you expect and what you see. For this to work within model.py, change tasks on L30 to a list containing a path that you know contains valid targets. If you opt for this, use a separate virtual environment for label-studio.
  3. Remember that for cloudpickle to work, the docker container must run the same Python version as the client sending commands to it; you may have to rebuild the container to match. You can verify this is working correctly by adding print(inspect.getsource(self.post_hook)) to model.py.
  4. Include all the dependencies your hooks require in the registered MLflow model.
  5. Ensure that the MLflow model that you are running works dependency-free. Incorrectly registered models may be missing code files, lack a signature, or be missing dependencies.
  6. Once you initialize a docker container, re-running configure resets it completely; the docker container does not need to be restarted.
  7. For unknown JSON formats, you can use the task source "</>" button in LabelStudio's task view to reveal the source JSON, which you can use as a reference to build a functional prediction.
  8. If you are importing a module as your hook, use cloudpickle.register_pickle_by_value(module) to ensure it does not forward just the reference and fail as a result (see the sketch after this list).
  9. It helps to follow the internal logs of the container. You can do this by running the container interactively with a pseudo-TTY (docker run -p <port-of-choice>:9090 -it configurable-ls-backend). Alternatively, you can initialize the container normally and then follow the logs: docker logs -f <container-id>
  10. To manage multiple Python versions of the docker container, you can tag builds with the Python version, e.g. ... -t configurable-ls-backend:3.12, and accordingly run configurable-ls-backend:3.12, avoiding a rebuild for every use case.
  11. Instead of having docker install packages every time you re-run the container, you can mount a volume to share the cache with your local machine by passing -v /path/to/user/.cache/uv:/root/.cache/uv.
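
Expanding on tip 8, a short sketch of registering a hook module by value (my_hooks is a hypothetical module name):

import cloudpickle
import my_hooks  # hypothetical module containing your post_hook

# Serialize the module by value, so the container doesn't try to resolve
# an import that only exists on your machine.
cloudpickle.register_pickle_by_value(my_hooks)

ds.add_annotation_model('username/repo', 'model_name',
                        post_hook=my_hooks.post_hook)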