Auto-Labeling using ML Models¶
If you have a project that ingests a lot of annotation-requiring data (e.g. any Active Learning project), manually labeling everything gets tedious quickly. One way to ease that strain is to use an ML model to predict the labels, so you just have to cross-check that everything is correct!
DagsHub supports this functionality through LabelStudio's ML Backend - you can add backends to annotation projects, and auto-label away.
Unfortunately, setting up LabelStudio backends is tedious and sometimes hard to debug. However, if you have a model registered on DagsHub via MLflow, we can simplify a large chunk of that process using a combination of our client library and a configurable MLflow-first backend.
Besides information about the model location, name, and version, two functions can be supplied to create an auto-labeler: a `post_hook` and a `pre_hook`:

- The `pre_hook` takes as input a local filepath to the downloaded datapoint for annotation, which is then forwarded to the MLflow model for prediction.
- The output from the MLflow model is then forwarded to the `post_hook` for conversion to the LS format.

Info

The `pre_hook` and `post_hook` arguments are optional, and default to the identity function (`lambda x: x`).
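As an illustration, here is a minimal, hypothetical pair of hooks for an image model (the file handling and pass-through output are assumptions; adapt them to your model's actual input and output):

```python
import numpy as np
from PIL import Image

def pre_hook(filepath):
    # Receives the local filepath of the downloaded datapoint;
    # returns whatever your MLflow model's predict() expects.
    return np.asarray(Image.open(filepath))

def post_hook(model_output):
    # Receives the raw MLflow model output; should convert it into
    # the LS prediction format (see 'Building Post Hooks' below).
    return model_output
```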
Tip

Both hooks are sent using `cloudpickle`. For this to work, the Python version of the Docker container should match that of the client. If you get a `SIGSEGV` during inference or `cloudpickle` errors, a discrepancy between versions is likely the root cause.
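One way to sanity-check the match is to compare the two versions directly (a sketch; `<container-id>` is your running backend container):

```python
import subprocess
import sys

# Python version on the client side, where the hooks are pickled.
client = f"{sys.version_info.major}.{sys.version_info.minor}"

# Python version inside the backend container, where the hooks are unpickled.
container = subprocess.run(
    ["docker", "exec", "<container-id>", "python", "-c",
     "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')"],
    capture_output=True, text=True,
).stdout.strip()

print(f"client: {client}, container: {container}, match: {client == container}")
```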
Setting up your Auto-labeling Backend¶
- Clone and enter the repository:

    ```bash
    git clone https://github.com/DagsHub/ls-configurable-model ; cd ls-configurable-model
    ```
- Pull git submodules:

    ```bash
    git submodule update --init
    ```
- From the project root, build a Docker image with the tag `configurable-ls-backend`:

    ```bash
    docker build . -t configurable-ls-backend
    ```
- From here, you can either run a single Docker container, or a container orchestrator (multiple containers with multiple backends).

    - Docker container:

        ```bash
        docker run -p 9090:9090 configurable-ls-backend
        ```

        Info

        9090 is selected as the default port. You can change to a different port if necessary.

    - Orchestrator:

        ```bash
        flask --app orchestrator run
        ```
- (Optional) If you have any configurable environment variables, copy those to the Docker container; they will automatically be loaded into the runtime environment before the MLflow model is called:

    ```bash
    docker cp path/to/.env <container-id>:/app/
    ```
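    For example, a hypothetical `.env` (the variable names here are illustrative, not required by the backend):

    ```bash
    # .env -- loaded into the runtime environment before inference
    MODEL_API_KEY=xxxxxxxx
    CONFIDENCE_THRESHOLD=0.5
    ```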
- The backend is now ready. Next, we move to the client, which we can install using:

    ```bash
    pip install dagshub
    ```
Once this is working, you're ready to use any MLflow model as a LS backend. The last things left to supply are our hooks: one that processes filepaths into the desired model input, and one that takes the predictions from the MLflow model and converts them into the LabelStudio format. Refer to the following section for details on building a post hook.
- Since datapoints (which are sent for annotation) are each associated with a datasource, you must first initialize a datasource before you can add an annotation model to the desired annotation project:

    ```python
    from dagshub.data_engine import datasources

    ds = datasources.get_datasource('username/repo', 'datasource_name')
    ```
- To add an annotation model, specify the repo it's registered under, the model name, as well as the post hook. This will supply an endpoint URL you can forward and add to LabelStudio. Optionally, you can also provide an ngrok token and a project name, and the client will forward and add the endpoint for you as well.

    ```python
    ds.add_annotation_model('username/repo', 'model_name', post_hook=<post_hook_function>)
    ```
Example

```
!docker run -p 9999:9090 configurable-ls-backend
```

```python
from dagshub.data_engine import datasources
from preconfigured_models.image_to_image.polygon_segmentation.yolov8 import post_hook  # yolov8.py -> https://github.com/DagsHub/ls-configurable-model/blob/main/preconfigured_models/image_to_image/polygon_segmentation/yolov8.py

ds = datasources.get_datasource('jinensetpal/COCO_1K', 'COCO_1K')
ds.add_annotation_model('jinensetpal/COCO_1K', 'yolov8-seg',
                        version='4',  # model version in the MLflow model registry
                        post_hook=post_hook,
                        pre_hook=lambda x: x,
                        port=9999,
                        project_name='<label_project_name>',  # in this example, the label project must already exist
                        ngrok_authtoken='<ngrok_token>')
```

For more information about optional arguments, refer to the docstring: `help(ds.add_annotation_model)`.
Info

If you plan to run your annotator locally, you can skip steps 1-5 (creating a LS backend), and call `query_result.annotate_with_mlflow_model` to directly upload LS annotations to the Data Engine.
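For example, a sketch of the local flow (assuming the arguments mirror `add_annotation_model`; check `help(query_result.annotate_with_mlflow_model)` for the exact signature):

```python
from dagshub.data_engine import datasources

ds = datasources.get_datasource('username/repo', 'datasource_name')

# Run the MLflow model locally over a query result and upload the
# resulting LS annotations directly to the Data Engine.
ds.head().annotate_with_mlflow_model('username/repo', 'model_name',
                                     post_hook=post_hook)  # post_hook as defined earlier
```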
Building Post Hooks¶
The key task that remains is setting up a `post_hook`. This can be tricky, because failure is not always explicit. Refer to the following sections for debugging tips to ease that process.
The key idea is that LabelStudio expects a list of predictions, one for each annotation task (different image, different prediction).
A prediction consists of a dictionary containing `result`, `score`, and `model_version` keys.

The `result` key contains a list of results (e.g. multiple instances of an object on a single image), each of which further contains an `id` that must be generated randomly, information about the target, the type of the prediction, as well as the value of the prediction itself. While the values passed vary between tasks, the overall key structure is retained, and following it is crucial to having everything render correctly.
An example of a polygon segmentation prediction JSON is as follows (points trimmed for convenience):

```json
"predictions": [
  {
    "id": 30,
    "model_version": "0.0.1",
    "created_ago": "23 hours, 41 minutes",
    "result": [
      {
        "id": "f346",
        "type": "polygonlabels",
        "value": {
          "score": 0.8430982828140259,
          "closed": true,
          "points": [
            [
              60.15625,
              14.553991317749023
            ],
            [
              60.15625,
              16.19718360900879
            ],
            ...
          ],
          "polygonlabels": [
            "giraffe"
          ]
        },
        "to_name": "image",
        "readonly": false,
        "from_name": "label",
        "image_rotation": 0,
        "original_width": 426,
        "original_height": 640
      }
    ],
    "score": 0.8430982828140259,
    "cluster": null,
    "neighbors": null,
    "mislabeling": 0,
    "created_at": "2024-07-16T12:56:49.517014Z",
    "updated_at": "2024-07-16T12:56:49.517042Z",
    "task": 7,
    "project": 3
  }
]
```
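To make the mapping concrete, here is a hedged sketch of a `post_hook` that wraps hypothetical polygon-model output into this structure (the `(points, label, score)` output schema and the hard-coded image dimensions are assumptions; adapt them to your model):

```python
import uuid

def post_hook(model_output):
    # Assumed: model_output is a list of (points, label, score) tuples,
    # with points given as [[x, y], ...] percentages of the image size.
    results, scores = [], []
    for points, label, score in model_output:
        results.append({
            'id': uuid.uuid4().hex[:4],  # randomly generated id
            'type': 'polygonlabels',
            'value': {
                'score': score,
                'closed': True,
                'points': points,
                'polygonlabels': [label],
            },
            'to_name': 'image',
            'readonly': False,
            'from_name': 'label',
            'image_rotation': 0,
            'original_width': 426,   # illustrative; use the real image dimensions
            'original_height': 640,
        })
        scores.append(score)
    # One prediction dict per task, with result, score, and model_version keys.
    return [{
        'result': results,
        'score': sum(scores) / max(len(scores), 1),
        'model_version': '0.0.1',
    }]
```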
Using Pre-Configured Models¶
To get started quickly with auto-labeling, you can also use our library of pre-configured models, usable with essentially one line of Python. It automatically sets up and configures a LabelStudio project, configures the backend, and adds it to the LabelStudio project, ready for auto-labeling!
First, the general setup remains the same; follow steps 1-6. You should have a datasource to which you'd like to add the auto-labeling backend.
Next, find the task you'd like to run and follow the code snippet:
Polygon Segmentation¶
```python
from preconfigured_models.image_to_image.polygon_segmentation import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and the config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```
Automatic Speech Recognition¶
```python
from preconfigured_models.audio_to_text.automatic_speech_recognition import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and the config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```
Optical Character Recognition¶
```python
from preconfigured_models.image_to_text.ocr import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, its config will be overwritten; otherwise the project is initialized and the config set up
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```
Once the process runs, you should be good to go!
Tips for Debugging & Hassle-Free Development¶
- Use `label-studio-ml start .` locally so you don't have to rebuild your Docker container after every change.
- A local instance of Label Studio is helpful for understanding cases where predictions do not render correctly. We recommend injecting `IPython.embed()` strategically within `predict_tasks` from `label_studio.ml.models` to identify whether there's a discrepancy between what you expect and what you see. For this to work within `model.py`, change `tasks` from L30 to a list containing a path that you know contains valid targets. If you opt for this, use a separate virtual environment for Label Studio.
- Remember that for cloudpickle to work, the Docker container needs to be set up with the same Python version as the client sending the command to the container. You may therefore have to rebuild the container to match it. You can test whether this is working correctly by adding `print(inspect.getsource(self.post_hook))` to `model.py`.
- Include all the dependencies your hooks use in the registered MLflow model.
- Ensure that the MLflow model you are running works dependency-free. Incorrectly registered models may be missing code files, lack a signature, or be missing dependencies.
- Once you initialize a Docker container, running configure multiple times will reset it completely, so the container need not be restarted.
- For unknown JSON formats, you can use the task source "</>" button in LabelStudio's task view to reveal the source JSON, which you can use as a reference to build a functional prediction.
- In case you are importing a module as your hook, use `cloudpickle.register_pickle_by_value(module)` to ensure it does not forward the reference and fail as a result (see the sketch after this list).
- It helps to follow the internal logs of the container. You can do this by running the container interactively with a pseudo-TTY (`docker run -p <port-of-choice>:9090 -it configurable-ls-backend`). Alternatively, you can initialize the container normally and then follow the logs: `docker logs -f <container-id>`.
- To manage multiple Python versions of the Docker container, you can tag builds with the Python version, e.g. `... -t configurable-ls-backend:3.12`, and accordingly run `configurable-ls-backend:3.12`, to avoid rebuilding for every use case.
- Instead of having Docker install packages every time you re-run the container, you can mount a volume to share the cache with your local machine, by passing `-v /path/to/user/.cache/uv:/root/.cache/uv`.
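As referenced in the tips above, a minimal sketch of registering a hook module by value (`my_hooks` is a hypothetical module):

```python
import cloudpickle
import my_hooks  # hypothetical module containing your post_hook

# Serialize the module's code itself rather than a reference to it,
# so the backend container does not need my_hooks installed.
cloudpickle.register_pickle_by_value(my_hooks)

ds.add_annotation_model('username/repo', 'model_name',
                        post_hook=my_hooks.post_hook)
```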