If you have a project that ingests a lot of annotation-requiring data (e.g. any Active Learning project), manually labeling everything gets tedious quickly. One way to ease that strain is to use an ML model to predict the labels, so you just have to cross-check that everything is correct!
DagsHub supports this functionality through LabelStudio's ML Backend - you can add backends to annotation projects, and auto-label away.
Unfortunately, setting up LabelStudio backends is tedious and sometimes hard to debug. However, if you have a model registered on DagsHub via MLflow, we can simplify a large chunk of that process using a combination of our client library and a configurable MLflow-first backend.
Besides information about the model's location, name, and version, two functions can be supplied to create an auto-labeler: a `pre_hook` and a `post_hook`:

- `pre_hook` takes as input a local filepath to the downloaded datapoint for annotation; its output is forwarded to the MLflow model for prediction.
- `post_hook` takes the predictions from the MLflow model and converts them to the LS format.

!!! info
    The `pre_hook` and `post_hook` arguments are optional, and default to the identity function (`lambda x: x`).
!!! tip
    Both hooks are sent using `cloudpickle`. For this to work, the Python version of the docker container should match that of the client. If you get a `SIGSEGV` during inference or `cloudpickle` errors, a discrepancy between versions is likely the root cause.
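For instance, a `pre_hook` for an image model might look like the following. This is a minimal sketch, assuming an MLflow model that accepts a numpy array; adapt it to your model's actual input signature (building a `post_hook` is covered in the prediction-format section below):

```python
import numpy as np
from PIL import Image

# Minimal pre_hook sketch: load the downloaded datapoint from its local
# filepath into the input format the model expects (here, an RGB numpy array).
def pre_hook(filepath):
    return np.asarray(Image.open(filepath).convert('RGB'))
```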
1. Clone and enter the repository:
   ```bash
   git clone https://github.com/DagsHub/ls-configurable-model ; cd ls-configurable-model
   ```
2. Pull git submodules:
   ```bash
   git submodule update --init
   ```
3. From the project root, build a docker container with the tag `configurable-ls-backend`:
   ```bash
   docker build . -t configurable-ls-backend
   ```
4. From here, you can either run a single docker container, or a container orchestrator (multiple containers serving multiple backends).

   Docker container:
   ```bash
   docker run -p 9999:9090 configurable-ls-backend
   ```

   !!! info
       9999 is arbitrarily selected; you can change it to a different port if necessary.

   Orchestrator:
   ```bash
   flask --app orchestrator run
   ```
5. (Optional) If you have any configurable environment variables, copy those to the docker container:
   ```bash
   docker cp path/to/.env <container-id>:/app/
   ```
   These will automatically be loaded into the runtime environment before the MLflow model is called.
6. The backend is now ready; next, we move to the client. Install it using:
   ```bash
   pip install "git+https://github.com/DagsHub/client.git@ls-remote+mlflow#egg=dagshub[autolabelling]"
   ```
Once this is working, you're ready to use any MLflow model as a LS backend. The last things left to supply are the hooks: one that processes filepaths into the desired model input, and one that takes the predictions from an MLflow model and converts them into the LabelStudio format. Refer to the following section for details on building a post hook.
Since datapoints (which are sent for annotation) are each associated with datasources, you must first initialize a datasource before you can add an annotation model to a desired annotation project.
```python
from dagshub.data_engine import datasources
ds = datasources.get_datasource('username/repo', 'datasource_name')
```
To add an annotation model, specify the repo it's registered under, the model name, as well as the post hook. This will supply an endpoint URL you can forward and add to LabelStudio. Optionally, you can also provide an ngrok token and a project name, and the client will forward and add the endpoint for you as well.
```python
ds.add_annotation_model('username/repo', 'model_name', post_hook=<post_hook_function>)
```
!!! example
    ```python
    # first, start the backend: docker run -p 9999:9090 configurable-ls-backend
    from dagshub.data_engine import datasources
    from preconfigured_models.image_to_image.polygon_segmentation.yolov8 import post_hook
    # yolov8.py -> https://github.com/DagsHub/ls-configurable-model/blob/main/preconfigured_models/image_to_image/polygon_segmentation/yolov8.py

    ds = datasources.get_datasource('jinensetpal/COCO_1K', 'COCO_1K')
    ds.add_annotation_model('jinensetpal/COCO_1K', 'yolov8-seg',
                            version='4',  # model version in the MLflow model registry
                            post_hook=post_hook,
                            pre_hook=lambda x: x,
                            port=9999,
                            project_name='<label_project_name>',  # in this example, must be a pre-existing label project
                            ngrok_authtoken='<ngrok_token>')
    ```
For more information about optional arguments, refer to the docstring: `help(ds.add_annotation_model)`.
!!! info
    If you plan to run your annotator locally, you can skip steps 1-5 (creating a LS backend), and call `query_result.annotate_with_mlflow_model` to directly upload LS annotations to the Data Engine.
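For the local route, a minimal sketch might look like the following; the argument list here is an assumption (mirroring `add_annotation_model`), so check `help(query_result.annotate_with_mlflow_model)` for the real signature:

```python
from dagshub.data_engine import datasources

ds = datasources.get_datasource('username/repo', 'datasource_name')
query_result = ds.head()  # the datapoints you want to auto-label

# Assumed to mirror add_annotation_model's arguments — verify with help().
# post_hook is the conversion function described in the previous sections.
query_result.annotate_with_mlflow_model('username/repo', 'model_name',
                                        post_hook=post_hook,
                                        pre_hook=lambda x: x)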
The key task that remains is setting up a `post_hook`. This can be tricky, because failure is not always explicit. Refer to the following sections for debugging tips to ease that process.
The key idea is that LabelStudio expects a list of predictions for each annotation task (different image, different prediction).
A prediction consists of a dictionary containing `result`, `score`, and `model_version` keys. The `result` key contains a list of results (e.g. multiple instances of an object in a single image), each of which contains an `id` that must be generated randomly, information about the target, the type of the prediction, and the value of the prediction itself. While the values passed vary between tasks, the overall key structure is retained, and following it is crucial to having everything render correctly.
An example of polygon segmentation's prediction JSON is as follows (points trimmed for convenience):
"predictions": [
{
"id": 30,
"model_version": "0.0.1",
"created_ago": "23 hours, 41 minutes",
"result": [
{
"id": "f346",
"type": "polygonlabels",
"value": {
"score": 0.8430982828140259,
"closed": true,
"points": [
[
60.15625,
14.553991317749023
],
[
60.15625,
16.19718360900879
],
...,
]
"polygonlabels": [
"giraffe"
]
},
"to_name": "image",
"readonly": false,
"from_name": "label",
"image_rotation": 0,
"original_width": 426,
"original_height": 640
}
],
"score": 0.8430982828140259,
"cluster": null,
"neighbors": null,
"mislabeling": 0,
"created_at": "2024-07-16T12:56:49.517014Z",
"updated_at": "2024-07-16T12:56:49.517042Z",
"task": 7,
"project": 3
}
]
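As a sketch of how a `post_hook` might assemble this structure, suppose the MLflow model returns a list of `(label, score, points)` tuples; that return format is an assumption for illustration, so adapt the unpacking to your model's real output:

```python
import uuid

def post_hook(detections):
    # detections: assumed list of (label, score, points) tuples, where points
    # are [[x, y], ...] pairs expressed as percentages of the image dimensions.
    results = [{
        'id': uuid.uuid4().hex[:4],  # randomly generated id
        'type': 'polygonlabels',
        'from_name': 'label',
        'to_name': 'image',
        'image_rotation': 0,
        'value': {
            'score': score,
            'closed': True,
            'points': points,
            'polygonlabels': [label],
        },
    } for label, score, points in detections]
    return [{
        'model_version': '0.0.1',
        'score': max((r['value']['score'] for r in results), default=0.0),
        'result': results,
    }]
```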
To get started quickly with auto-labeling, you can also use our library of pre-configured models, each usable with essentially one line of Python.
It automatically sets up a LabelStudio project, configures the backend, and adds it to the project, ready for auto-labeling!
First, the general setup remains the same; follow steps 1-6. You should have a datasource to which you'd like to add the auto-labeling backend.
Next, find the task you'd like to run and follow the corresponding code snippet:

Polygon segmentation:

```python
from preconfigured_models.image_to_image.polygon_segmentation import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, it will overwrite its config; otherwise the project is initialized and configured
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```

Automatic speech recognition:

```python
from preconfigured_models.audio_to_text.automatic_speech_recognition import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, it will overwrite its config; otherwise the project is initialized and configured
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```

OCR:

```python
from preconfigured_models.image_to_text.ocr import get_config

ds.add_annotation_model_from_config(get_config(),
                                    project_name='<label_project_name>',  # if this project exists, it will overwrite its config; otherwise the project is initialized and configured
                                    ngrok_authtoken='<ngrok_token>',
                                    port='<ls_configurable_backend_port>')  # optional
```
Once the process runs, you should be good to go!
Some tips for debugging the backend:

- Run `label-studio-ml start .` locally, so you don't have to rebuild your docker container after every change.
- Use `IPython.embed()` strategically within `predict_tasks` from `label_studio.ml.models` to identify if there's a discrepancy between what you expect and what you see. For this to work within `model.py`, change `tasks` on L30 to a list containing a path that you know contains valid targets. If you opt for this, use a separate virtual environment for label-studio.
- Add `print(inspect.getsource(self.post_hook))` to `model.py` to verify that the hook the backend deserialized matches the one you sent.
- If your hook depends on a locally defined module, call `cloudpickle.register_pickle_by_value(module)` to ensure cloudpickle does not forward the reference and fail as a result (see the sketch after this list).
- Run the container interactively to watch logs in real time (`docker run -p <port-of-choice>:9090 -it configurable-ls-backend`). Alternatively, you can also initialize the container normally, and then follow the logs: `docker logs -f <container-id>`.
- If you build images for multiple Python versions, tag them accordingly (e.g. `... -t configurable-ls-backend:3.12`) and run the matching tag (`configurable-ls-backend:3.12`), to avoid rebuilding for every use case.
- Mount your local uv cache into the container (`-v /path/to/user/.cache/uv:/root/.cache/uv`) to avoid re-downloading dependencies on every build.
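As mentioned in the tips above, if your hooks reference a module you wrote locally, register it by value before adding the annotation model (`my_hooks` is a hypothetical module name):

```python
import cloudpickle
import my_hooks  # hypothetical local module that defines your hooks

# By default, cloudpickle pickles imported modules by reference, which fails
# on the backend if the module isn't installed there; registering by value
# embeds the module's code in the pickle instead.
cloudpickle.register_pickle_by_value(my_hooks)
```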