
AICaptionIt

This document outlines in detail the thoughts, questions, and technical approaches discussed around the project for generating captions for Instagram posts containing multiple photos. The overall idea is to build a multimodal model (capable of interpreting images) that generates three captions in different styles (Chill, Fun, Minimalist), helping users quickly find inspiration for their posts.


1. Problem Statement and Concept

  • Observation: On Instagram, creating visuals has become easy, but finding captions that drive engagement remains difficult.
  • Goal: Develop a tool that takes 1 to 10 photos from the same post and instantly generates 3 captions in distinct tones:
    1. Chill: relaxed, poetic, contemplative
    2. Fun: humorous, lighthearted, dynamic
    3. Minimalist: very concise, punchy, aesthetic

Use Case

A user is preparing a beach vacation carousel: beach, cocktails, friends… The tool returns three tailored captions, ready to copy and paste into the post.


2. Technical Architecture

2.1 Model Selection: LLaVA or Equivalent Approach

  • LLaVA (Large Language and Vision Assistant) combines:
    • A visual encoder (typically CLIP ViT-L) to understand images
    • An LLM (e.g., LLaMA 7B or Mistral 7B) to generate text responses
  • All-in-one approach: The model receives images (or their embeddings) and directly returns captions.
  • Fine-tuning considered via QLoRA to run on a local GPU (e.g., RTX 4070, 12GB VRAM).

2.2 Processing X Photos

  1. Encoding: Each photo is encoded using CLIP.
  2. Fusion: Embeddings of X photos are aggregated (mean or concatenation + projection).
  3. Generation: The LLM generates 3 captions based on a prompt specifying the desired styles.
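The fusion step above can be sketched with mean pooling, assuming the per-photo embeddings have already been extracted by CLIP (the 768-dim vectors here stand in for CLIP ViT-L outputs):

```python
import numpy as np

def fuse_embeddings(embeddings: list) -> np.ndarray:
    """Aggregate per-photo CLIP embeddings into one post-level vector.

    Mean pooling keeps the dimensionality fixed regardless of how many
    photos (1 to 10) the post contains; concatenation + projection is
    the heavier alternative mentioned above.
    """
    stacked = np.stack(embeddings)   # (n_photos, dim)
    fused = stacked.mean(axis=0)     # (dim,)
    # Re-normalize, since CLIP embeddings are typically unit-length
    return fused / np.linalg.norm(fused)

# Example: three photos from one carousel, 768-dim (CLIP ViT-L size)
rng = np.random.default_rng(0)
photos = [rng.normal(size=768) for _ in range(3)]
post_embedding = fuse_embeddings(photos)
print(post_embedding.shape)
```

Concatenation + projection preserves per-image detail at the cost of a learned projection layer; mean pooling is the simpler default when post length varies.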

3. Datasets

3.1 Analysis of Three Candidates

  • COCO Captions

    • ~123,287 images, 5 captions per image
    • Everyday scenes, objects, people, animals
    • Relevance: Excellent foundation for caption generation
  • Flickr30K

    • ~31,783 images, 5 captions per image
    • Focused on human scenes, more varied interactions
    • Relevance: Complementary to COCO, more people-centric images
  • MNIST

    • Dataset of handwritten digits (0–9)
    • Not relevant for multimodal captioning

3.2 Caption Enrichment: Automatic Style Generation

The COCO and Flickr30K datasets provide descriptive captions, not “stylized” ones. For AICaptionIt, three stylized variations per caption are required.

  • Problem: No “style labels” in COCO or Flickr.
  • Solution: Use an LLM (e.g., GPT, Mistral) to automatically rewrite each caption into 3 versions:
    • Chill (poetic, contemplative)
    • Fun (humorous)
    • Minimalist (very short, impactful)

Why do this?

  • Manual annotation is time-consuming and expensive.
  • Using an LLM for paraphrasing allows the creation of a stylized dataset at scale and low cost.
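A minimal sketch of the rewrite step, assuming a local Ollama server on its default port (the payload shape follows Ollama's `/api/generate` endpoint; the prompt wording itself is illustrative):

```python
import json
import urllib.request

def build_style_prompt(caption: str) -> str:
    """Ask the LLM to rewrite one descriptive caption into the 3 styles."""
    return (
        "Rewrite the following image caption in 3 styles:\n"
        "1. Chill (poetic, contemplative)\n"
        "2. Fun (humorous, lighthearted)\n"
        "3. Minimalist (very short, impactful)\n"
        f"Caption: {caption}\n"
        "Answer as JSON with keys chill, fun, minimalist."
    )

def rewrite_caption(caption: str, model: str = "mistral") -> str:
    """Send the prompt to a local Ollama instance (requires `ollama serve`)."""
    payload = json.dumps({
        "model": model,
        "prompt": build_style_prompt(caption),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

print(build_style_prompt("A dog runs on the beach."))
```

Asking for JSON output makes the three variations easy to parse back into the dataset, though local models occasionally break the format and need a retry.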

3.3 Cost and Duration

  • Using an API (e.g., Perplexity “sonar-small-chat”): ~$0.0001 per triplet
    • 1,000 captions: $0.10
    • Full COCO (~600k captions): ~$60
  • Local model with Ollama (e.g., Mistral 7B or OpenHermes 2.5 quantized)
    • Zero token cost once downloaded
    • Inference time ~1.5 seconds per prompt (3 variations), ~51h for 123k images (full COCO)
    • Can parallelize (multiprocessing) to speed things up
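The figures above can be reproduced with back-of-the-envelope arithmetic (the price per triplet and seconds per prompt are the estimates quoted in this document, not measured benchmarks):

```python
def api_cost(n_captions: int, price_per_triplet: float = 0.0001) -> float:
    """USD cost of rewriting n captions via a paid API (one triplet each)."""
    return n_captions * price_per_triplet

def local_hours(n_prompts: int, secs_per_prompt: float = 1.5,
                workers: int = 1) -> float:
    """Wall-clock hours for local inference, optionally parallelized."""
    return n_prompts * secs_per_prompt / workers / 3600

print(api_cost(600_000))               # ~$60 for full COCO via API
print(round(local_hours(123_287), 1))  # ~51h locally, single worker
print(round(local_hours(123_287, workers=4), 1))  # parallelized estimate
```

The same arithmetic makes the case for the reduced dataset in the next section: 5,000 prompts at 1.5 s each is roughly two hours of local generation.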

4. Building a Reduced and Balanced Dataset

To avoid long generation times, it’s recommended to create a mini-dataset of ~5,000 examples, mixing 60% COCO and 40% Flickr30K for visual diversity.

  • Examples: 3,000 “posts” from COCO, 2,000 from Flickr30K
  • Each post: 1 to 5 grouped images to simulate an Instagram carousel
  • Captions: 1 original + 2 rewrites = 3 final styles, stored in JSON
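A sampling sketch under the assumptions above (the toy ID lists stand in for real COCO and Flickr30K image identifiers):

```python
import random

def build_posts(coco_ids, flickr_ids, n_coco=3000, n_flickr=2000, seed=42):
    """Sample a 60/40 COCO/Flickr30K mix and group images into posts.

    Each post gets 1 to 5 images to simulate an Instagram carousel.
    """
    rng = random.Random(seed)
    pool = rng.sample(coco_ids, n_coco) + rng.sample(flickr_ids, n_flickr)
    rng.shuffle(pool)
    posts, i = [], 0
    while i < len(pool):
        size = rng.randint(1, 5)
        posts.append({"images": pool[i:i + size]})
        i += size
    return posts

# Toy run with fake IDs (a real run would use actual dataset image IDs)
posts = build_posts([f"coco_{k}" for k in range(10_000)],
                    [f"flickr_{k}" for k in range(5_000)])
print(len(posts), posts[0])
```

Fixing the seed keeps the sampled mini-dataset reproducible across runs, which matters once fine-tuning results need to be compared.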

5. Fine-Tuning Methods (on RTX 4070)

5.1 QLoRA (Quantized Low-Rank Adaptation)

  • Enables fine-tuning a 7B model on a 12GB GPU through quantization (Q4 / Q5)
  • PEFT (Parameter-Efficient Fine-Tuning): only a few layers are trained, saving VRAM

5.2 Process

  1. Preprocessing: encode/store image embeddings (CLIP ViT-L)

  2. Integration: inject embeddings as visual context

  3. Instruction Tuning:

    • Prompt style:
      Here is a series of X images. Generate 3 captions:
      1. Chill
      2. Fun
      3. Minimalist
      
    • With the actual stylized caption as label
  4. Evaluation: On a validation set (~500 examples), measure BLEU / ROUGE / CIDEr or perplexity
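The instruction-tuning sample described in steps 2–3 could be assembled like this (the field names are illustrative, not a fixed LLaVA schema):

```python
def make_training_example(n_images: int, chill: str, fun: str,
                          minimalist: str) -> dict:
    """Pair the styled prompt with the gold stylized captions as the label."""
    prompt = (
        f"Here is a series of {n_images} images. Generate 3 captions:\n"
        "1. Chill\n"
        "2. Fun\n"
        "3. Minimalist"
    )
    label = f"1. {chill}\n2. {fun}\n3. {minimalist}"
    return {"prompt": prompt, "label": label}

ex = make_training_example(
    3,
    chill="Golden hour, salt air, nowhere to be.",
    fun="Sunscreen: 1, tan lines: 3.",
    minimalist="Beach mode.",
)
print(ex["prompt"])
```

In practice the image embeddings are injected alongside this text prompt; only the text portion is shown here.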


6. Local Tools (Ollama, Mistral, OpenHermes)

Why these tools?

  • Ollama: runs quantized models locally (Q4_K_M, Q5_K_M…) for inference
  • Mistral 7B: great performance-to-size ratio, ideal for paraphrasing
  • OpenHermes 2.5: fine-tuned for instruction tasks, strong style understanding, stable for local use

Estimated Local Inference Time

  • ~1.5s per prompt with Mistral 7B (Q4/Q5) on RTX 4070
  • Multiply by the number of captions to generate

7. Summary Action Plan

  1. Select a subset of COCO + Flickr30K (~5,000 examples)
  2. Group images into posts (1 to 5 images each)
  3. Extract 1 caption per post (original text)
  4. Rewrite: use LLM (API or local via Ollama) to produce 3 styles
  5. Format: create a clean JSON or JSONL, e.g.:
    {
      "images": ["imgA.jpg", "imgB.jpg"],
      "caption_chill": "...",
      "caption_fun": "...",
      "caption_minimalist": "..."
    }
    
  6. Fine-tuning: LLaVA (or equivalent) + QLoRA on the stylized dataset
  7. Validation: evaluate captioning scores (BLEU/ROUGE/CIDEr) or conduct user testing
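Step 5 above, as a JSONL writer sketch (one record per line, matching the example schema; the caption strings are placeholders):

```python
import json

def write_jsonl(records: list, path: str) -> None:
    """Serialize one post per line -- a common format for fine-tuning data."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [{
    "images": ["imgA.jpg", "imgB.jpg"],
    "caption_chill": "Slow waves, slower afternoons.",
    "caption_fun": "Crab: 1, sandcastle: 0.",
    "caption_minimalist": "Sea you.",
}]
write_jsonl(records, "dataset.jsonl")
```

JSONL keeps each post independently parseable, so the rewrite step can append records incrementally and a crashed run loses at most one line.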

8. Strengths & Limitations

Strengths

  • Turnkey solution: upload photos → get 3 caption styles
  • Cost: low or zero (local) for generating a stylized dataset
  • Technology: open-source, replicable, VRAM-friendly (QLoRA)
  • Impact: makes Instagram content creation easier

Limitations

  • Authenticity: automatic rewrites may lack a “human touch”
  • Interpretability: multimodal models are harder to debug than modular pipelines
  • Generation time: full dataset generation can be slow (a reduced subset is advised)

9. Next Steps

  1. Prototype a script that:
    • Samples images from COCO/Flickr
    • Generates 3 styles via Ollama or an API
    • Stores everything in final JSON
  2. Fine-tune LLaVA or an equivalent model with QLoRA on the RTX 4070
  3. Test on a sample of real Instagram posts to validate caption quality
  4. Iterate on quality (prompt engineering, tone adjustments, adding hashtags, etc.)

Conclusion

The AICaptionIt project offers a concrete solution to the challenge of generating stylized captions for Instagram. Leveraging enriched datasets (COCO, Flickr30K) and using APIs or local models (Ollama + Mistral/OpenHermes), a complete and cost-effective pipeline can be built.

This solution enables rapid experimentation and easy scaling, all within hardware constraints (a single RTX 4070 is enough). Best of all, it avoids expensive manual annotation thanks to the generative power of modern models.

For any questions or requests for scripts/tutorials on setup, fine-tuning, or dataset enrichment, refer to the detailed sections or contact the project team.

