
README.md


Overview

This project extracts key sections from scientific papers and converts them into text. The extracted sections are the title, abstract, and authors. So far I've used UNet and OCR: the UNet extracts the sections, and the OCR converts them into text. The complete and final version includes BERT for text summarization. Full tutorial here.

Installation

  1. Start by cloning my repository.
git clone https://dagshub.com/Eman22S/Unet-OCR-2.0.git
  2. Install dvc to track your dataset.
pip install dvc
  3. Install dagshub for data, model and experiment logging/tracking.
pip install dagshub
  4. Configure your dvc remote as follows. Dagshub will auto-generate the content in braces for you.
dvc remote add origin https://dagshub.com/Eman22S/Unet-OCR-2.0.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user {your-username}
dvc remote modify origin --local password {your-password}
  5. If you want to pull my model and dataset, run this command. Otherwise, skip this step.
dvc pull -r origin
  6. Install tesseract on your machine.
sudo apt install tesseract-ocr-all -y
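
Before moving on, it can help to confirm that the command-line tools from the steps above are actually on your PATH. The following is a minimal sketch using only the Python standard library; the tool list and the `missing_tools` helper are assumptions made for illustration and are not part of this repository.

```python
import shutil

# Tools the installation steps rely on. This list is an assumption;
# adjust it to match what you actually installed.
REQUIRED_TOOLS = ["git", "dvc", "tesseract"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported as missing, rerun the matching installation step before continuing.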

Running

Make sure you have changed directory to installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet. Also make sure your images and masks are in installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet/data/imgs and installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet/data/masks respectively, or run dvc pull -r origin to use my training dataset.
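
If you bring your own data, a quick check that every image has a mask can save a failed training run. This is a rough sketch that assumes each mask shares its file name (without the extension) with its image; that naming rule is an assumption, so adapt it if your masks use a suffix. The `unpaired_images` helper is hypothetical and not part of this repository.

```python
from pathlib import Path


def unpaired_images(imgs_dir, masks_dir):
    """Return the stems of images that have no mask with the same stem.

    Assumes a mask is matched to its image by file name without the
    extension, e.g. imgs/page1.jpg <-> masks/page1.png.
    """
    mask_stems = {p.stem for p in Path(masks_dir).iterdir() if p.is_file()}
    return sorted(
        p.stem
        for p in Path(imgs_dir).iterdir()
        if p.is_file() and p.stem not in mask_stems
    )
```

Run from the Pytorch-UNet directory, `unpaired_images("data/imgs", "data/masks")` returns an empty list when every image has a matching mask.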

  1. Start by training your model using:
python train.py --amp
  2. Run prediction using the following code:
python predict.py -i {path to testset} -o {path to save generated file}
  3. Convert the predicted mask into an image:
python postprocess.py -i {path to masked image} -e {path to original image} -o {path to save output}
  4. Convert the image to text using tesseract:
tesseract {path to the postprocessed image} {path with name of file to save the text} -l eng
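
Once the model is trained, the prediction, postprocessing and OCR steps above can be chained from a single script. The sketch below only wraps the same commands with `subprocess`; the function names and the example file paths are hypothetical, and it assumes you run it from the Pytorch-UNet directory with `python` and `tesseract` on your PATH.

```python
import subprocess


def build_commands(test_image, mask_path, cropped_path, text_base):
    """Return the predict, postprocess and tesseract commands as argument lists."""
    return [
        # Predict a mask for the input page image.
        ["python", "predict.py", "-i", test_image, "-o", mask_path],
        # Combine the predicted mask with the original image.
        ["python", "postprocess.py", "-i", mask_path, "-e", test_image, "-o", cropped_path],
        # OCR the result; tesseract appends ".txt" to the output base name.
        ["tesseract", cropped_path, text_base, "-l", "eng"],
    ]


def run_pipeline(test_image, mask_path, cropped_path, text_base):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_commands(test_image, mask_path, cropped_path, text_base):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("page.jpg", "page_mask.png", "page_sections.jpg", "page_text")` would leave the extracted text in `page_text.txt`, provided each step succeeds.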

Limitations

The dataset this project is trained on consists of first-page images of scientific papers. If you want to train on your own dataset, make sure you convert it to JPG images.
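
One possible way to do the JPG conversion is with Pillow (`pip install pillow`), which is an assumption here since it is not listed among this project's install steps. The helper names below are hypothetical. The sketch writes a `.jpg` copy next to each non-JPG file and leaves the originals in place.

```python
from pathlib import Path

JPG_SUFFIXES = {".jpg", ".jpeg"}


def plan_jpg_conversion(src_dir):
    """Return (source, target) path pairs for files that are not already JPG."""
    pairs = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and path.suffix.lower() not in JPG_SUFFIXES:
            pairs.append((path, path.with_suffix(".jpg")))
    return pairs


def convert_to_jpg(src_dir):
    """Convert every non-JPG image in src_dir to JPG and return the new paths."""
    from PIL import Image  # Pillow, a third-party package, imported lazily

    written = []
    for src, dst in plan_jpg_conversion(src_dir):
        with Image.open(src) as img:
            # JPG has no alpha channel, so convert to RGB before saving.
            img.convert("RGB").save(dst, "JPEG")
        written.append(dst)
    return written
```

Files that Pillow cannot open as images will raise an error, so keep only image files in the directory you convert.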
