SUCHO Ukrainian Cultural Heritage Web Archives

Stream data with DDA:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/sucho-dataset")

fs.listdir("s3://sucho-opendata")

Description:

The dataset contains web archives of Open Access collections of digitised cultural heritage from more than 3,000 websites of Ukrainian cultural institutions, such as museums, libraries and archives. The web archives were produced by SUCHO, a volunteer group of more than 1,300 international cultural heritage professionals (librarians, archivists, researchers, programmers) who joined forces during the 2022 invasion of Ukraine to save as much digitised cultural heritage as possible before the servers hosting it are destroyed, damaged or taken offline for any other reason. The web archives were created with the tools of the Webrecorder open source project in the open WACZ format: https://webrecorder.github.io/wacz-spec/1.1.1/. WACZ files are zipped containers of WARC (Web ARChive) files enriched with metadata, and a single WACZ file can contain several crawls. File sizes range from a few MB to several TB.
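Because a WACZ file is an ordinary ZIP container, its contents can be inspected with standard tooling. The following is a minimal sketch, assuming the layout described in the WACZ specification (WARC data under `archive/` and a `datapackage.json` at the root); the function name `summarize_wacz` is hypothetical and not part of any SUCHO or Webrecorder tool.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return the WARC members and datapackage metadata of a WACZ file."""
    with zipfile.ZipFile(path) as wacz:
        names = wacz.namelist()
        # Assumption from the WACZ spec: raw WARC data lives under archive/.
        warcs = [
            name
            for name in names
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]
        # Assumption from the WACZ spec: datapackage.json describes the package.
        metadata = {}
        if "datapackage.json" in names:
            metadata = json.loads(wacz.read("datapackage.json"))
    return {"warcs": warcs, "metadata": metadata}
```

Calling `summarize_wacz("example.wacz")` on a locally downloaded archive (the file name is a placeholder) lists its WARC members without extracting them, which matters for the larger files in this collection.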

Update Frequency:

Periodically

Managed By:

SUCHO

Resources:

  1. resource:
    • Description: WACZ archives
    • ARN: arn:aws:s3:::sucho-opendata
    • Region: eu-central-1
    • Type: S3 Bucket
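The ARN and region above are enough to derive the bucket's HTTPS endpoint. The sketch below builds a virtual-hosted-style S3 ListObjectsV2 URL from them; the helper names are hypothetical, and whether the bucket permits anonymous listing is an assumption that should be checked before relying on it.

```python
from urllib.parse import urlencode


def bucket_from_arn(arn):
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "s3"]:
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Anything after the first "/" would be an object key, not the bucket.
    return parts[5].split("/", 1)[0]


def list_objects_url(arn, region, prefix="", max_keys=10):
    """Build a virtual-hosted-style ListObjectsV2 URL for the bucket."""
    bucket = bucket_from_arn(arn)
    query = urlencode({"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)})
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


url = list_objects_url("arn:aws:s3:::sucho-opendata", "eu-central-1")
print(url)
```

If the bucket allows unauthenticated reads, requesting this URL returns an XML listing of up to `max_keys` objects; otherwise the same bucket name and region can be passed to an authenticated S3 client.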

Tags:

ukraine, internet, cultural preservation, aws-pds

About

sucho-dataset originates from the Registry of Open Data on AWS.
