german_fake_news_classifier

Overview

This repo is about classifying German news and fake news. The objective is to train a binary classifier that can classify news articles as fake or non-fake. This repo provides a stack for

All of these functionalities are provided as a template so you can kick off your own project or - even better - contribute to this repo ;-)

A few words regarding the data we used here to train our model(s):

For training a model we currently use two data sources. The first stems from Kaggle (https://www.kaggle.com/kerneler/starter-fake-news-dataset-german-9cc110a2-9/data). This dataset is a collection of news (non-fake) and fake news, where the fake news are drawn from satirical online publishers such as "Die Tagespresse" or "Der Postillion". This means that sarcastic news articles are treated as fake news in this dataset (see the EDA on that dataset --> notebooks/01-eda-german-fake-news.ipynb).

The second source of fake news is a dataset from Inna Vogel and Peter Jiang (2019)*. In this dataset, every fake statement in the text was verified claim by claim against authoritative sources (e.g. local police authorities, scientific studies, the police press office). Most of the news items date from December 2015 to March 2018.

*Vogel, I., & Jiang, P. (2019). Fake News Detection with the New German Dataset "GermanFakeNC". In Digital Libraries for Open Knowledge - 23rd International Conference on Theory and Practice of Digital Libraries, TPDL 2019, Oslo, Norway, September 9-12, 2019, Proceedings (pp. 288–295).

Reproducing the Project and Contributing

  1. Fork the project on Dagshub
  2. Clone the project to your working machine
  3. Set up the .env file in the project root dir (see .env.example)
  4. Set up config.json in .azureml (see config.json.example and the sketch after this list) -> optional, only needed for deploying the model
  5. Install anaconda-project via conda into a conda env of your choice (outside of this project)
  6. Activate that environment
  7. Run anaconda-project prepare to download and install the required packages into the conda env defined in anaconda-project.yml (the full command sequence is sketched below)
  8. Experiment and commit changes to your own branch
  9. Push your work back up to your fork
  10. Submit a Pull request so that we can review your changes
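
For step 4: the Azure ML SDK typically reads the workspace connection from .azureml/config.json. Here is a minimal sketch with placeholder values - check config.json.example for the exact fields this repo expects:

  {
    "subscription_id": "<your-azure-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-azureml-workspace>"
  }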

NOTE: Be sure to merge the latest from "upstream" before making a pull request!
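
Putting steps 2-9 together, a typical command sequence looks roughly like this (repository URL, env name, and branch name are placeholders for your own fork and setup):

  # clone your fork (placeholder URL)
  git clone https://dagshub.com/<your-user>/german_fake_news_classifier.git
  cd german_fake_news_classifier

  # install anaconda-project into a conda env of your choice and activate it
  conda install -n <your-env> anaconda-project
  conda activate <your-env>

  # download and install the packages defined in anaconda-project.yml
  anaconda-project prepare

  # experiment on your own branch and push back to your fork
  git checkout -b <your-branch>
  git commit -am "Describe your change"
  git push origin <your-branch>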

For reproduction, consider the following:

  1. Check out all the .example files in order to get your env and credentials set up
  2. Become familiar with the dvc workflow in combination with git (a typical cycle is sketched below)
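
A typical dvc-plus-git cycle for this repo could then look like the following sketch (assuming your DVC remote and credentials are already configured via the steps above):

  # fetch the DVC-tracked data and models referenced in dvc.lock
  dvc pull

  # re-run the pipeline defined in dvc.yaml; only stages with changed inputs are executed
  dvc repro

  # record the new pipeline state in git and push the artifacts to the DVC remote
  git add dvc.lock metrics.csv
  git commit -m "Reproduce pipeline"
  dvc push
  git push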

Project Organization

.
├── LICENSE
├── .azureml                <- Store Azure specific configurations
├── README.md               <- The top-level README for developers using this project.
├── anaconda-project.yml
├── bin
│   └── models              <- Trained and serialized models (model.pkl)
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── dvc.lock
├── dvc.yaml
├── envs
│   ├── fake_news_env
│   └── inference_env
├── .env                    <- Env file to store environment-specific and/or private variables
├── metrics.csv
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for ordering) followed by a short dash-delimited description, e.g. 01-eda-german-fake-news.ipynb
├── params.yml
├── references
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated graphics and figures to be used in reporting
└── src                     <- Source code for use in this project.
    ├── __init__.py         <- Makes src a Python module
    ├── data                <- Scripts to download or generate data
    ├── features            <- Scripts to turn raw data into features for modeling
    ├── models              <- Scripts to train, evaluate, test and deploy models
    └── visualization       <- Scripts to create exploratory and results oriented viz
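
dvc.yaml wires these folders together into the training pipeline, and dvc.lock pins the exact versions of each stage's inputs and outputs. The stage names and script paths below are purely illustrative (the real dvc.yaml may differ) - the sketch only shows the general shape of such a stage definition:

  stages:
    preprocess:                               # hypothetical stage name
      cmd: python src/data/preprocess.py      # hypothetical script path
      deps:
        - data/raw
        - src/data/preprocess.py
      outs:
        - data/processed
    train:                                    # hypothetical stage name
      cmd: python src/models/train.py         # hypothetical script path
      deps:
        - data/processed
        - params.yml
      outs:
        - bin/models/model.pkl
      metrics:
        - metrics.csv:
            cache: false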

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Icons made by Freepik from www.flaticon.com