preprocess_final.py 2.8 KB

##############################################################################################
###### Cleaning and Preprocessing the final dataset of publications related to COVID-19 ######
##############################################################################################
########################################################################
# Importing the required libraries.
import ast
import csv, pandas as pd, numpy as np
from preprocess import Preprocess
########################################################################
class ProcessFinal(Preprocess):
    # Cleaning and preprocessing the final dataset.
    def _preprocess(self):
        # Defining the "None" value for the "NaN" values.
        self._dataframe.replace({np.nan: None}, inplace=True)
        # Changing the type of features. The list-like columns are parsed with
        # ast.literal_eval rather than eval, so no code found in the data is executed;
        # this assumes the cells hold plain Python literals (e.g. tuples of dicts).
        for column in ["auth_keywords", "index_terms", "affiliations",
                       "subject_areas", "authors", "author_affil", "references"]:
            self._dataframe[column] = self._dataframe[column].apply(
                lambda value: ast.literal_eval(value) if value else None)
        self._dataframe["publication_date"] = pd.to_datetime(self._dataframe["publication_date"])
        # Defining the "zero" value for the articles without citation or reference counts.
        # A single .loc call on the dataframe avoids chained assignment, which may not
        # write back to the dataframe.
        self._dataframe.loc[self._dataframe["citation_num"].isnull(), "citation_num"] = 0
        self._dataframe.loc[self._dataframe["ref_count"].isnull(), "ref_count"] = 0
        # Extracting the missing authors from the feature "author_affil".
        no_authors = self._dataframe["authors"].isnull() & self._dataframe["author_affil"].notnull()
        self._dataframe.loc[no_authors, "authors"] = \
            self._dataframe.loc[no_authors, "author_affil"].apply(
                lambda authors: tuple(
                    {"name": author["name"]} for author in authors if author["name"]))
        # Removing the empty tuples of authors (compared element-wise, because "== ()"
        # on a whole column is treated as a comparison with a zero-length array).
        self._dataframe.loc[
            self._dataframe["authors"].apply(lambda value: value == ()), "authors"] = None
        # Extracting the missing affiliations from the feature "author_affil".
        no_affils = self._dataframe["affiliations"].isnull() & self._dataframe["author_affil"].notnull()
        self._dataframe.loc[no_affils, "affiliations"] = \
            self._dataframe.loc[no_affils, "author_affil"].apply(
                lambda affils: tuple(
                    {"affiliation": affil["affiliation"]} for affil in affils if affil["affiliation"]))
        # Removing the empty tuples of affiliations.
        self._dataframe.loc[
            self._dataframe["affiliations"].apply(lambda value: value == ()), "affiliations"] = None
        # Defining the "None" value for the "NaN" values.
        self._dataframe.replace({np.nan: None}, inplace=True)
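
The parsing and back-filling steps in `_preprocess` can be tried on a toy frame. This is a minimal sketch, not the repository's code: the helper names `parse_literal` and `fill_authors` and the sample rows are hypothetical, and it assumes the list-like columns are stored as strings of plain Python literals (tuples of dicts), which is what would allow `ast.literal_eval` to stand in for `eval`.

```python
import ast

import pandas as pd


def parse_literal(value):
    """Parse a stringified Python literal; empty or missing cells become None."""
    # Hypothetical helper: only non-empty strings are parsed.
    if isinstance(value, str) and value:
        return ast.literal_eval(value)
    return None


def fill_authors(frame):
    """Back-fill missing `authors` from `author_affil`, then drop empty tuples."""
    frame = frame.copy()
    missing = frame["authors"].isnull() & frame["author_affil"].notnull()
    # Assigning a Series aligned on the index keeps each tuple as one cell value.
    frame.loc[missing, "authors"] = frame.loc[missing, "author_affil"].apply(
        lambda entries: tuple({"name": e["name"]} for e in entries if e["name"])
    )
    # Compare element-wise; `series == ()` would compare against a length-0 array.
    empty = frame["authors"].apply(lambda value: value == ())
    frame.loc[empty, "authors"] = None
    return frame


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "authors": ["({'name': 'A. Smith'},)", None, None],
        "author_affil": [
            None,
            "({'name': 'B. Jones', 'affiliation': 'X'},)",
            "({'name': None, 'affiliation': 'Y'},)",
        ],
    }
)
for column in ["authors", "author_affil"]:
    sample[column] = sample[column].apply(parse_literal)

result = fill_authors(sample)
print(result["authors"].tolist())
```

The second row gains an author taken from `author_affil`, while the third row, whose only entry has no name, ends up with a missing value instead of an empty tuple.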