FirstBooks.py
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.14.0
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
# # First Books
#
# This notebook prepares a data set of book information for a prediction task: predicting whether a new author will publish a second book. Michael Ekstrand uses it for teaching data science.

# %% [markdown]
# ## Setup

# %%
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
# %% [markdown]
# ## Book Statistics
#
# The first step is to compute some book interaction statistics.
#
# Let's load and link:

# %%
links = pl.scan_parquet('gr-book-ids.parquet')
ixs = pl.scan_parquet('gr-interactions.parquet')
ixs = ixs.join(links, on='book_id')
# %% [markdown]
# Now aggregate into statistics:

# %%
work_stats = ixs.groupby('work_id').agg([
    # number of add-to-shelf actions
    pl.col('rec_id').count().alias('n_shelves'),
    # number of distinct users who interact with it
    pl.col('user_id').n_unique().alias('n_users'),
    # number of ratings
    pl.col('rating').where(pl.col('rating') > 0).count().alias('n_rates'),
    # mean rating
    pl.col('rating').where(pl.col('rating') > 0).mean().alias('mean_rate'),
    # number of "positive" ratings
    (pl.col('rating') > 2).sum().alias('n_pos_rates'),
])
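# %% [markdown]
# As a quick sanity check (an aside, not part of the original pipeline), we can inspect the lazy frame's schema without triggering any computation:

# %%
# column names and dtypes the aggregation will produce
work_stats.schema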
# %% [markdown]
# Link each work with its publication year (taking the earliest year across the work's editions):

# %%
book_info = pl.scan_parquet('gr-book-info.parquet')
book_link = pl.scan_parquet('gr-book-link.parquet')
book_info = book_info.join(book_link, on='book_id')
work_year = book_info.groupby('work_id').agg(pl.col('pub_year').min())
work_stats = work_stats.join(work_year, on='work_id')
# %% [markdown]
# Now a bit of a detour: we need authors. Let's load those:

# %%
book_authors = pl.scan_parquet('gr-book-authors.parquet')
# %% [markdown]
# And we want to get the *first* work of each author:

# %%
author_works = book_authors.join(book_info, on='book_id').filter(pl.col('pub_year').is_not_null()).sort([
    'pub_year',
    'pub_month',
    'pub_date',
])
author_first_work = author_works.groupby('author_id').agg(pl.col('work_id').first())
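# %% [markdown]
# The sort-then-`first()` idiom deserves a quick illustration (a toy example, not data from this notebook): aggregations see rows in frame order, so sorting by publication date first means `first()` picks each author's earliest work.

# %%
# toy frame: two authors with works listed out of chronological order
toy = pl.DataFrame({
    'author_id': [1, 1, 2],
    'work_id': [10, 11, 20],
    'pub_year': [2010, 2008, 2009],
})
# author 1 should map to work 11 (2008), author 2 to work 20 (2009)
toy.sort('pub_year').groupby('author_id').agg(pl.col('work_id').first())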
# %% [markdown]
# Now we only want authors' first works that were published after GoodReads started in 2007 (so 2008 or later), and no later than 2012, to give the author time to publish a new book before the data runs out in 2017:

# %%
first_work_stats = work_stats.join(author_first_work, on='work_id')
first_work_stats = first_work_stats.filter((pl.col('pub_year') >= 2008) & (pl.col('pub_year') <= 2012))
# %% [markdown]
# OK, now we have a table of first-work statistics. We're going to take those authors and find out how many total works they have in the data set.

# %%
author_nworks = first_work_stats.select(['author_id']).join(author_works, on='author_id').groupby('author_id').agg([
    pl.col('work_id').n_unique().alias('au_nbooks')
])
# %% [markdown]
# Join this with the original table:

# %%
mb_table = first_work_stats.join(author_nworks, on='author_id')
# %% [markdown]
# Finally, we can compute this entire table. How will Polars do it?

# %%
print(mb_table.describe_optimized_plan())
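# %% [markdown]
# (Aside, not from the original notebook: `describe_optimized_plan` is from older Polars releases; newer versions expose the optimized plan through `LazyFrame.explain()` instead.)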
# %% [markdown]
# And run it:

# %%
mb_table = mb_table.collect()
mb_table
# %%
mb_table.write_csv('author-first-works.csv')
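# %% [markdown]
# This table is what the downstream prediction task would consume. As a sketch (hypothetical, not part of this notebook's pipeline), the binary label follows from `au_nbooks`: since it counts all of an author's works including the first, `au_nbooks > 1` means the author published a second book.

# %%
# hypothetical label derivation for the prediction task
mb_table.with_columns([
    (pl.col('au_nbooks') > 1).alias('has_second_book'),
])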
# %% [markdown]
# ## Author Names
#
# Let's get author names and work titles for some debugging info.

# %%
authors = pl.read_parquet('gr-author-info.parquet')

# %%
authors = authors.join(mb_table.select('author_id'), on='author_id')

# %%
authors.write_csv('afw-author-names.csv')

# %%
works = pl.read_parquet('gr-work-info.parquet')
works = works.join(mb_table.select('work_id'), on='work_id')
works.write_csv('afw-work-titles.csv')

# %%