4.2_benchmark.codegen.html 33 KB

  1. ---
  2. title: CodeGen Benchmark
  3. keywords: fastai
  4. sidebar: home_sidebar
  5. summary: "This module is dedicated to benchmarking"
  6. ---
  7. <!--
  8. #################################################
  9. ### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
  10. #################################################
  11. # file to edit: nbs/4.2_benchmark.codegen.ipynb
  12. # command to build the docs after a change: nbdev_build_docs
  13. -->
  14. <div class="container" id="notebook-container">
  15. <div class="cell border-box-sizing code_cell rendered">
  16. </div>
  17. <div class="cell border-box-sizing code_cell rendered">
  18. </div>
  19. <div class="cell border-box-sizing code_cell rendered">
  20. <div class="input">
  21. <div class="inner_cell">
  22. <div class="input_area">
  23. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;../benchmarking/traceability/&#39;</span><span class="p">)</span>
  24. </pre></div>
  25. </div>
  26. </div>
  27. </div>
  28. </div>
  29. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  30. <div class="text_cell_render border-box-sizing rendered_html">
  31. <h2 id="BPE-Testbed">BPE Testbed<a class="anchor-link" href="#BPE-Testbed">&#182;</a></h2>
  32. </div>
  33. </div>
  34. </div>
  35. <div class="cell border-box-sizing code_cell rendered">
  36. <div class="input">
  37. <div class="inner_cell">
  38. <div class="input_area">
  39. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">english_bpe</span> <span class="o">=</span> <span class="s1">&#39;english_bpe&#39;</span>
  40. <span class="n">italian_bpe</span> <span class="o">=</span> <span class="s1">&#39;italian_bpe&#39;</span>
  41. </pre></div>
  42. </div>
  43. </div>
  44. </div>
  45. </div>
  46. <div class="cell border-box-sizing code_cell rendered">
  47. <div class="input">
  48. <div class="inner_cell">
  49. <div class="input_area">
  50. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">sp_model_from_glob</span><span class="p">(</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets/english&#39;</span><span class="p">,</span><span class="s1">&#39;*/*all*&#39;</span><span class="p">,</span> <span class="n">english_bpe</span><span class="p">)</span>
  51. <span class="n">sp_model_from_glob</span><span class="p">(</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets/italian&#39;</span><span class="p">,</span><span class="s1">&#39;*/*all*&#39;</span><span class="p">,</span> <span class="n">italian_bpe</span><span class="p">)</span>
  52. </pre></div>
  53. </div>
  54. </div>
  55. </div>
  56. </div>
  57. <div class="cell border-box-sizing code_cell rendered">
  58. <div class="input">
  59. <div class="inner_cell">
  60. <div class="input_area">
  61. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span>
  62. </pre></div>
  63. </div>
  64. </div>
  65. </div>
  66. <div class="output_wrapper">
  67. <div class="output">
  68. <div class="output_area">
  69. <div class="output_text output_subarea output_execute_result">
  70. <pre>PosixPath(&#39;../benchmarking/traceability/datasets&#39;)</pre>
  71. </div>
  72. </div>
  73. </div>
  74. </div>
  75. </div>
  76. <div class="cell border-box-sizing code_cell rendered">
  77. <div class="input">
  78. <div class="inner_cell">
  79. <div class="input_area">
  80. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">output_bpe_tokenization</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">languages</span><span class="p">):</span>
  81.     <span class="k">for</span> <span class="n">language</span> <span class="ow">in</span> <span class="n">languages</span><span class="p">:</span>
  82.         <span class="n">req_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*req]&#39;</span><span class="p">))</span>
  83.         <span class="n">src_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*src]&#39;</span><span class="p">))</span>
  84.         <span class="n">tc_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*tc]&#39;</span><span class="p">))</span>
  85.         <span class="n">spm</span> <span class="o">=</span> <span class="n">sp</span><span class="o">.</span><span class="n">SentencePieceProcessor</span><span class="p">()</span>
  86.         <span class="n">spm</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="nb">str</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="o">/</span><span class="n">f</span><span class="s2">&quot;</span><span class="si">{language}</span><span class="s2">_bpe.model&quot;</span><span class="p">)))</span>
  87.         <span class="n">output</span> <span class="o">=</span> <span class="n">path</span><span class="o">/</span><span class="s1">&#39;testbeds&#39;</span><span class="o">/</span><span class="s1">&#39;bpe&#39;</span><span class="o">/</span><span class="n">language</span>
  88.         <span class="n">req_docs</span> <span class="o">=</span> <span class="n">tokenize_fns</span><span class="p">(</span><span class="n">req_fns</span><span class="p">,</span> <span class="n">spm</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;txt&#39;</span><span class="p">,</span> <span class="s1">&#39;TXT&#39;</span><span class="p">],</span> <span class="n">output</span><span class="p">,</span> <span class="s1">&#39;req&#39;</span><span class="p">)</span>
  89.         <span class="n">src_docs</span> <span class="o">=</span> <span class="n">tokenize_fns</span><span class="p">(</span><span class="n">src_fns</span><span class="p">,</span> <span class="n">spm</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;c&#39;</span><span class="p">,</span> <span class="s1">&#39;java&#39;</span><span class="p">],</span> <span class="n">output</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">)</span>
  90.         <span class="n">tc_docs</span> <span class="o">=</span> <span class="n">tokenize_fns</span><span class="p">(</span><span class="n">tc_fns</span><span class="p">,</span> <span class="n">spm</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;c&#39;</span><span class="p">,</span> <span class="s1">&#39;java&#39;</span><span class="p">],</span> <span class="n">output</span><span class="p">,</span> <span class="s1">&#39;tc&#39;</span><span class="p">)</span>
  91. </pre></div>
  92. </div>
  93. </div>
  94. </div>
  95. </div>
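<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The glob patterns in <code>output_bpe_tokenization</code> end in an unmatched <code>]</code>, which glob treats as a literal character, so <code>**/*req]</code> picks up directories with names such as <code>[libest-raw-req]</code>. The following is a minimal sketch of that matching behaviour using only the standard library. The helper names <code>find_artifact_dirs</code> and <code>demo</code>, and the temporary directory layout, are assumptions made for illustration and are not part of this module.</p>
</div>
</div>
</div>

```python
import tempfile
from pathlib import Path


def find_artifact_dirs(root, suffix):
    """Return directories under root whose names end with '<suffix>]'."""
    # An unmatched ']' is a literal character in glob patterns, so
    # '**/*req]' matches names such as '[libest-raw-req]' at any depth.
    return sorted(p for p in Path(root).glob(f"**/*{suffix}]") if p.is_dir())


def demo():
    # Hypothetical layout that mirrors the directory naming seen in this notebook.
    layout = {"libest": ["req", "src", "tc"], "itrust": ["req", "src"]}
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp) / "datasets" / "english"
        for system, kinds in layout.items():
            for kind in kinds:
                (base / system / f"[{system}-raw-{kind}]").mkdir(parents=True)
        # Collect only the directory names so the result outlives the temp dir.
        return {
            kind: [p.name for p in find_artifact_dirs(base, kind)]
            for kind in ("req", "src", "tc")
        }


if __name__ == "__main__":
    for kind, names in demo().items():
        print(kind, names)
```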
  96. <div class="cell border-box-sizing code_cell rendered">
  97. <div class="input">
  98. <div class="inner_cell">
  99. <div class="input_area">
  100. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">languages</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;english&#39;</span><span class="p">,</span> <span class="s1">&#39;italian&#39;</span><span class="p">]</span>
  101. <span class="n">output_bpe_tokenization</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">languages</span><span class="p">)</span>
  102. </pre></div>
  103. </div>
  104. </div>
  105. </div>
  106. </div>
  107. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  108. <div class="text_cell_render border-box-sizing rendered_html">
  109. <h1 id="Entropy-Benchmark">Entropy Benchmark<a class="anchor-link" href="#Entropy-Benchmark">&#182;</a></h1>
  110. </div>
  111. </div>
  112. </div>
  113. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  114. <div class="text_cell_render border-box-sizing rendered_html">
  115. <h2 id="Read-in-the-data">Read in the data<a class="anchor-link" href="#Read-in-the-data">&#182;</a></h2>
  116. </div>
  117. </div>
  118. </div>
  119. <div class="cell border-box-sizing code_cell rendered">
  120. <div class="input">
  121. <div class="inner_cell">
  122. <div class="input_area">
  123. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">english_systems</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;itrust&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">],</span> <span class="s1">&#39;libest&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">,</span> <span class="s1">&#39;tc&#39;</span><span class="p">]}</span>
  124. <span class="n">italian_systems</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;albergate&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">],</span> <span class="s1">&#39;ebt&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">,</span> <span class="s1">&#39;tc&#39;</span><span class="p">],</span> <span class="s1">&#39;etour&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">],</span> <span class="s1">&#39;smos&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;req&#39;</span><span class="p">,</span> <span class="s1">&#39;src&#39;</span><span class="p">]}</span>
  125. </pre></div>
  126. </div>
  127. </div>
  128. </div>
  129. </div>
  130. <div class="cell border-box-sizing code_cell rendered">
  131. <div class="input">
  132. <div class="inner_cell">
  133. <div class="input_area">
  134. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">flatten</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">l</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
  135. </pre></div>
  136. </div>
  137. </div>
  138. </div>
  139. </div>
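<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>flatten</code> helper removes exactly one level of nesting. As a small illustration, assuming each data type contributes a list of tokenized documents, flattening merges those per-type lists into one list of documents while leaving each document's tokens intact. The sample data below is made up for the example.</p>
</div>
</div>
</div>

```python
from itertools import chain

# Same helper as in the notebook: removes one level of nesting.
flatten = lambda l: [item for sublist in l for item in sublist]

# Hypothetical input: two data types, each holding tokenized documents.
docs_by_type = [
    [["get", "user"], ["set", "user"]],  # e.g. documents of one data type
    [["int", "main"]],                   # e.g. documents of another data type
]

all_docs = flatten(docs_by_type)

# itertools.chain.from_iterable is an equivalent standard-library spelling.
all_docs_chain = list(chain.from_iterable(docs_by_type))

if __name__ == "__main__":
    print(all_docs)
```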
  140. <div class="cell border-box-sizing code_cell rendered">
  141. <div class="input">
  142. <div class="inner_cell">
  143. <div class="input_area">
  144. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">calc_entropy_benchmark</span><span class="p">(</span><span class="n">systems</span><span class="p">,</span> <span class="n">language</span><span class="p">):</span>
  145.     <span class="k">for</span> <span class="n">sys</span> <span class="ow">in</span> <span class="n">systems</span><span class="p">:</span>
  146.         <span class="n">sys_docs</span> <span class="o">=</span> <span class="p">[]</span>
  147.         <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;System:&#39;</span><span class="p">,</span> <span class="n">sys</span><span class="p">)</span>
  148.         <span class="k">for</span> <span class="n">data_type</span> <span class="ow">in</span> <span class="n">systems</span><span class="p">[</span><span class="n">sys</span><span class="p">]:</span>
  149.             <span class="n">data_path</span> <span class="o">=</span> <span class="n">path</span><span class="o">/</span><span class="s1">&#39;testbeds/bpe&#39;</span><span class="o">/</span><span class="n">language</span><span class="o">/</span><span class="n">sys</span><span class="o">/</span><span class="n">data_type</span>
  150.             <span class="n">sys_docs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">read_bpe_files</span><span class="p">(</span><span class="n">data_path</span><span class="p">))</span>
  151.             <span class="n">entropies</span> <span class="o">=</span> <span class="n">get_entropies_from_docs</span><span class="p">(</span><span class="n">sys_docs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
  152.             <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Data Type:&#39;</span><span class="p">,</span> <span class="n">data_type</span><span class="p">)</span>
  153.             <span class="n">report_stats</span><span class="p">(</span><span class="n">entropies</span><span class="p">)</span>
  154.             <span class="n">entropy</span> <span class="o">=</span> <span class="n">get_entropy_from_docs</span><span class="p">(</span><span class="n">sys_docs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
  155.             <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Total Corpus Entropy:&#39;</span><span class="p">,</span> <span class="n">entropy</span><span class="p">)</span>
  156.             <span class="nb">print</span><span class="p">()</span>
  157.         <span class="n">entropy</span> <span class="o">=</span> <span class="n">get_entropy_from_docs</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">sys_docs</span><span class="p">))</span>
  158.         <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Total System Entropy:&#39;</span><span class="p">,</span> <span class="n">entropy</span><span class="p">)</span>
  159.         <span class="n">entropy</span> <span class="o">=</span> <span class="n">shared_entropy_from_docs</span><span class="p">(</span><span class="n">sys_docs</span><span class="p">)</span>
  160.         <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Shared Entropy:&#39;</span><span class="p">,</span> <span class="n">entropy</span><span class="p">)</span>
  161.         <span class="nb">print</span><span class="p">()</span>
  162. </pre></div>
  163. </div>
  164. </div>
  165. </div>
  166. </div>
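<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The helpers <code>get_entropies_from_docs</code> and <code>get_entropy_from_docs</code> are defined elsewhere and are not shown on this page. As a rough sketch of what a per-document and a whole-corpus entropy could look like, the code below computes Shannon entropy in bits over token frequencies. The function names <code>shannon_entropy</code>, <code>per_doc_entropies</code>, and <code>corpus_entropy</code> are assumptions for illustration and may differ from the real implementation. The base-2 choice is also an assumption, although one reported minimum in the Italian results (3.169925001442312) equals log2(9), which is what a document of nine equally frequent tokens would give.</p>
</div>
</div>
</div>

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty document carries no information
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def per_doc_entropies(docs):
    """One entropy value per tokenized document (hypothetical helper)."""
    return [shannon_entropy(doc) for doc in docs]


def corpus_entropy(docs):
    """Entropy of all documents pooled into one token stream (hypothetical helper)."""
    return shannon_entropy([tok for doc in docs for tok in doc])


if __name__ == "__main__":
    docs = [["a", "a", "b", "b"], ["a", "c"]]
    print(per_doc_entropies(docs))
    print(corpus_entropy(docs))
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Pooling documents typically raises the entropy relative to individual documents, because the pooled vocabulary is larger. That pattern is consistent with the totals reported in the benchmark output exceeding the per-document averages.</p>
</div>
</div>
</div>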
  167. <div class="cell border-box-sizing code_cell rendered">
  168. <div class="input">
  169. <div class="inner_cell">
  170. <div class="input_area">
  171. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">calc_entropy_benchmark</span><span class="p">(</span><span class="n">english_systems</span><span class="p">,</span> <span class="s1">&#39;english&#39;</span><span class="p">)</span>
  172. </pre></div>
  173. </div>
  174. </div>
  175. </div>
  176. <div class="output_wrapper">
  177. <div class="output">
  178. <div class="output_area">
  179. <div class="output_subarea output_stream output_stdout output_text">
  180. <pre>System: itrust
  181. Data Type: req
  182. Max: 6.655947403853904
  183. Min: 3.6464393446710157
  184. Average: 5.125309432616202
  185. Median: 5.238901256602631
  186. Standard Deviation: 0.7675282320547024
  187. Median Absolute Deviation: 0.9497244658563296
  188. 95% of the data fall within 4.992640720488694 and 5.25797814474371
  189. Total Corpus Entropy: 8.138886303909846
  190. Data Type: src
  191. Max: 7.6191109926622875
  192. Min: 4.881336276904696
  193. Average: 6.522153794169928
  194. Median: 6.456654661625311
  195. Standard Deviation: 0.47046257540776115
  196. Median Absolute Deviation: 0.42263165151349985
  197. 95% of the data fall within 6.460067634640814 and 6.584239953699043
  198. Total Corpus Entropy: 8.562837202994778
  199. Total System Entropy: 8.68235305057625
  200. shared counts...
  201. Shared Entropy: 6.675375899716576
  202. System: libest
  203. Data Type: req
  204. Max: 8.133644403908326
  205. Min: 4.694019357121934
  206. Average: 6.543663643429754
  207. Median: 6.5960839256764
  208. Standard Deviation: 0.7998515650224866
  209. Median Absolute Deviation: 0.8070430386925508
  210. 95% of the data fall within 6.3209835459644115 and 6.766343740895097
  211. Total Corpus Entropy: 9.183085440385813
  212. Data Type: src
  213. Max: 8.170092696228092
  214. Min: 7.095192719326445
  215. Average: 7.753354441597045
  216. Median: 7.833169021739864
  217. Standard Deviation: 0.39264159352964123
  218. Median Absolute Deviation: 0.47071304168486106
  219. 95% of the data fall within 7.425097854785096 and 8.081611028408993
  220. Total Corpus Entropy: 8.36678856297728
  221. Data Type: tc
  222. Max: 8.842816439862581
  223. Min: 6.981127448895606
  224. Average: 7.833269648196844
  225. Median: 7.6436099893337754
  226. Standard Deviation: 0.5961050558091697
  227. Median Absolute Deviation: 0.49366977593502653
  228. 95% of the data fall within 7.561925879993082 and 8.104613416400605
  229. Total Corpus Entropy: 8.642622470130963
  230. Total System Entropy: 8.995630522415091
  231. shared counts...
  232. Shared Entropy: 7.217719542607787
  233. </pre>
  234. </div>
  235. </div>
  236. </div>
  237. </div>
  238. </div>
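<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The implementation of <code>report_stats</code> is not shown on this page. The sketch below computes a comparable set of summary statistics with the standard library only. The name <code>summarize</code>, the use of the population standard deviation, the unscaled median absolute deviation, and the normal-approximation interval are all assumptions, so the numbers it produces need not match the output above. Note that the reported 95% intervals are much narrower than the Min to Max range (for example 4.99 to 5.26 against 3.65 to 6.66 for the itrust requirements), which suggests they describe uncertainty about the mean rather than the spread of individual documents.</p>
</div>
</div>
</div>

```python
import math
import statistics


def summarize(values, z=1.96):
    """Summary statistics for a list of per-document entropies (illustrative)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # Raw median absolute deviation, without any normal-consistency scaling.
    mad = statistics.median([abs(v - median) for v in values])
    # Normal-approximation 95% interval for the mean (assumes len(values) >= 2).
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return {
        "max": max(values),
        "min": min(values),
        "average": mean,
        "median": median,
        "std": statistics.pstdev(values),
        "mad": mad,
        "mean_interval_95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    for name, value in summarize([1.0, 2.0, 3.0, 4.0, 5.0]).items():
        print(f"{name}: {value}")
```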
  309. <div class="cell border-box-sizing code_cell rendered">
  310. <div class="input">
  311. <div class="inner_cell">
  312. <div class="input_area">
  313. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">calc_entropy_benchmark</span><span class="p">(</span><span class="n">italian_systems</span><span class="p">,</span> <span class="s1">&#39;italian&#39;</span><span class="p">)</span>
  314. </pre></div>
  315. </div>
  316. </div>
  317. </div>
  318. <div class="output_wrapper">
  319. <div class="output">
  320. <div class="output_area">
  321. <div class="output_subarea output_stream output_stdout output_text">
  322. <pre>System: albergate
  323. Data Type: req
  324. Max: 7.451061154959707
  325. Min: 6.7760271692033065
  326. Average: 7.114662232676978
  327. Median: 7.136128583124726
  328. Standard Deviation: 0.18028670118666462
  329. Median Absolute Deviation: 0.19215087521795557
  330. 95% of the data fall within 7.021967364311235 and 7.207357101042722
  331. Total Corpus Entropy: 8.333064635075106
  332. Data Type: src
  333. Max: 7.632003142360007
  334. Min: 5.694455777930451
  335. Average: 6.698395952158591
  336. Median: 6.585514345171939
  337. Standard Deviation: 0.47533702601616096
  338. Median Absolute Deviation: 0.55637410510539
  339. 95% of the data fall within 6.569894354034409 and 6.826897550282773
  340. Total Corpus Entropy: 8.02635009717346
  341. Total System Entropy: 8.284551907349753
  342. Shared Entropy: 5.704935783592468
  343. System: ebt
  344. Data Type: req
  345. Max: 4.85798099512757
  346. Min: 3.169925001442312
  347. Average: 4.036522483018428
  348. Median: 4.037401197654112
  349. Standard Deviation: 0.4423825943264807
  350. Median Absolute Deviation: 0.49957908952600216
  351. 95% of the data fall within 3.896889307383322 and 4.176155658653533
  352. Total Corpus Entropy: 6.787949596598939
  353. Data Type: src
  354. Max: 7.297368573550914
  355. Min: 4.784576473149472
  356. Average: 5.963495662337905
  357. Median: 5.903977747837278
  358. Standard Deviation: 0.633890544451061
  359. Median Absolute Deviation: 0.7640379035089384
  360. 95% of the data fall within 5.783345963113105 and 6.143645361562706
  361. Total Corpus Entropy: 8.433164216462012
  362. Data Type: tc
  363. Max: 6.097097085934416
  364. Min: 4.704511459715549
  365. Average: 5.247301653814626
  366. Median: 5.281405982501043
  367. Standard Deviation: 0.36921703418357243
  368. Median Absolute Deviation: 0.4094487421886657
  369. 95% of the data fall within 5.094896352658618 and 5.399706954970633
  370. Total Corpus Entropy: 7.081408121899548
  371. Total System Entropy: 8.658242334349849
  372. Shared Entropy: 5.006590016354143
  373. System: etour
  374. Data Type: req
  375. Max: 6.237393834397653
  376. Min: 5.29192090403933
  377. Average: 5.852879032919363
  378. Median: 5.845473100698095
  379. Standard Deviation: 0.19175655109469364
  380. Median Absolute Deviation: 0.1803895702893011
  381. 95% of the data fall within 5.802459218071262 and 5.903298847767464
  382. Total Corpus Entropy: 7.10566459632011
  383. Data Type: src
  384. Max: 8.048968980820781
  385. Min: 5.539696852908118
  386. Average: 6.562921919889305
  387. Median: 6.565193494077665
  388. Standard Deviation: 0.541016209002899
  389. Median Absolute Deviation: 0.5387698222061211
  390. 95% of the data fall within 6.463421809399394 and 6.662422030379217
  391. Total Corpus Entropy: 8.74630095106817
  392. Total System Entropy: 8.821970266170094
  393. Shared Entropy: 5.822212387865012
  394. System: smos
  395. Data Type: req
  396. Max: 6.620285755044158
  397. Min: 5.400701696091561
  398. Average: 6.117630402688541
  399. Median: 6.140428349929661
  400. Standard Deviation: 0.3028871332208669
  401. Median Absolute Deviation: 0.29838532570331994
  402. 95% of the data fall within 6.043750425878531 and 6.191510379498552
  403. Total Corpus Entropy: 7.395107792428341
  404. Data Type: src
  405. Max: 7.868448246363015
  406. Min: 5.351552244391098
  407. Average: 6.646037374067181
  408. Median: 6.700747575555027
  409. Standard Deviation: 0.4604239797792223
  410. Median Absolute Deviation: 0.354721814886772
  411. 95% of the data fall within 6.554679267511279 and 6.737395480623084
  412. Total Corpus Entropy: 8.357381483417843
  413. Total System Entropy: 8.595914118672756
  414. Shared Entropy: 5.660800120573564
  415. </pre>
  416. </div>
  417. </div>
  418. </div>
  419. </div>
  420. </div>
  421. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  422. <div class="text_cell_render border-box-sizing rendered_html">
  423. <h1 id="SCRATCH-WORK">SCRATCH WORK<a class="anchor-link" href="#SCRATCH-WORK">&#182;</a></h1>
  424. </div>
  425. </div>
  426. </div>
  427. <div class="cell border-box-sizing code_cell rendered">
  428. <div class="input">
  429. <div class="inner_cell">
  430. <div class="input_area">
  431. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">language</span> <span class="ow">in</span> <span class="n">languages</span><span class="p">:</span>
  432.     <span class="n">req_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*req]&#39;</span><span class="p">))</span>
  433.     <span class="n">src_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*src]&#39;</span><span class="p">))</span>
  434.     <span class="n">tst_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*tc]&#39;</span><span class="p">))</span>
  435.     <span class="n">spm</span> <span class="o">=</span> <span class="n">sp</span><span class="o">.</span><span class="n">SentencePieceProcessor</span><span class="p">()</span>
  436.     <span class="n">spm</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="nb">str</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets&#39;</span><span class="o">/</span><span class="n">language</span><span class="o">/</span><span class="n">f</span><span class="s2">&quot;</span><span class="si">{language}</span><span class="s2">_bpe.model&quot;</span><span class="p">)))</span>
  437.     <span class="n">all_fns</span> <span class="o">=</span> <span class="n">flatten</span><span class="p">(</span><span class="n">req_fns</span> <span class="o">+</span> <span class="n">src_fns</span> <span class="o">+</span> <span class="n">tst_fns</span><span class="p">)</span>
  438.     <span class="n">all_docs</span> <span class="o">=</span> <span class="n">tokenize_fns</span><span class="p">(</span><span class="n">all_fns</span><span class="p">,</span> <span class="n">spm</span><span class="p">)</span>
  439. </pre></div>
  440. </div>
  441. </div>
  442. </div>
  443. </div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">req_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets/english&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*req]&#39;</span><span class="p">))</span>
<span class="n">src_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets/english&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*src]&#39;</span><span class="p">))</span>
<span class="n">tst_fns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="s1">&#39;datasets/english&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;**/*tc]&#39;</span><span class="p">))</span>
<span class="n">req_fns</span><span class="p">[:</span><span class="mi">5</span><span class="p">],</span> <span class="n">src_fns</span><span class="p">[:</span><span class="mi">5</span><span class="p">],</span> <span class="n">tst_fns</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>([PosixPath(&#39;../benchmarking/traceability/datasets/english/libest/[libest-raw-req]&#39;),
  PosixPath(&#39;../benchmarking/traceability/datasets/english/itrust/[itrust-raw-req]&#39;)],
 [PosixPath(&#39;../benchmarking/traceability/datasets/english/libest/[libest-raw-src]&#39;),
  PosixPath(&#39;../benchmarking/traceability/datasets/english/itrust/[itrust-raw-src]&#39;)],
 [PosixPath(&#39;../benchmarking/traceability/datasets/english/libest/[libest-raw-tc]&#39;)])</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">all_fns</span> <span class="o">=</span> <span class="n">flatten</span><span class="p">(</span><span class="n">req_fns</span> <span class="o">+</span> <span class="n">src_fns</span> <span class="o">+</span> <span class="n">tst_fns</span><span class="p">)</span>
<span class="n">all_docs</span> <span class="o">=</span> <span class="n">tokenize_fns</span><span class="p">(</span><span class="n">all_fns</span><span class="p">,</span> <span class="n">spm</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">req_docs</span> <span class="o">=</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">src_fns</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>&#39;libest&#39;</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">list</span><span class="p">(</span><span class="n">path</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;datasets/italian/*/*&#39;</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>[PosixPath(&#39;../benchmarking/traceability/datasets/italian/smos/[smos-raw-src]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/smos/[smos-raw-req]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/smos/[smos-all].txt&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/albergate/[albergate-all].txt&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/albergate/[albergate-raw-src]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/albergate/[albergate-raw-req]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/ebt/[ebt-all].txt&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/ebt/[ebt-raw-src]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/ebt/[ebt-raw-tc].txt&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/ebt/[ebt-raw-req].txt&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/etour/[etour-raw-src]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/etour/[etour-raw-req]&#39;),
 PosixPath(&#39;../benchmarking/traceability/datasets/italian/etour/[etour-all].txt&#39;)]</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">nbdev.export</span> <span class="k">import</span> <span class="n">notebook2script</span>
<span class="n">notebook2script</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Converted 00_mgmnt.prep.i.ipynb.
Converted 01_exp.i.ipynb.
Converted 02_mgmnt.db.mongo.ipynb.
Converted 03_repr.i.ipynb.
Converted 04_mining.ir.model.ipynb.
Converted 05_mining.ir.i.ipynb.
Converted 06_benchmark.traceability.ipynb.
Converted 07_repr.roberta.train.ipynb.
Converted 08_exp.info.ipynb.
Converted 09_desc.stats.ipynb.
Converted 10_vis.ipynb.
Converted 11_mgmnt.prep.nltk.ipynb.
Converted 12_repr.roberta.eval.ipynb.
Converted 14_mgmnt.prep.bpe.ipynb.
Converted 15_desc.metrics.se.ipynb.
Converted 16_repr.word2vec.train.ipynb.
Converted 17_repr.doc2vec.train.ipynb.
Converted 18_repr.doc2vec.eval.ipynb.
Converted 19_repr.word2vec.eval.ipynb.
Converted 20_benchmark.codegen.ipynb.
Converted 21_inf.i.ipynb.
Converted 22_inf.bayesian.ipynb.
Converted 23_inf.causal.ipynb.
Converted aa_blog.example.ipynb.
Converted ab_templates.example.ipynb.
Converted ac_emp.eval.pp1.rq1.ipynb.
Converted ad_emp.eval.pp1.rq2.ipynb.
Converted ae_emp.eval.pp1.rq3.ipynb.
Converted af_emp.eval.pp1.rq4.ipynb.
Converted index.ipynb.
</pre>
</div>
</div>
</div>
</div>
</div>