TODO
-:[bs]ug +:done x:dropped ^:forwarded v:with-history #:note ?:question
Thu Dec 05 2013
- simplify hxlib: dlopen actually does all that LD_LIBRARY_PATH traversal itself (!)
Mon Nov 27 2013
- hxlox improvements
- hxlock(hp) that lets you bypass all lock calls --- released by hxrel.
Typically used by one-writer/many-readers, where the writer process signals
all readers to release their persistent locks, falling back to intra-call locks,
which are slower but permit readers to continue while writer does its updates
using intra-call locks.
Sun Oct 20 2013
- hxbuild should use mmap'd file.
Thu Oct 03 2013
# Rolled back hxlox upgrades until I have time to sort them out.
Currently, many_t.pass but it's always locking (0,0)! WTF
- hxopen *overrides* a previous setting of hxdebug, with getenv("HXDEBUG").
- use the leap-by-two idea to simplify finding loops in chains in hxfix().
/* Floyd's two-pointer cycle check: fast advances two links per pass,
   slow advances one; they can only meet if the chain has a loop. */
int hasloop(LIST *list)
{
    if (list == NULL) return 0;
    LIST *slow = list, *fast = list;
    while (1) {
        fast = fast->next;
        if (fast == NULL) return 0;
        if (fast == slow) return 1;
        fast = fast->next;
        if (fast == NULL) return 0;
        if (fast == slow) return 1;
        slow = slow->next;
    }
}
Tue Oct 01 2013
- Clean up the lock mechanism:
- add a sorted PAGENO vector that has the length in [0]
Thu Sep 26 2013
- hxfreeze(hp,onoff) takes a persistent read lock on the whole file, and stores npages in hp.
hxrel(hp) is independent of this ... perhaps hxnext can ignore locking issues
altogether and become a dirty scan?
- add npages to HXFILE{}. When it's not zero, the file is frozen.
- all ops except (hxget,hxnext) are BAD_REQUEST while frozen.
- Calculate the number of cache line hits for each hxget, assuming hxfreeze().
- page header; hindex; record
Wed Sep 25 2013
- Optimize unlock to _hxunlock(locp,0,0), if hp->lockv is empty.
Sun Sep 22 2013
- For HXFILE to compete with CDB, it needs a persistent file-wide readonly lock
("hxlock"? "hxfreeze"?) that is (only) released by "hxrel".
While the file is under hxlock, it does no fcntl or lseek(END) --- the file size is const, too.
no more lseeks, fwiw. Write a test that a file locked that way can be shared
but not updated.
Sat Sep 21 2013
- Make filetype mandatory leading string ("", 1) instead of (NULL, 0) unless HX_STATIC specified.
Make hxopen fail if file has no type? How do you know which is the error?
- no such file
- invalid file header
- unable to load typelib
- deps are wrong: many_t.pass ought to force many_t to rebuild, and many_t should rebuild
  if hxlox.c changes (hence libhx.a).
AHA: the build pattern (%.pass : %) does nothing, because the targets are
local names (no proj-specific prefix boo) while deps are not.
many_t.pass : many_t (but you only have rules for building $(PWD)/many_t). Gah.
+ locking has another problem. HIGH locks just the head: BOTH tries to lock the old split (5,6)
but then finds that the file size has changed (so SPLIT moved from 5 to 7),
so it unlocks all but the head ... then tries to lock pgno=(9,10)???
# solution is that when locking "just" the head, instead lock the set of
potential splits starting at head.
Thu Sep 19 2013
+ sed -i 's/sizeof(\(\**[a-z][a-z_0-9]*\))/sizeof \1/g; s/sizeof(\("[^"]*"\))/sizeof \1/g' *.[ch]
Fri Sep 13 2013
- make _hxfindHeads take an output argument instead of implicitly using locp->vprev.
Wed Sep 11 2013
+ deadlocks are possible. At first, it looks like D computed the head (15)
the split (13). When it reached hxlockset(BOTH), the file size had already
changed, and the split was now GREATER than 15. WTF? How did the split move
past (15) when D had 15 locked??
From many_t:
@PUT D00205 89
lock start=15 count=1 npages=33 < ---() _hxlock:52 D
lock > ---(15) _hxlock:101 D
load pgno=15 next=16 used=3940 recs=58 hsize=74 _hxload:271 D
lock start=16 count=1 npages=33 < ---(15) _hxlock:52 D
lock > ---(15 16) _hxlock:101 D
load pgno=16 next=0 used=3939 recs=51 hsize=74 _hxload:271 D
unlock start=16 count=1 npages=33 < ---(15 16) _hxunlock:203 D
unlock > ---(15) _hxunlock:228 D
lock start=13 count=2 npages=33 < ---(15) _hxlock:52 D
lock > ---(15 13 14) _hxlock:101 D
lock start=11 count=1 npages=33 < ---(15 13 14) _hxlock:52 D
lock > ---(15 13 14 11) _hxlock:101 D
npages changes: 33 >> 37 _hxsize:392 D
head=15 old=15 part=BOTH _hxlockset:190 D
unlock start=1 count=14 npages=37 < ---(15 13 14 11) _hxunlock:203 D
unlock > ---(15) _hxunlock:228 D
lock start=17 count=2 npages=37 < ---(15) _hxlock:52 D
Mon Jul 29 2013
- remaining tasks before release:
- hard proof that deadlocks are impossible
- incremental hxindexify
- 48bit hashes
Wed Jul 03 2013
- add hindex to "chx dump"
Sat Jun 22 2013
+ add a serious concurrency test, that concurrent update is (a) correct (b) efficient.
- fork ten processes (0..9)
- add key=n00000..n99999, val=0
- 10 times, for each key, val=val+1
- check that there are 000000..999999 with value 10.
- hh
- the hxgrow/hxsplit logic could be made a tiny bit smarter by noticing when
  its source is (head) and both its target pages had enough space for
  the record to be inserted.
Wed Jun 19 2013
- Use XMM code to do the offset-adjustment of hind[] AND behind[] really fast.
Suggest making behind[] have a multiple of 8 elements.
#ifdef __SSE2__
    static const short mask[16] = { 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1 };
    __m128i *xp, xoff = _mm_set1_epi16(offset), xlen = _mm_set1_epi16(delta);
    pos = (bufp->used + 1)/2;
    if (pos & 7)
        xp = (__m128i*)(hind - pos),
        *xp = _mm_add_epi16(*xp, _mm_and_si128(_mm_lddqu_si128((__m128i const*)(mask + 8 - (pos & 7))),
                    _mm_and_si128(xlen, _mm_cmpgt_epi16(*xp, xoff))));
    for (; (char*)xp < bufp->data + DATASIZE(hp); ++xp)
        *xp = _mm_add_epi16(*xp, _mm_and_si128(xlen, _mm_cmpgt_epi16(*xp, xoff)));
    for (xp = (__m128i*)bufp->behind; (COUNT*)xp < bufp->behind + bufp->nbehind; ++xp)
        *xp = _mm_add_epi16(*xp, _mm_and_si128(xlen, _mm_cmpgt_epi16(*xp, xoff)));
#else
    for (i = HINDSIZE(bufp); --i >= 0;)
        if (bufp->hind[-i] > offset) bufp->hind[-i] += delta;
    for (i = 0; i < bufp->nbehind; ++i)
        if (bufp->behind[i] > offset) bufp->behind[i] += delta;
#endif
+ _hxmask >> MASK
- GNUmakefile install xx
- hx_.c scramble hash
- hxshape.c (gets _hxmove logic)
Sun May 26 2013
- next step: 64bit hashes
- add something like FOR_EACH_REC for traversing hind[].
- when you _hxremove a record, be sure to zap its hind[] entry,
else _redex will save that bogus offset.
Hey! its offset will be saved in buf.{offset,delta}
for the fixup.
Wed May 22 2013
- incremental change tracking in _hxshift is BRUTAL.
It is not worth it for srcp, since you create multiple holes.
dstp changes (where dstp != srcp) are doable:
if dstp->used + size > dstp->orig
scan [dstp->used ..< dstp->used+size] for nonzero hash entries
and push them into dstp->redex[] before memmove.
- cases:
remove(p1,old)
... save bufp(offset,oldsize)
shift(p1,p2) ---
split
... while shifting, when
<:
=:
>:
shift(p2,p3) [ < ]
append(p3,new) [ < = > ]
- remove(p1,old) shift(p1,p2)[ < = > ] shift(p2,p3) [ = > ] append(p3,new) [ > ]
- replace(p1,old,new)[ < = > ]; shift(p1,p2)
... and does hxgrow FORCE hxindexify, or does it become a fallthrough?
# after remove: set locp->(offset,delta) for later
# during shift, cache hind(+delta?) from data[max(used,orig)..used+len]
# after shift
Mon May 20 2013
- split hxupd etc into routines that SET Dirty and those that TEST Dirty.
# making the hash table smaller (e.g. power of 2 > (nrecs + 1) * 5 / 4)
  is a dead end ... it makes insertion AND search less efficient.
Better to work out the separate incremental steps carefully.
- basic plan is clear:
+ make _hxindexed() independent of _hxindexify()
- add redex[] list (limited size) to HXBUF
? what affects hind[]?
- (offset,delta) where delta < 0
- (offset,orig,used) where used < orig
Currently, hxput is careful to REPLACE an old rec at the same offset.
Suppose remove(p1,oldrec), shift(p1,p2), append(p2,newrec).
_hxsave (p1) must adjust offsets of all records after deletion:
COUNT *hind = (COUNT*)bufp->data + (bufp->used + 1)/2;
COUNT *hend = (COUNT*)(bufp->data + DATASIZE(hp));
for (; hind < hend; ++hind) if (*hind > pos) *hind += delta;
shifted records are not in (p1).hind yet.
Sun May 19 2013
- performance:
- hind fixed _hxfind. Now what fixes _hxindexify?
hxput (MMAP) spends 60% of its time in hxindexify.
_hxfind is still not cheap! it is 30% of hxget.
$ perf_x 1mx.tab 24
index % time self children called name
-----------------------------------------------
0.11 3.63 4000000/4000000 main [1]
[2] 53.9 0.11 3.63 4000000 hxput [2]
0.05 1.79 4639402/4639402 sync_save [4]
0.22 0.23 471267/481223 _hxshift [11]
0.32 0.04 2639402/8049393 _hxload [8]
0.29 0.03 2163281/7516554 _hxfind [7]
-----------------------------------------------
0.06 2.49 4000000/4000000 main [1]
[3] 36.7 0.06 2.49 4000000 hxget [3]
0.73 0.07 5353273/7516554 _hxfind [7]
0.65 0.08 5353273/8049393 _hxload [8]
-----------------------------------------------
0.05 1.79 4639402/4639402 hxput [2]
[4] 26.5 0.05 1.79 4639402 sync_save [4]
0.03 1.75 2215911/2284310 _hxsave [5]
-----------------------------------------------
0.03 1.75 2215911/2284310 sync_save [4]
[5] 26.5 0.04 1.80 2284310 _hxsave [5]
1.50 0.28 2262950/2262950 _hxindexify [6]
-----------------------------------------------
1.50 0.28 2262950/2262950 _hxsave [5]
[6] 25.7 1.50 0.28 2262950 _hxindexify [6]
-----------------------------------------------
0.29 0.03 2163281/7516554 hxput [2]
0.73 0.07 5353273/7516554 hxget [3]
[7] 16.1 1.02 0.10 7516554 _hxfind [7]
x _hxfindfree: single isolated optimization.
Using an 8-byte-stride ffz() makes no measurable diff.
- multithreading. flock ([pid,inode]) is the problem.
Tue May 14 2013
+ perf_x 1m.tab 24 crashes on a hxalloc:40 BAD_FILE:
the page being freed is marked (0) in the bitmap.
# Culprit was the logic to save scanning a tail page
for any recs that matched locp->head. SOMETIMES
you get this for free from _hxshift. Sometimes not.
- _hxindexify is now the top time-user in perf.
Each hxbuf has its own little cache of entries to re-insert.
(bufp->ncached == MAX_CACHED) is the way to say, "_hxindexify!"
_hxappend needs an easy test for whether it has overwritten
part of the index. That suggests that the index should always
be (say) the power of 2 >= (nrecs+1)*9/8.
- "+1" guarantees that there is at least one empty slot,
so the stopping condition is easy?
- this makes the masking calculation (hash -> hindex) simpler.
- make _hxindexify do a two-pass insertion to improve median lookup time.
- incremental updates to hind[]:
- use coverage to determine frequency
- everything happens inside hxput and _hxgrow(_hxshift)
- _hxshift needs to be smart enough to flag minimal cases.
Should _hxappend control the reindex cache?
- _hxremove: adjust OFFSET of every hin
This may take more analysis: perhaps hind[] needs to be limited in size.
- _hxload (on an MMAP'd file!) is in second place after _hxindexify.
Mon May 13 2013
- improve hxstat: add rechash and improve keyhash:
I need an order-independent hash across all records in a file.
- use low byte of hash as an index into a table of hashes.
- xor hashes into respective table slots.
- do a non-symmetric aggregate across the table(s).
+ remove the third... args from hxopen.
Convert hxfunc into hxbind(hp,diff,hash,load,save,test).
Thu May 09 2013
+ perftest hxget-only random read mode
- perftest 100 processes calling hxput,hxget.
- perhaps locking pages below the split might require something else
Mon Apr 29 2013
- BUG: somehow we sometimes get a hxind[] with no zero entries
and _hxfind spins forever.
- update _hxprbuf to print hind[] (!)
Sun Apr 28 2013
- latest perf_x shows _hxfindfree is the single biggest cost ---
about 45% of hxput. 96% of _hxfindfree's time is in its own code.
- use SSE2 to speed up search for a zero bit in a map page
Try using ULL+bsfl instead, first; code is less cryptic.
- cache the end point (page, offset) of the last search.
Cycle through the map pages in the file. This only helps
if there are few other processes freeing pages.
Mon Feb 11 2013
- make hxstat compute a record hash (fnv04) instead of a key hash.
Mon Nov 12 2012
- hxlib parsing of $LD_LIBRARY_PATH is wrong: it treats "::" as an empty
string, not as ".", before appending "/hx_xxx.so" to it.
Tue Oct 16 2012
- Perhaps keep a tiny change log in the HXBUF, and only do
(incremental/full) _hxindexify in hxsave. This means you can
do incremental changes even if delta<0 or next!=0
- failing test in corrupt_t ("bad free next") which forces used=0
  means we need to add more proper testing of hind[].
  Guaranteed: any nonzero bytes in a page with used=0 are a corruption,
  as is a chain that points at a page with used=0.
- HX has a deadlock-avoidance scheme: hashed pages are always locked in a
specific (descending) order for the duration of an API call.
This gets complicated for "hxput", which may have to split a bunch of
pages to create space for a new record.
This means that multiple threads can work together if they
can synchronize on the current state of (ino,pid).
It's tricky, because no thread can unlock part of the file that
some other thread might need.
Otherwise, they have to single-thread themselves on the (ino).
This also has its problems. How do you associate a pthread_mutex with
an inode? You have a race condition between threads creating that
mutex, or the mutex covering a static table that maps (inode <=> mutex).
Tue Sep 18 2012
- implement hxgetv and hxputv that do everything within one whole-file lock.
ret = hxgetv(hp, nrecs, char **recv, int *lenv)
ret = hxputv(hp, nrecs, char const**recv, int const *lenv)
On entry, recv[i] must point to a buffer of at least lenv[i].
On success, ret is 0.
If there is a partial failure partway through processing the list,
lenv[i] will be BAD_REQUEST (-1) for all unprocessed records.
Note that recv[] will not necessarily be processed in order.
No, this doesn't help, because locking is insignificant overhead
once disk IO becomes an issue.
Thu Sep 13 2012
^ Implement HX_EXCLUSIVE mode. File size is stored in (HXFILE*).
Perhaps keep a copy of the entire HXLOCAL struct in HXFILE.
No fcntl or lseek(END) needed.
Fri Sep 07 2012
- consider putting hind[] at the front of the page and data (recs) at the back.
This makes hind[] calculations much easier.
It makes _hxshift harder (i think), unless you store the record length
at the END of each HXREC. Hmmm ... ugly.
Tue Aug 28 2012
- UNDEX bit will indicate to _hxsave that, even though buffer has changed,
_hxindexify should NOT be called.
- Five cases for incremental hind[] update in hxput:
This is only done if delta >= 0 or currp->next == 0.
- make _hxfind set &hindpos.
- delete (newsize == 0, or found old record and new does not fit) when currp->recs > 1
AFTER _hxremove
- remove (zero) its hind entry, and move every following hind entry
up to the next zero into save[]
- move into save[] every entry that might move into the new delta space,
plus every following entry past the split up to the next zero.
if the last split-area entry was not zero.
... then the same as shrink:
- shrink i.e delta < 0
AFTER MEMMOVE
- increase hsize
- zero out the delta space
- reinsert save[].
- add (negative) delta to every hind entry that is > pos
- update i.e. delta == 0
- test that _hxsave does not call _hxindexify.
- grow i.e. delta > 0
BEFORE MEMMOVE
- add delta to all hind entries >= pos
- move hind[] entries in delta area to save[]
- if last hind[old_hsize-1] is nonzero, move every leading entry (starting at hind[0]) to save.
- add (may_find == 0) when recs > 0:
This is a pure append to [used]
- move hind[] entries in delta area to save[]
- if last hind[] entry is nonzero, move every leading entry (starting at hind[0]) to save.
- add entry to hind for new record
There are going to be some nasty edge cases here.
Mon Aug 27 2012
- need to fix corrupt_t
+ fix perf_x! it misloads records!
- now that perf_x works, _hxindexify is now the big bottleneck (1mq.tab)
Back we go to working out an incremental version of it?
- Add "UNDEX" bit to bufp->flag.
^ implement INCREMENTAL shrink/expand
Thu Aug 23 2012
+ speed up IS_MAP by memoizing map1 (setting it in hxopen!)
IS_MAP: pgno != 0 && (pgno < hp->map1 || ...)
Sun Aug 19 2012
+ use fastcall. It seems to cut hxput time in half, and push hxget time below 2 usec.
Wed Aug 15 2012
- update docs re hind.
+ review how two processes cannot deadlock on a shared tail.
- consider alternative locking schemes ... fcntl is the biggest ovhd for mmap.
Sat Aug 11 2012
x perftest the bsrl versus shift-and-or methods of _hxmask.
bsrl: 5 ops, >>|: 15 ops
Fri Aug 10 2012
- add easy coverage for:
- "Cancel a prior hxhold() for a different key"
- hxnext in update mode skips a record in a shared tail for some other head.
- hxcheck(/dev/null)
- hxcheck: bad: head_rec map_head
- map page with bits set beyond lastpage of file.
- _hxleave: something that sets tmpmap(!?)
- cover _hxlockset: when file size changed since the last _hxsize.
This is a nasty case of how to test something with two racing processes.
- cover _hxremap: when mmap returns a different pointer.
Wed Aug 08 2012
- biggest coverage hole in hxupd is hxgrow. Artefact of the tests??
Wed Aug 01 2012
- the way hxbuild saves and reloads the tail page is slightly wasteful.
+ make lockpart an ENUM
- add to hxfix: if a chain is >HX_MAX_CHAIN, break it.
Breaking a chain means following (and breaking) every link.
However, if the tail page is shared, breaking the link is
insufficient.
Tue Jul 31 2012
- put signature ("Hx") in the two unused bytes of HXROOT.
Thu Jul 19 2012
- chx del would be sped up by sorting on revbits(hash)
Wed Jul 18 2012
- strace test on doc suggests locking calls (and there are no conflicts)
are the single biggest system-call overhead.
$ strace -c ./perf_t labs.tab 24
analyze:0.119s nrecs:245809 nbytes:16773180 inpsize:472705029 hxbuild:120
parts:31 split:82997 middle:87382 vbuf:528K _split:320
split=10.827s store=108.158s hxbuild:184
hxputs: 0 0.000s hxbuild:251
build: 119.1084
mmap: 4D4F6000 697872384: Success hxopen:98
nrecs:6.42952e+06 succ-get:74.2578 fail-get:74.3304 put+11:0.5054 put-11:0.4738 (usec) put+100:0.4756 succ-get:75.4158 fail-get:76.0424
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.68 0.246146 0 61611688 fcntl
34.74 0.162310 0 52297771 lseek
5.73 0.026752 0 7223934 read
5.47 0.025550 0 172459 write
1.39 0.006496 171 38 munmap
_hxfind's loop is still the largest overhead by far.
- (taken from hxi/TODO, Wed 02 Mar 2011)
+ make pgrate a constant
+ add recs field to +HXPAGE, +HXROOT, +HXBUF
+ update recs
+ display it in chx hdrs et al.
+ verify it in hxfix. recs < used, recs > used: used trumps recs.
+ change FITS to (bytes,nrecs).
+ rewrite hxbuild to calc space reqs using INDEX_SIZE(nrecs)
+ change hxmaxrec to (pgsize-18):
18 = sizeof(HXPAGE including .recs) + sizeof(HXREC) + INDEX_SIZE(1)
+ any place that calculates max recs per page includes allowance for recs.
(hxbuild, hxfix, ... ?)
+ use revbits to compute pgno. (actually, just bswap32 is enough).
+ implement hxind[]. Populate it from scratch in _hxsave.
^ Add "UNDEX" bit to bufp->flag.
^ set UNDEX bit inside hxshift, hxsplit.
^ set UNDEX bit *outside* hxremove, hxappend.
x remove HXBUF.orig >> still needed for SHRUNK test in hxput and _hxshift
x reorder HXBUF fields to put (next,used,recs) at the front.
This will let Refs in hxshape use recs when folding atail/ztail.
+ make hxfix check hxind. For now, _hxsave will repair it anyway.
+ try to make hxmask faster, since it is now needed by hxind.
+ rationalize macros like INDEX_SIZE (count), INDEX_MIN (min-required)
+ calc start pos in hxind[] using linear hash formula, not just modulo hsize.
+ make _hxfind use hxind[]
- hxfix must not use _hxfind any more!
+ fix hxcheck tests for dup recs in adjacent pages of the chain.
This has always been lame, and now is skipped because we can no longer
trust hindex[] to be safe; hxfix repairs the index and saves the page.
^ implement INCREMENTAL shrink/expand
# Not sure how useful this is. Start by ONLY doing this in hxput,
when currp->next == 0 (no further page mods) and
currp->recs > 1 (page is not empty!)
# note that all hash entries have to be scanned anyway, to add "delta"
to their offset values if they are above the changed part of (used) bytes.
When scanning a memory range of locations to move (compact) when hxput needs
more space for a record, any entry followed by a zero is in the correct place,
and doesn't need any hash manipulation to calculate where it belongs
(subtract xwid/2 from it).
assert(xmax > 0)
for (i = n; i >= xmax; --i)
if (xtab[-i-1] == 0) insert(xtab[-i], i-xwid/2), --i
else j = xpos(*(HXHASH*)(page + xtab[-i]), xwid)
insert(xtab[-i], j)
- update %used in hxstat to indicate based on existing nbytes/nrecs.
- perftest this mofo
Sat Jun 30 2012
- fast batch update: incremental packing may be faster than expanding.
Given a non-empty hxfile, a block of RAM, and an input stream
of updates, where each line is hx_save'd record prefixed by
"+" or "-".
non-empty hxfile, repeatedly:
- fill RAM with [+-]records
- sort records by [revbits(hash), seq].
- eliminate useless updates (only the last one per key counts).
- scan updates and file (pages affected by updates) to estimate the increase/decrease in file size
- reshape to that size (factoring a chunk out of hxshape),
- insert records without changing size -- using fine-grain locks,
and multiple insertions per block.
- postfill with hxput's
The ideal would be to make the edge case (empty input file)
close to hxbuild in speed. hxbuild has the advantage of
partitioning the entire input stream first. It estimates
final file size at the first memory-full point, because
the number of partitions depends on the final file size.
How about a version of hxput that cannot grow the file?
Or even alter the free map? This has higher concurrency
than hxput, since you only lock the pages of the chain head.
Repeat:
- a pass that inserts/deletes wherever will fit.
- sort by [revbits(hash), seq]; for keys with dup hashes,
check for equal keys and keep the last entry.
- update multiple records in a single pass through
a chain. updates of the freemap are okay.
- resize that estimates how much is needed for the remaining
inserts (use _hxgrow and entire file).
- insert remaining recs, plus any more records that can
be put in the queue?
Sat Jun 23 2012
^ we're back at _hxfind being the biggest part of hxget/hxput.
I'm leery of tackling this, because it makes updates more
complex. The in-record hash index uses 2.5 bytes per record.
Thu Jun 21 2012
- chx stat output gives a silly picture of near-empty files,
showing huge zero-length chains, "single-head" shared pages.
Sun May 20 2012
^ Fast batch updates suggest trying to maximize cache effect,
by sorting updates on reverse_bits(hash(x)).
The trick is to not do TOO much at one point, creating
long overflow chains. The answer requires knowing things
that the public API cannot; hence hxbuild.
The need to maintain a file lock for the duration can go away.
"hxbatch(hp, fp, opts)" that can process a stream of deletes
or a stream of puts: preserve the order of updates
(i.e. discard all but the last update).
opt bits:
- DELETE (else PUT);
- FILE lock, else granular page locks. Locking is granular
for head below split, too, since the object is to allow
concurrent use of the file.
Partition the input on a smaller scale.
First pass is allowed to insert into existing pages
(plus one shared-tail at a time).
Second pass begins by growing file to anticipated size,
after discounting free overflow pages in the map.
Sun May 13 2012
x fix locking btw different threads (using different hp's).
The problem is that fcntl-locks are inode-based, not file-handle based.
fcntl(SETLK) is just plain BROKEN:
qv. http://www.samba.org/samba/news/articles/low_point/tale_two_stds_os2.html
Get this: if the same file is opened twice (two fds),
closing one fd releases all locks held by the OTHER fd. Wow.
That's right. REALLY. So there's no pthread_mutex wrapper that
can save this turkey.
FreeBSD flock is better (fd-based) but has NO granularity.
!!! This means that you cannot hxclose(a) while another HXFILE is holding
a lock ... either because hxhold(b)/hxnext(b) is being used,
or some hx*(b) call is in progress in another thread.
Mon Apr 30 2012
x add hxput(replace) hxput(change) and hxput(new) and hxput(del) to perf_t
Tue Apr 10 PDT 2012
- hxupd is fundamental to everything else. Coverage has problems:
_hxputfreed never calls _hxflushfreed (i.e. double page-free)
135: 223: if (hp->locked & LOCKED_BEYOND && !(hp->locked & LOCKED_BODY))
#####: 224: _hxaddlock(hp, newpg);
Every case in _hxgrow except the simple one (1):
-: 243: switch (_hxshift(locp, oldpg, newpg, bufp, oldp, newp)) {
135: 286: if (oldp->next && !FILE_HELD(hp) && !IS_HEAD(hp, oldp->next))
#####: 288: _hxunlock(locp, oldp->next, 1);
Mon Apr 9 2012
^ now that the loop bug is fixed, the performance of hx vs hxi
pops to the top of the stack. Previous perf tests had been
with 40-50byte records, with about 10-25usec per hxget.
Perf test with 1m.tab (~700b records) shows ~1-2usec "hxget".
So the linear-search has a significant cost, and a hash lookup
WITHIN each page would pay off for small records and be
insignificant for large pages (space overhead probably not
significant, since it's <20 bits/record).
Sun Apr 8 2012
+ fixed bug in _hxsave that created loop in mass loads like
hxbuild. Do not set tail_pgno=bufp->pgno when bufp->next is not zero.
Sat Apr 7 2012
+ making things like RECSIZE into inline functions would make
some asserts more readable :-)
Mon Apr 2 2012
+ hxbuild bug: the logic for weeding out dup records in _store is crap.
What was I thinking? Dup recs create a file that fails hxcheck.
Ironically, hxcheck only catches the dups when they are in different
pages.
Mon Mar 26 2012
+ hxput in hxbuild should not be locking anything; the file
should be under a FILE+BEYOND lock. Unfortunately, it IS
trying to lock individual pages, eventually overrunning
lockv[] and filling crap into hp->buffer,
making it look like hp->buffer.page is set to a non-null pointer,
and hxput's "if (SCANNING(hp) && ...)" blows up.
The problem is with _hxaddlock calls in _hxgrow.
+ when build_t or perf_t fails because 88.tab hasn't been created,
the error is a bit cryptic.
Sun Mar 25 2012
+ hxbuild has a bug: the logic for populating the map pages
blows on an assert in hx_load for any map page past the root.
Correct is to _hxfresh each root page (past the root).
Fri Mar 23 2012
^ implement HX_EXCLUSIVE: no locking, no resize of file
outside the api (hence HXLOCAL can live in HXFILE).
This is the Berkeley dbm model.
Tue Mar 6 2012
- use mremap in _hxremap!
Thu Mar 1 2012
- change HXHASH to uint64_t. See what breaks.
Files so big they need 64bit hashes prolly are too big to
need HXI; HXI is for when the file is mmap'd and HX is
competing with in-memory hash tables for lookup speed.
Fri Jan 27 2012
- make hx threadsafe:
fcntl locking is based on (pid,ino).
- declare a static (ptr to a) map {dev,ino => mutex,refcount}.
- hxopen/hxclose modify the map in the obvious ways.
A pointer to the map entry is stored in the HXFILE.
Assume a linked list is good enough.
Use compare_and_set to ensure the list is okay.
typedef struct _HXMUTEX {
    __dev_t         dev;
    __ino_t         ino;
    pthread_mutex_t mux;
    int             refs;
    struct _HXMUTEX *next;
} _HXMUTEX;
static _HXMUTEX *_hxmutexes;

fstat(hp->file, &st);
_HXMUTEX *mp, *mxp = NULL;
while (1) {
    for (mp = _hxmutexes; mp && !(mp->dev == st.st_dev && mp->ino == st.st_ino); mp = mp->next);
    if (mp) {
        // This is wrong: assume *mp can be freed while doing this.
        while (mp && !compare_and_set(&mp->refs, mp->refs, mp->refs + 1));
        if (mxp) pthread_mutex_destroy(&mxp->mux), free(mxp);
        break;
    }
    if (!mxp) {
        mxp = malloc(sizeof *mxp);
        mxp->dev = st.st_dev;
        mxp->ino = st.st_ino;
        mxp->refs = 1;
        pthread_mutex_init(&mxp->mux, NULL);
    }
    mxp->next = _hxmutexes;
    if (compare_and_set(&_hxmutexes, mxp->next, mxp)) {
        mp = mxp;
        break;
    }
}
// Similarly, deletion is necessary.
- unless I can figure out better granularity,
_hxlock/_hxunlock lock the ENTIRE FILE (the inode)
against any other thread.
Fri Dec 30 2011
- hxshape is the least-covered function.
- _hxgrow() is the most important function for which to improve coverage!
Sun Dec 25 2011
# The whole HXI project is stopped by the hxcheck/hxfix issues.
# HXI will require hxfix to change; best to fix the problems first.
Sat Dec 24 2011
# hxcheck.c:xor_map is an obscure way to write getbit/setbit.
+ the only failing test now is corrupt_t
^ add to hxfix: if a chain is >HX_MAX_CHAIN, break it.
Breaking a chain means following (and breaking) every link.
However, if the tail page is shared, breaking the link is
insufficient.
^ problem #0 in hxfix: only detects dup recs in adj pages of a chain (previously known)
- problem #1 in hxfix: when a link is broken, every link
beyond that in the chain must also be broken,
else they will not get redone.
- problem #2 in hxfix: if there are dup recs in a chain,
hxfix doesn't fix it AT ALL. hxfix must break the link to the
page containing the dupkey of a rec earlier in the chain,
forcing the unreferenced pages to be redone.
- problem #3 in hxfix: if a dup rec is in a shared tail page,
breaking the link to that page is not enough to redo it.
You have to selectively edit the record out of the tail page
and into the spool file. Would running hxfix twice catch this?
Extend the scan for dup recs:
Allocate HX_MAX_CHAIN buffers in locp.
For each chain:
Create an in-memory hash hrec[ceillog2(hxmaxrec(hp)/HX_MIN_REC*HX_MAX_CHAIN)]
cells: (hash,pgno,offset).
Note that the loop-check cuts any loop in vnext[], even if !REPAIRING.
So this is safe to traverse.
For each page, for each rec:
if hash not in hrec[]
hrec += (hash,pgno,offset)
            elif the stored page# is the same,
For each rec in curr, check whether it is in prev.
If so, cut off immediately,
elif its hash is in duphash[] for duphash[x].pgno != prev.pgno
then break prev link to curr and exit
else duphash += (hash,pgno,offset)
If you didn't reach the end of chain, bad_dup_recs++.
If !REPAIRING, bail out of the entire loop?
This looks suspiciously like it will leave a tail page
with a dup record in it that does not belong to the remaining
chains pointing to the tail page. URRRRRRR.
Fri Dec 23 2011
x check_data_page:308 catches (next) pointing at a head page,
but does not catch (next) pointing at a map page.
This may explain Tim's bug.
- Tim's bug indicates need for a lot more thrashy testing
on files in the <2MB range.
Tue Dec 20 2011
+ looks like next_t has a similar HX_MAX_CHAIN problem.
Side issue is that hxnext doesn't notice that the
chain is too long... it could be trapped in an infinite loop
by a corrupt file.
^ check_t and corrupt_t have problems, not sure what.
May be just brittle tests.
- ensure that hxput/hxbuild/hxshape never creates a chain > HX_MAX_CHAIN.
Mon Dec 19 2011
+ no need to do expensive REVBITS since, for a 4K page,
it makes no difference for any file less than 64GB,
and even then, records have to be <8 bytes!
# okay basic_t:136 'bug' is that chain length exceeds maxloops.
^ HXPAGE.recs changes
- _hxappend? increments
- _hxshift decrements srcp
- chx hdrs/dump reports
- hxcheck validates: recs>used, recs<used
- unit test validates hxfix/hxdump
Fri Dec 16 2011
+ aha! a better way: make _hxhead do the revbits().
+ bug: basic_t:136: hxput('1---') corrupts tmp.hx.
Not actually a bug: bad test.
- add a diagnostic mode that does no locking,
allowing read-only operations (i.e. chx)
to inspect a file partway through an update.
Tue Dec 13 2011
+ make pgrate a const 4
+ update (pgrate >>> version) to 0x0100 ... i.e. invalid >MAXPGRATE
+ store revbit(hash) instead of (hash)
+ add nrecs to HXPAGE/HXBUF.
^ add reindex-flag, and reindex to _hxsave (and _hxbuild and _hxfix)
Set flag in _hxshift etc.
^ _hxfind uses revhash and index
^ hxput does incremental index update
x any other performance enhancements?
Thu Dec 8 2011
^ Major change for hxi: store each rec's REVBIT(HXHASH) in the page, instead of its HXHASH.
This means that resizing the in-page hash table doesn't have to do a REVBIT
for every entry relocated or reindexed.
It makes the _hxsplit call in _hxgrow slightly different,
but otherwise does not change anything.
- change _hxmask to use (msutil.h) fls.
    mask(n | n >= 1) :: min(x | x & (x+1) == 0 and x >= n-1) = (1 << fls(n-1)) - 1
^ when scanning a memory range of locations to move (compact) when hxput needs
more space for a record, any entry followed by a zero is in the correct place,
and doesn't need any hash manipulation to calculate where it belongs
(subtract xwid/2 from it).
assert(xmax > 0)
for (i = n; i >= xmax; --i)
if (xtab[-i-1] == 0) insert(xtab[-i], i-xwid/2), --i
else j = xpos(*(HXHASH*)(page + xtab[-i]), xwid)
insert(xtab[-i], j)
Tue Nov 29 2011
+ major bug on x86_64, starting with:
All std types (HXHASH, PAGENO) must be converted to fixed types (uint32_t etc).
+ _hxload:257: display (pgsize,pgrate) instead of "next=bigweirdnumber"
+ chx dump doesn't dump maps > root?
+ change hxcheck to take a single argument. The remainder are only relevant to hxfix.
- is there any point to locking bitmap pages other than the root?
Sat Nov 26 2011
+ use HXROOT.pgrate field as a version number; make pgrate = 2*2 permanently.
- add hxupgrade(hp) which converts type-4 to type-5 (hxi) files.
Use _hxtemp instead of taking an (fp) parameter.
Fri Nov 18 2011
x hxi: perhaps NOT use the entire space beyond (used) as an index, just 5*recs/4+1 shorts.
Does this make things simpler? Most inserts don't relocate hash entries.
OTOH growing an open-addressed hash table is not simple: every time you "split" a cell,
AND the occupant moves, you have to do the same for every cell beyond it
up to the next empty cell. This creates edge cases when (recs) is small.
Best to have an index system that is unit-testable.
Sat Nov 12 2011
x reduce the number of munmap/mmap calls due to filesize changes.
Pageno can be calculated knowing the file size, whether or not it is all mapped;
but the page only has to be mapped if you want to read/write it!
This suggests forcing _hxremap only after ftruncate, or when
_hxfresh/_hxload/_hxlink/_hxmove need to address a page beyond mlen.
This change should be effective, when the file is large,
and the probability of hitting the most recently-grown part of the file
is small.
Currently _hxremap uses munmap/mmap. Probably more efficient is to FIRST
try to force-map the grown part of the file to the memory beyond the
currently-mapped part of the file. If that fails, munmap (or multiple
pointers in HXFILE, with more complex MAP_BUF logic) is required.
# Although this seemed like a good idea, in practice it does nothing.
You can't safely just map the grown segment, because mmap actively
replaces anything mapped into the fixed address range you must specify.
And trying to map the file while the previous map exists will always
create a new address range. About all you can do is optimize
"shrink", which almost never happens.
Thu Nov 3 2011
^ hxi revisited:
- add (ushort recs) to HXPAGE, HXBUF
- compute it/update it in hxput and _hxappend.
(Only) hxshape uses _hxappend to copy multiple records at a time.
Workaround:
_hxload(locp, srcp, atail->pgno);
_hxload(locp, dstp, ztail->pgno);
_hxappend(dstp, srcp->data, srcp->used);
dstp->recs += srcp->recs - 1;
- make hxfix check recs<used, recs>used
  - change FITS to allow (5*sizeof(short)/4 * 2 + 2) bytes extra for hash index space.
- add HXBUF fields (mask, shift) computed from (recs)
- add REINDEX bit to STAIN; recompute index from scratch in _hxsave
- make hxfix check index (rather than checking for all-zero bytes > used)
- use index for lookups
_hxload computes HXBUF.(xmask,xbits)
- perftest use of index for hxget on MMAP/!MMAP
- incremental index update:
In the following cases, first phase would just set the REINDEX
If _hxshift is called (and this buffer is either source or destination)
the REINDEX bit is set, and no incremental work is done.
- SHRINK (increase recs and used)
This is the case when hxput expands or adds a record.
It also means that hxshift is not going to be called
with this page as a source or target.
It is complicated by the scan that adds (delta)
to every entry > pos in the index.
Before copying data on top of index entries....
store (pos,delta)
If entry at &data[used] is nonzero, move
entries starting at &data[DATASIZE]-2 to fixup[],
replacing them with zeroes, until you reach a zero entry.
Push nonzero index entries about to be overwritten.
- EXPAND (decrease recs and used)
This could be followed by _hxshift into this buffer from
its successor (if any). So perhaps not worth doing unless
bp->next is zero. So, assuming bp->next is zero
If entry at &data[used] is nonzero, move entries
at end to fixup[] (zeroing index entries)
Do the same for entries starting at split point.
** These ranges may overlap.
Tue Nov 1 2011
X no coverage at all for recursive split!
That may be the cause of the Oct30 coredump
Nope, that was just a make failure.
Sun Oct 30 2011
- coredump in mamelog.hx update due to CHECK assert in _hxsave:
two MODIFIED buffers with same pageno. Unfortunately, this
is too late to determine WHY it got to that state.
+ Added asserts in _hxload to catch when a page is dirty in one buffer
and also read in another.
- Should do some similar check every time (pgno,flag) changes in any buffer.
Wed 18 May 2011
- hxbuild coverage (aside from LEAVE(HXERR_<syscall>)
- entire input fits in mem (the single-_store case)
- file with >1 map page
- input with duplicate keys
- an error in hxbuild requiring hxleave to munmap(tmpmap)
- hx.c coverage
- hxhold("alpha:bet") then hxput("azbuk:va")
- ?? force an mmap (remap) to relocate to a different starting page?
thus forcing active buffer page ptrs to need adjusting.
^ hxupd.c coverage
- the whole of case 3 in _hxgrow, including recursion
^ hxlox coverage
- the race condition where the file changes size between
the first size calc and the second (after the lock).
Mon 28 Mar 2011
- drop DISK_PGSIZE being a constant in hxbuild.
This means manually aligning vb[] to st.st_blksize of the tmpfile.
Mon 28 Feb 2011
^ okay t.perf shows that _hxfind is the biggest cost by far in hxget;
about half. This was using 4K pages and ~16b records => 22b.
Having a binary search rather than a linear search would make
this faster with (relatively) minor changes to the rest of hx.
What else would work? Would it require grouping the hash codes
along with the offsets to make it RAM-efficient?
Or move the WHOLE header into 8-byte array elements?
This becomes a lot less recoverable, since scribbling on the array
effectively destroys the records.
^ THE IDEA: use the space beyond (used) as an open hash table
keyed by the top byte of record hashes. The hash table is like
a linear-hash that is more often in a state of contraction.
Each time (used) changes, a small set of table entries have
to be rehashed.
^ This IMPLIES: need for a record-count field, so the page head
changes. So make PGRATE a constant and use the pgrate header
field as a version number. The version-5 header includes a
record count.
- implement chx save so that it can dump version-4 files,
but do nothing else with them.
+ FITS gets a delta-recs flag (>=0)
+ DECIDE on a hash overflow algorithm.
Hopscotch is overkill. 5/4 space means happy linear overflow.
That keeps resizing simple.
+ use all available (hxmaxrec-used) bytes as the hash table
The advantage of (b) is that, most of the time, you only
look at one entry (no overflows) because the hash table
is pretty sparse.
Thu 24 Feb 2011
x bug in hxbuild? No: ought to pay attention to the return code
(BAD_RECORD) because input records were >4KB.
Wed 23 Feb 2011
x add an opportunistic record index?
NO, an offset array at the end of the page, sorted by hash, gives
a very modest increase in speed, and maintaining it becomes a major