forked from besquared/redis-datastructures
-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
119 lines (97 loc) · 4.83 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
PartitionedTable
A partitioned table stores all of its rows into 'buckets'
which are defined by specifying a set of fields to partition on.
Exact matches on any subset of the partitioned fields can then
be executed quickly using redis' glob capability.
The performance is good when using the 'load' command since it saves
on having to marshal many small objects by bulk marshaling an entire
set of rows at one time. This property makes partitioned tables best
for bulk loading and reading.
Preliminary Benchmarks
class ConversionRates < PartitionedTable::Base
name "benchmark_conversion_rates"
fields :datasets, :versions, :conversions, :hours, :totals, :successes
partitions :datasets, :hours
end
Mac Mini
OS X 10.5 Leopard
1.66 GHz Intel Core Duo (2MB L2 Cache)
2GB 667 MHz DDR2 SDRAM
Ruby 1.8.6 patchlevel 114
Redis-rb
Loading 50,000 rows
1.280000 0.070000 1.350000 ( 1.429340)
--------------------------------------------------
Finding all purchases
Length: 16667
0.120000 0.020000 0.140000 ( 0.175329)
--------------------------------------------------
Finding all purchases during hour 100
Length: 70
0.000000 0.000000 0.000000 ( 0.001119)
--------------------------------------------------
Finding purchases for 336 hours
Length: 7835
0.070000 0.030000 0.100000 ( 0.214801)
--------------------------------------------------
Finding 336 hours of data
Length: 23504
0.250000 0.070000 0.320000 ( 0.501853)
--------------------------------------------------
Finding all data
Length: 50000
0.510000 0.050000 0.560000 ( 0.673403)
--------------------------------------------------
After building about 10 different versions of in memory and redis backed
data structures I need to write down my overall thinking regarding this.
The main questions are
1) What is the purpose?
2) When and how will the data be loaded?
3) When and how will the data be accessed?
1) What is the purpose?
My initial goal was something like an "OLAP cube" which would allow me to
pivot around certain dimensions and display their related measures or other
quantities related to them.
2) When and how will the data be loaded?
The data is not operational, it is historical. This is built to support
advanced reporting interfaces. The data is loaded offline on a schedule and
is generally not streamed in real time in any way. This means that load times
are less important than read times. It also means there will be a lot of data,
distribution of keys is important in order to take advantage of the distributed
nature of key value stores. Structures that put all of the data into one list
are not a good choice.
3) When and how will the data be accessed?
The data is accessed in "real time", people are seeing loaders, it needs to be
fast. The less requests we have to do the better because network latency is our
enemy here and our data backend is written in C. The less processing of data we
have to do to get our results in a usable state the better.
Given these requirements there are a few conclusions
1) Do things in bulk where it's possible
It's faster to deal with a 1Kb string than to deal with 1024 1 byte strings
2) Do things with the least number of keys that still provide "distribution"
Pretend every key lookup is lost revenue, this is close to the truth
3) Do things directly rather than with indirection
The benefits of indirection are often overshadowed by the overhead involved
in multi-stage processes, especially across networks. A corollary to this is
do things in one place only.
------------------------------------------------------
The partitioned storage strategy is as good as we can do
No duplication of data. The partitioning splits the set of rows into
non-overlapping subsets. This ensures that we aren't wasting space in
our in-memory system. It also ensures that we don't waste load or read
time on duplicate data.
Partitioning creates *just* the right amount of keys. The same as the
cross product of all of the values of the fields we are partitioning. For
instance if we are partitioning on 3 fields each with a cardinality of 100
then the total number of keys we'll create is 1,000,000. This means lookups
can be distributed as soon as we have a single partition field.
Exact matching across all partitions in constant time. If we're looking
for rows and we know the values of all of our partitions ahead of time then
we can get the entire set of rows we care about in just 1 key look up.
This also means that if we're looking for two exact matches we can do this
in 2 key lookups. This is as good as we can get without duplicating data or
creating more indirection to store multiple keys with a single row.
In this sense, the partitioned storage strategy is optimal for many cases
where exact matches are necessary. Range matches can be synthesized using
exact matches by simply asking for one exact match for each value in some
specified range of values.