MᏫᎻᎯᎷᎷᎬᎠ
It's my way of asking questions
Ludovic 'Archivist'
it means apply a transformation to every element of the dataset
Ludovic 'Archivist'
reduce is equivalent to std::accumulate
Ludovic 'Archivist'
To where?
In the case of Hadoop, to the stream
Ludovic 'Archivist'
it can also be used to filter in Hadoop
MᏫᎻᎯᎷᎷᎬᎠ
reduce is equivalent to std::accumulate
Yeah Transform and gather all of them
Ludovic 'Archivist'
exactly the mentality
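To make the "transform, then gather" idea concrete, here is a minimal Python sketch. The `values` list is made up for illustration, and `functools.reduce` stands in for `std::accumulate` as the closest Python analogue; this is not how Hadoop itself is invoked.

```python
from functools import reduce

# Hypothetical dataset, purely for illustration.
values = [1, 2, 3, 4]

# "map": apply a transformation to every element of the dataset.
squared = list(map(lambda x: x * x, values))

# "reduce": fold the transformed elements into a single result,
# starting from an initial value (like std::accumulate's init argument).
total = reduce(lambda acc, x: acc + x, squared, 0)

print(squared, total)
```

The map step touches each element independently, which is what makes it easy to spread across machines; only the reduce step needs to combine results.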
Ludovic 'Archivist'
Cassandra is a distributed database that allows you to use a SQL-like query language (CQL) on distributed tables
MᏫᎻᎯᎷᎷᎬᎠ
Or filter while searching the data?
MᏫᎻᎯᎷᎷᎬᎠ
yes
Why would someone do that?!
Ludovic 'Archivist'
Hadoop is many bad things (mainly, it is not always the best solution), but at the very least it is flexible
MᏫᎻᎯᎷᎷᎬᎠ
Why would someone do that?!
Like it's heavy
Ludovic 'Archivist'
Why would someone do that?!
let's say that you have a dataset of customers (like, 10M customers) and want to select all the customers that made purchases higher than $5000, give them a discount, and get the total amount of discount you gave
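The customer example could be sketched roughly like this in Python. The records, the 10% rate, and the names `customers`, `discount_for`, and `DISCOUNT_PERCENT` are all assumptions for illustration; the chat never says how large the discount is. Amounts are whole dollars to keep the arithmetic exact.

```python
from functools import reduce

# Hypothetical records: (customer_id, purchase_amount_in_dollars).
customers = [(1, 7000), (2, 1200), (3, 5000), (4, 9000)]

THRESHOLD = 5000        # from the example: purchases higher than $5000
DISCOUNT_PERCENT = 10   # assumption: the discount size is not given in the chat

def discount_for(record):
    """Map step: return the discount for a qualifying customer, else None."""
    _, amount = record
    if amount > THRESHOLD:
        return amount * DISCOUNT_PERCENT // 100
    return None

# Filtering happens as part of the transformation: non-qualifying
# customers simply produce nothing.
discounts = [d for d in map(discount_for, customers) if d is not None]

# Reduce step: total discount given.
total_discount = reduce(lambda acc, d: acc + d, discounts, 0)

print(discounts, total_discount)
```

Customer 3 spent exactly $5000, which is not "higher than" the threshold, so only customers 1 and 4 contribute.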
Ludovic 'Archivist'
10M is large
We don't call it big data for show
MᏫᎻᎯᎷᎷᎬᎠ
So we filter them while transforming
MᏫᎻᎯᎷᎷᎬᎠ
Then
MᏫᎻᎯᎷᎷᎬᎠ
Filtering while searching is better right?
Ludovic 'Archivist'
It is not better but it is good
MᏫᎻᎯᎷᎷᎬᎠ
I imagine a transforming filter as gathering the whole dataset and filtering it along the way
Ludovic 'Archivist'
databases generally prefer to index the data to have it relatively presorted
MᏫᎻᎯᎷᎷᎬᎠ
that is exactly what it is
So If you wanna filter a dataset in this way You'll have to get them all first
MᏫᎻᎯᎷᎷᎬᎠ
Like collecting 100G of users and then filtering them, what??
Ludovic 'Archivist'
So If you wanna filter a dataset in this way You'll have to get them all first
and this is why you need big ass servers with hundreds of GB of RAM
MᏫᎻᎯᎷᎷᎬᎠ
But still I see searching filter is better
MᏫᎻᎯᎷᎷᎬᎠ
It will mostly fit in normal RAM
Ludovic 'Archivist'
But still I see searching filter is better
Prefiltered data is even better
Ludovic 'Archivist'
but prefiltered data is harder to parallelize operations on
MᏫᎻᎯᎷᎷᎬᎠ
Prefiltered data is even better
It's the same as searching filter, right?
Ludovic 'Archivist'
It's the same as searching filter, right?
no, it is the same as a Database index
Ludovic 'Archivist'
so a sorted collection of the keys in a database
MᏫᎻᎯᎷᎷᎬᎠ
no, it is the same as a Database index
Hold on a minute, a database index isn't an algorithm!!! It's the name of some static thing
Ludovic 'Archivist'
exactly
Ludovic 'Archivist'
it is a premade search system
MᏫᎻᎯᎷᎷᎬᎠ
exactly
Searching filter is an algorithm
Ludovic 'Archivist'
it allows you to search the data instantaneously without much overhead
Ludovic 'Archivist'
and cherrypick the data without filtering it explicitly
Ludovic 'Archivist'
yes, like an actual book index
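As a rough illustration of "a sorted collection of the keys", here is a small Python sketch using the standard `bisect` module. The `rows` table and the `lookup` helper are hypothetical, and real database indexes typically use tree structures rather than a flat sorted list.

```python
import bisect

# Hypothetical unsorted table of (key, payload) rows.
rows = [("mia", 31), ("bob", 25), ("zoe", 40), ("al", 19)]

# Build the index once: keys in sorted order, each with its row position.
index = sorted((key, pos) for pos, (key, _) in enumerate(rows))
keys = [key for key, _ in index]

def lookup(key):
    """Binary-search the sorted keys instead of scanning every row."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[index[i][1]]
    return None

print(lookup("zoe"), lookup("eve"))
```

The cost is the one mentioned in the chat: the index has to be built and kept in memory alongside the data, and updated whenever rows change.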
Ludovic 'Archivist'
and the bloom filter i mentioned earlier is a way to know if a key is or is not in a database (oversimplification)
Ludovic 'Archivist'
it takes a considerable amount of memory to build and maintain one
Ludovic 'Archivist'
no
MᏫᎻᎯᎷᎷᎬᎠ
Ohh
Ludovic 'Archivist'
it is basically a structure that tells you if a key is definitely not in the database
MᏫᎻᎯᎷᎷᎬᎠ
Just in a quick way Instead of searching the whole database
Ludovic 'Archivist'
Just in a quick way Instead of searching the whole database
you try to match your key against the filter: if it matches, the data may be in the database; if even one point doesn't match, the data definitely isn't
MᏫᎻᎯᎷᎷᎬᎠ
Of course it's in the database
Ludovic 'Archivist'
Of course it's in the database
no, but it likely is in
MᏫᎻᎯᎷᎷᎬᎠ
Of course it's in the database
Unless you mean the key is deleted while matching the "filtered data key" with my key
Ludovic 'Archivist'
it is not a data structure that tells you if the data is in
Ludovic 'Archivist'
just if it isn't
Ludovic 'Archivist'
it can have false positives but no false negatives
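A minimal Bloom filter sketch in Python may help here. The class name, the `might_contain` method, the default sizes, and the choice of seeded SHA-256 as the hash are all assumptions for illustration; production implementations pick the sizes and hash functions much more carefully.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash positions per key."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False: the key was definitely never added.
        # True: the key may have been added (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("yes")
print(bf.might_contain("yes"))
```

A key that was added always has all of its bits set, which is why there are no false negatives. Different keys can share bit positions, which is where false positives come from.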
MᏫᎻᎯᎷᎷᎬᎠ
This is encrypted xD
MᏫᎻᎯᎷᎷᎬᎠ
Anyway
MᏫᎻᎯᎷᎷᎬᎠ
Thanks Ludo
Ludovic 'Archivist'
This is encrypted xD
let's get a simple example of the concept
Ludovic 'Archivist'
you have 5 vowels a(0) e(0) i(0) o(0) u(0)
Ludovic 'Archivist'
you add the word "yes" in the set
Ludovic 'Archivist'
you add the word "yes" in the set
upon adding the word, you have yes that matches 'e'
Ludovic 'Archivist'
so you change the filter to a(0) e(1) i(0) o(0) u(0)
MᏫᎻᎯᎷᎷᎬᎠ
Yeah
Ludovic 'Archivist'
you want to check if 'add' is in the set
Ludovic 'Archivist'
add matches a
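The vowel example can be played out in a few lines of Python. This is only a sketch of the toy scheme from the chat, with hypothetical helper names (`vowel_bits`, `add`, `might_contain`). It applies the rule stated earlier: if even one of the word's bits is not set in the filter, the word is definitely not in the set.

```python
VOWELS = "aeiou"

def vowel_bits(word):
    """One bit per vowel: 1 if the word contains that vowel, else 0."""
    return {v: int(v in word) for v in VOWELS}

# Empty filter: a(0) e(0) i(0) o(0) u(0)
filter_bits = {v: 0 for v in VOWELS}

def add(word):
    for v, bit in vowel_bits(word).items():
        if bit:
            filter_bits[v] = 1

def might_contain(word):
    # Every vowel the word uses must already be set in the filter.
    return all(filter_bits[v] for v, bit in vowel_bits(word).items() if bit)

add("yes")                      # filter becomes a(0) e(1) i(0) o(0) u(0)
print(filter_bits)
print(might_contain("add"))     # 'a' is still 0, so "add" is definitely not in the set
print(might_contain("bed"))     # only needs 'e', so the filter says "maybe"
```

"bed" was never added, yet the filter answers "maybe" for it: that is a false positive. "add" can be ruled out with certainty without looking at the stored words at all.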