
isdown
18.11.2016
09:19:17
or split the 8 columns into 4 blocks?
can anyone help me?

Vitaliy
18.11.2016
09:29:14
In the described case a single block containing all 8 columns will be created. The block will contain a header (the number of rows and column datatype info) and the columns' data (data for the 1st column, then the 2nd, and so on). Since you have only one row, the data for each column will contain a single element (Int, String, ...)
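A rough sketch of that layout in Python (illustrative only — the field names here are invented for the example, not ClickHouse's actual internal format):

```python
# Toy model of a block: a header plus per-column data arrays.
# Names ("header", "num_rows", etc.) are made up for illustration.

def make_block(columns):
    """columns: dict of column name -> (datatype, list of values)."""
    n_rows = len(next(iter(columns.values()))[1])
    header = {
        "num_rows": n_rows,
        "column_types": {name: dtype for name, (dtype, _) in columns.items()},
    }
    data = {name: values for name, (_, values) in columns.items()}
    return {"header": header, "data": data}

# One row across the columns -> one block, each column holding a single element.
block = make_block({
    "user_id": ("UInt64", [42]),
    "event":   ("String", ["click"]),
})
```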

isdown
18.11.2016
09:39:00
Thank you

Vitaliy
18.11.2016
09:41:55
Worth noting that a big table will be stored as a set of blocks with the same structure, but their length (i.e. number of rows) can vary

isdown
18.11.2016
09:46:23
How do I determine how many columns and how many rows should be put in a block?
I found the block size in the config file, but I don't know how to balance the number of columns and the number of rows in a block.
If I save a million rows of data in a table, what is the structure of the blocks?

Vitaliy
18.11.2016
10:05:02
Could you describe your task more precisely? The columns are determined by the structure of your data (the schema). The ClickHouse engine tries to automatically choose the optimal length of blocks inside its internal pipelines, but you can adjust it via the max_block_size and max_insert_block_size config parameters.
The optimal values of these parameters depend on the particular task. The best strategy for choosing them is to test the performance of your use case with different parameter settings.

isdown
18.11.2016
10:39:11
https://gist.github.com/sunisdown/5901db41db8d5aaacf05432a6274db58 I have 49 columns like this; I want to know the structure of a block, and why.

Виктор
18.11.2016
10:41:28
Why do you need that?
Actually, your structure does not affect the block size
Block size is an internal thing, decided according to performance reasons
By default it's 8192, and that's best for most cases
Ah, sorry, it's 65536
Do you have any performance interest in that?

isdown
18.11.2016
10:52:44
yep

Виктор
18.11.2016
10:53:39
What's ColumnA here?
And what's 'row'?

isdown
18.11.2016
10:55:34
Block 0 stores just one column, just like TinyLog

Виктор
18.11.2016
10:56:49
ClickHouse does not store data like that
Blocks are perpendicular to columns
Block size is pretty simple: it's just how many rows will be loaded and processed as a whole
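That "perpendicular" relationship can be sketched like this — data lives per column, and a block is just a horizontal slice of block-size rows taken across the columns being read (a toy model, not the real reader):

```python
# Data is stored per column; a "block" is a slice of `block_size` rows
# cut across whichever columns the query reads. Invented names throughout.

storage = {
    "a": list(range(10)),             # column 'a', 10 rows
    "b": [x * 2 for x in range(10)],  # column 'b', 10 rows
}

def read_blocks(storage, columns, block_size):
    n_rows = len(storage[columns[0]])
    for start in range(0, n_rows, block_size):
        yield {c: storage[c][start:start + block_size] for c in columns}

blocks = list(read_blocks(storage, ["a", "b"], block_size=4))
# 10 rows with block_size=4 -> three blocks of 4, 4 and 2 rows.
```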

isdown
18.11.2016
10:59:07
Thank you

Виктор
18.11.2016
10:59:34
Hope that was helpful :)

isdown
18.11.2016
10:59:41
I got it
is a block like in a row-store database?

Виктор
18.11.2016
11:04:46
Nope
Data is stored as columns
But when you need to read and process the data
Let's say you're processing 2 columns with 1,000,000 rows
Then the data will be read from disk and processed in blocks of 'block size'
So it will read 65536 rows from the 2 separate columns, and then they will be processed in one process call
We need that because we can't process all the data at once, so processing has to be divided
Better now?
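For those numbers, the chunking works out as below (a quick arithmetic sketch, not real ClickHouse code):

```python
import math

# 1,000,000 rows processed in chunks of max_block_size = 65536 rows.
rows = 1_000_000
max_block_size = 65536

num_blocks = math.ceil(rows / max_block_size)          # process calls needed
last_block = rows - (num_blocks - 1) * max_block_size  # rows in the final chunk
# -> 16 blocks: 15 full blocks of 65536 rows and a final block of 16960.
```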

isdown
18.11.2016
11:18:36
Is row data the smallest unit inside a block, or is row data split into columns that are combined inside different blocks?

Виктор
18.11.2016
11:26:34
Data is always split into columns; it's a column-oriented store

isdown
18.11.2016
11:28:32
yep, my doubt is how it is split, i.e. how different columns are combined inside a block

Виктор
18.11.2016
11:29:36
those are not dependent things

Виктор
18.11.2016
11:30:19
what matters is how many columns you use in queries
and what the usual select size is in terms of rows
If it's small, maybe you should lower the block size
And btw, we're talking about the max_block_size setting, right?

isdown
18.11.2016
11:31:19
Nope.

Виктор
18.11.2016
11:34:11
Ugh, so what are we talking about? =)

isdown
18.11.2016
11:34:39
My doubt is how it is split, i.e. how different columns are combined inside a block

Виктор
18.11.2016
11:36:42
What?
To split when?
When you insert data?

isdown
18.11.2016
11:37:16
yep

Виктор
18.11.2016
11:37:45
Ah, that's different
There is a parameter:
max_insert_block_size
It's around a million by default, and that's fine
So you can insert data up to this parameter and that will be totally fine
And again it's totally unrelated to columns
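Conceptually, a large insert gets cut into blocks of at most max_insert_block_size rows, something like this sketch (the 1_048_576 figure is an assumption standing in for "around a million" — check your server's actual default):

```python
# Sketch of splitting an INSERT into blocks of at most
# max_insert_block_size rows. Illustrative only.

MAX_INSERT_BLOCK_SIZE = 1_048_576  # assumed "around a million" default

def insert_blocks(rows, max_rows=MAX_INSERT_BLOCK_SIZE):
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]

sizes = [len(b) for b in insert_blocks(list(range(3_000_000)))]
# 3,000,000 rows -> blocks of 1048576, 1048576 and 902848 rows.
```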

isdown
18.11.2016
11:43:58
When I insert data,
it will be split and then stored into different blocks.
Is what is written to each block a single column, or a combination of columns?

Fike
18.11.2016
11:44:37
as far as I've understood, columns are stored separately from each other, and the discussed block settings don't relate to storage directly

isdown
18.11.2016
11:46:32
Do blocks not store data? Are they just intermediates for queries?

Fike
18.11.2016
11:47:37
(if, again, I understand everything correctly) each column is a separate storage entity, and when you store a record, it is split into columns, and each column receives a new entry like {id: <record id>, value: <record value>}. If that's correct (feel free to correct me if I'm wrong), there is no such term as 'block' at the column level; it's just a stream of such values.
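That description, taken literally, looks something like this toy sketch (Fike's own caveat applies — this is a mental model, not ClickHouse's actual on-disk format):

```python
# Toy version of "each column is its own stream": storing a record
# appends one {id, value} entry to every column's stream.

from collections import defaultdict

column_streams = defaultdict(list)

def store_record(record_id, record):
    for column, value in record.items():
        column_streams[column].append({"id": record_id, "value": value})

store_record(1, {"user": "alice", "clicks": 3})
store_record(2, {"user": "bob", "clicks": 5})
# column_streams["user"] now holds alice's and bob's entries, in order.
```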

isdown
18.11.2016
11:52:32
Thank you very much

Roman
18.11.2016
11:55:31
@the_real_jkee is it correct to consider such 'blocks' as transactions for inserts?

Виктор
18.11.2016
12:10:20
Yes, absolutely
snapshot isolation guarantees work inside these 'blocks'
So if you upload up to 'max_insert_block_size' of data, it is guaranteed to be stored as a whole (or you will get an error)
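The all-or-nothing behavior per inserted block can be modeled like this (a sketch of the guarantee only, not ClickHouse's implementation):

```python
# Per-block atomicity: a block is validated first and then applied
# as a whole; a failure leaves the table untouched by that block.

table = []

def insert_block(rows):
    # Validate every row before touching the table.
    for row in rows:
        if not isinstance(row, dict):
            raise ValueError("bad row; nothing from this block is stored")
    table.extend(rows)  # the whole block lands at once

insert_block([{"x": 1}, {"x": 2}])   # succeeds: 2 rows stored
try:
    insert_block([{"x": 3}, "broken"])  # fails: 0 rows from this block stored
except ValueError:
    pass
```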

Roman
18.11.2016
12:12:59

Виктор
18.11.2016
12:13:34
Nope, please do not do that =)
That's the only case where we can say 'it's a transaction'

Roman
18.11.2016
12:14:28
i guess that's enough for an analytical rdbms

Виктор
18.11.2016
12:15:39
It is

Roman
18.11.2016
12:18:21
think a bit about how to describe it better in promo materials; all the guys I know who have heard about YCH think it has absolutely no transactions, even for bulk inserts

Виктор
18.11.2016
12:21:02
ah, okay
thanks for feedback
there is also a question about distributed transactions
should that transaction mean durability in terms of server loss or not
Which is actually not so difficult, but hard to explain

Roman
18.11.2016
12:28:25

Виктор
18.11.2016
12:28:42
I mean, what are the guarantees when you insert data
For example, you have a replica set of 3

Виктор
18.11.2016
12:29:06
Your data is replicated to 3 nodes
And you insert data to one node
Should it be 'at least 2', or is 'one' enough?

Roman
18.11.2016
12:29:36

Виктор
18.11.2016
12:30:02
OK is enough; the question is what this OK means =)

Roman
18.11.2016
12:30:05

Виктор
18.11.2016
12:30:16
Nope, that's different
The replication factor is how many times your data is copied

Fike
18.11.2016
12:30:32
we had a conversation like this here
we're about to break hell loose

Roman
18.11.2016
12:31:12

Виктор
18.11.2016
12:31:39
Generally there are 2 options: async replication and sync replication
async is the default in ClickHouse, and it means 'insert into one node is OK'

Roman
18.11.2016
12:32:06
so if a snapshot is saved by 'replication factor' nodes, this insert is really ok

Виктор
18.11.2016
12:32:11
sync replication is usually really slow
It's not. It's OK in terms of guarantees, but the slowdown is easily 10x
So in most cases async replication is totally fine.
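The trade-off between the two acknowledgement styles can be caricatured like this (invented names and behavior, purely to illustrate when the client's OK arrives):

```python
# Toy contrast: async acks once one node has the block; sync-style
# waits until a write quorum of nodes has it before returning OK.

def insert_async(replicas, block):
    replicas[0].append(block)  # one node has it -> client gets OK immediately
    return "OK"                # remaining replicas catch up in the background

def insert_sync(replicas, block, quorum=2):
    written = 0
    for r in replicas:
        r.append(block)        # each write is waited on (the slow part)
        written += 1
        if written >= quorum:
            return "OK"        # OK only once `quorum` nodes have the block
    return "FAILED"

replicas = [[], [], []]
insert_async(replicas, "b1")
# After the async OK, only replica 0 is guaranteed to hold "b1".
```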

Roman
18.11.2016
12:33:04
is it possible to track 'insert block ids' and their statuses
to answer clients asynchronously if they request this status?