
isdown
18.11.2016
09:19:17
or split the 8 columns into 4 blocks?
can anyone help me?

Vitaliy
18.11.2016
09:29:14
In the described case a single block containing all 8 columns will be created. The block will contain a header (the number of rows and column datatype info) and the columns' data (data for the 1st column, then the 2nd, and so on). Since you have only one row, the data for each column will contain a single element (Int, String, ...)
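A rough sketch of that layout in Python (illustrative only — the field names here are invented for the example, not ClickHouse's actual internal format):

```python
# Toy model of a block: a header plus per-column data arrays.
# Names ("header", "num_rows", etc.) are made up for illustration.

def make_block(columns):
    """columns: dict of column name -> (datatype, list of values)."""
    n_rows = len(next(iter(columns.values()))[1])
    header = {
        "num_rows": n_rows,
        "column_types": {name: dtype for name, (dtype, _) in columns.items()},
    }
    data = {name: values for name, (_, values) in columns.items()}
    return {"header": header, "data": data}

# One row across the columns -> one block, each column holding a single element.
block = make_block({
    "user_id": ("UInt64", [42]),
    "event":   ("String", ["click"]),
})
```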

isdown
18.11.2016
09:39:00
Thank you

Vitaliy
18.11.2016
09:41:55
Worth noting that a big table will be stored as a set of blocks with the same structure, but their length (i.e. number of rows) can vary

isdown
18.11.2016
09:46:23
How do I determine how many columns and how many rows should be put in a block?
I found the block size in the config file, but I don't know how to balance the number of columns and the number of rows in a block.
If I save a million rows of data in a table, what is the structure of the blocks?

Vitaliy
18.11.2016
10:05:02
Could you describe your task more precisely? The columns are determined by the structure of your data (the schema). The ClickHouse engine tries to automatically choose the optimal length of blocks inside its internal pipelines, but you can adjust it via the max_block_size and max_insert_block_size config parameters.
The optimal values of these parameters depend on the particular task. The best strategy for choosing them is to test the performance of your use case with different parameter settings.

isdown
18.11.2016
10:39:11
https://gist.github.com/sunisdown/5901db41db8d5aaacf05432a6274db58 I have 49 columns like this; I want to know the structure of a block, and why.

Виктор
18.11.2016
10:41:28
Why do you need that?
Actually, your structure does not affect the block size
Block size is an internal thing, decided according to performance reasons
By default it's 8192, and that's best for most cases
Ah, sorry, it's 65536
Do you have any performance interest in that?

isdown
18.11.2016
10:52:44
yep

Виктор
18.11.2016
10:53:39
What's ColumnA here?
And what's 'row'?

isdown
18.11.2016
10:55:34
Block 0 stores just one column, just like TinyLog

Виктор
18.11.2016
10:56:49
ClickHouse does not store data like that
Blocks are perpendicular to columns
Block size is pretty simple: it's just how many rows will be loaded and processed as a whole
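That "perpendicular" relationship can be sketched like this — data lives per column, and a block is just a horizontal slice of block-size rows taken across the columns being read (a toy model, not the real reader):

```python
# Data is stored per column; a "block" is a slice of `block_size` rows
# cut across whichever columns the query reads. Invented names throughout.

storage = {
    "a": list(range(10)),             # column 'a', 10 rows
    "b": [x * 2 for x in range(10)],  # column 'b', 10 rows
}

def read_blocks(storage, columns, block_size):
    n_rows = len(storage[columns[0]])
    for start in range(0, n_rows, block_size):
        yield {c: storage[c][start:start + block_size] for c in columns}

blocks = list(read_blocks(storage, ["a", "b"], block_size=4))
# 10 rows with block_size=4 -> three blocks of 4, 4 and 2 rows.
```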

isdown
18.11.2016
10:59:07
Thank you

Виктор
18.11.2016
10:59:34
Hope that was helpful :)

isdown
18.11.2016
10:59:41
I got it
is a block like in a row-store database?

Виктор
18.11.2016
11:04:46
Nope
Data is stored as columns
But when you need to read and process the data
Let's say you're processing 2 columns with 1,000,000 rows
Then the data will be read from disk and processed in blocks of 'block size'
So it will read 65536 rows from the 2 separate columns, and then they will be processed in one process call
We need that because we can't process all the data at once, so processing has to be divided
Better now?
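For those numbers, the chunking works out as below (a quick arithmetic sketch, not real ClickHouse code):

```python
import math

# 1,000,000 rows processed in chunks of max_block_size = 65536 rows.
rows = 1_000_000
max_block_size = 65536

num_blocks = math.ceil(rows / max_block_size)          # process calls needed
last_block = rows - (num_blocks - 1) * max_block_size  # rows in the final chunk
# -> 16 blocks: 15 full blocks of 65536 rows and a final block of 16960.
```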

isdown
18.11.2016
11:18:36
Is row data the smallest unit inside a block, or is row data split into columns that are combined inside different blocks?

Виктор
18.11.2016
11:26:34
Data is always split into columns; it's a column-oriented store

isdown
18.11.2016
11:28:32
yep, my doubt is how it is split, i.e. how different columns are combined inside a block

Виктор
18.11.2016
11:29:36
those are not dependent things

Виктор
18.11.2016
11:30:19
what matters is how many columns you use in queries
and what the usual select size is in terms of rows
If it's small, maybe you should lower the block size
And btw, we're talking about the max_block_size setting, right?

isdown
18.11.2016
11:31:19
Nope.

Виктор
18.11.2016
11:34:11
Ugh, so what are we talking about? =)

isdown
18.11.2016
11:34:39
My doubt is how it is split, i.e. how different columns are combined inside a block

Виктор
18.11.2016
11:36:42
What?
To split when?
When you insert data?

isdown
18.11.2016
11:37:16
yep

Виктор
18.11.2016
11:37:45
Ah, that's different
There is a parameter:
max_insert_block_size
It's around a million by default, and that's fine
So you can insert data up to this parameter and that will be totally fine
And again it's totally unrelated to columns
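Conceptually, a large insert gets cut into blocks of at most max_insert_block_size rows, something like this sketch (the 1_048_576 figure is an assumption standing in for "around a million" — check your server's actual default):

```python
# Sketch of splitting an INSERT into blocks of at most
# max_insert_block_size rows. Illustrative only.

MAX_INSERT_BLOCK_SIZE = 1_048_576  # assumed "around a million" default

def insert_blocks(rows, max_rows=MAX_INSERT_BLOCK_SIZE):
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]

sizes = [len(b) for b in insert_blocks(list(range(3_000_000)))]
# 3,000,000 rows -> blocks of 1048576, 1048576 and 902848 rows.
```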

isdown
18.11.2016
11:43:58
When I insert data,
it will be split and then stored into different blocks.
Is what is written to each block a single column, or a combination of columns?

Fike
18.11.2016
11:44:37
as far as I've understood, columns are stored separately from each other, and the discussed block settings don't relate to storage directly

isdown
18.11.2016
11:46:32
Do blocks not store data? Are they just intermediates for queries?

Fike
18.11.2016
11:47:37
(if, again, I understand everything correctly) each column is a separate storage entity, and when you store a record, it is split into columns, and each column receives a new entry like {id: <record id>, value: <record value>}. If that's correct (feel free to correct me if I'm wrong), there is no such term as 'block' at the column level; it's just a stream of such values.
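That description, taken literally, looks something like this toy sketch (Fike's own caveat applies — this is a mental model, not ClickHouse's actual on-disk format):

```python
# Toy version of "each column is its own stream": storing a record
# appends one {id, value} entry to every column's stream.

from collections import defaultdict

column_streams = defaultdict(list)

def store_record(record_id, record):
    for column, value in record.items():
        column_streams[column].append({"id": record_id, "value": value})

store_record(1, {"user": "alice", "clicks": 3})
store_record(2, {"user": "bob", "clicks": 5})
# column_streams["user"] now holds alice's and bob's entries, in order.
```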

isdown
18.11.2016
11:52:32
Thank you very much

Roman
18.11.2016
11:55:31
@the_real_jkee is it correct to consider such 'blocks' as transactions for inserts?

Виктор
18.11.2016
12:10:20
Yes, absolutely
snapshot isolation guarantees work inside these 'blocks'
So if you upload up to 'max_insert_block_size' of data, it is guaranteed to be stored as a whole (or you will get an error)
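The all-or-nothing behavior per inserted block can be modeled like this (a sketch of the guarantee only, not ClickHouse's implementation):

```python
# Per-block atomicity: a block is validated first and then applied
# as a whole; a failure leaves the table untouched by that block.

table = []

def insert_block(rows):
    # Validate every row before touching the table.
    for row in rows:
        if not isinstance(row, dict):
            raise ValueError("bad row; nothing from this block is stored")
    table.extend(rows)  # the whole block lands at once

insert_block([{"x": 1}, {"x": 2}])   # succeeds: 2 rows stored
try:
    insert_block([{"x": 3}, "broken"])  # fails: 0 rows from this block stored
except ValueError:
    pass
```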

Roman
18.11.2016
12:12:59

Виктор
18.11.2016
12:13:34
Nope, please do not do that =)
That's the only case where we can say 'it's a transaction'

Roman
18.11.2016
12:14:28
i guess that's enough for an analytical rdbms

Виктор
18.11.2016
12:15:39
It is

Roman
18.11.2016
12:18:21
think a bit about how to describe it better in promo materials; all the guys I know who have heard about YCH think it has absolutely no transactions, even for bulk inserts

Виктор
18.11.2016
12:21:02
ah, okay
thanks for feedback
there is also a question about distributed transactions
should that transaction mean durability in terms of server loss or not
Which is actually not so difficult, but hard to explain

Roman
18.11.2016
12:28:25

Виктор
18.11.2016
12:28:42
I mean, what are the guarantees when you insert data
For example, you have a replica set of 3

Виктор
18.11.2016
12:29:06
Your data is replicated to 3 nodes
And you insert data to one node
Should it be 'at least 2', or is 'one' enough?

Roman
18.11.2016
12:29:36

Виктор
18.11.2016
12:30:02
OK is enough; the question is what this OK means =)

Roman
18.11.2016
12:30:05

Виктор
18.11.2016
12:30:16
Nope, that's different
The replication factor is how many times your data is copied

Fike
18.11.2016
12:30:32
we had a conversation like this here
we're about to break hell loose

Roman
18.11.2016
12:31:12

Виктор
18.11.2016
12:31:39
Generally there are 2 options: async replication and sync replication
async is the default in ClickHouse, and it means 'insert into one node is OK'

Roman
18.11.2016
12:32:06
so if a snapshot is saved by 'replication factor' nodes, this insert is really ok

Виктор
18.11.2016
12:32:11
sync replication is usually really slow
It's not. It's OK in terms of guarantees, but the slowdown is easily 10x
So in most cases async replication is totally fine.
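The trade-off between the two acknowledgement styles can be caricatured like this (invented names and behavior, purely to illustrate when the client's OK arrives):

```python
# Toy contrast: async acks once one node has the block; sync-style
# waits until a write quorum of nodes has it before returning OK.

def insert_async(replicas, block):
    replicas[0].append(block)  # one node has it -> client gets OK immediately
    return "OK"                # remaining replicas catch up in the background

def insert_sync(replicas, block, quorum=2):
    written = 0
    for r in replicas:
        r.append(block)        # each write is waited on (the slow part)
        written += 1
        if written >= quorum:
            return "OK"        # OK only once `quorum` nodes have the block
    return "FAILED"

replicas = [[], [], []]
insert_async(replicas, "b1")
# After the async OK, only replica 0 is guaranteed to hold "b1".
```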

Roman
18.11.2016
12:33:04
is it possible to track 'insert block ids' and their statuses
to answer clients asynchronously if they request this status?