Sort key in redshift

8/10/2023

#1: So to answer your first question the number of sort keys definitely matters.

Also this benefit is only worth the time of selecting on A and B independently are of equal importance to you. Interleaved sort keys have a window of table size where they will out perform compound. Whether the sort key are compound or interleaved doesn't matter a whole lot in this analysis. Similarly a 3rd sort key will need an even greater size of table (as compared to the number of slices in the cluster) to achieve the same benefit. However, the size of the table needs to be much much larger than in the A = 1 where clause to achieve the performance boost. If you have enough rows with A = 1 that the rows with B = 1 can fill up many blocks of data then there can still be good "block rejection". This will depend on how many rows are in your table and how big your cluster is (number of slices) (as well as the table distribution but let's not get overly complicated as this is complex enough). If you sort by A and then by B (also having the values 1 through 10) you may or may not get good "block rejection" on a query with the where clause "WHERE B = 1". The complexity comes in when you have multiple sort keys (interleaved or not). Assuming that the table has enough rows to fill up many more than 10 blocks (per slice) then the WHERE A = 1 clause can prevent the scanning of most of the table's data from disk. If A has values 1 through 10 and A is the only sort key then the first blocks for column A will have 1, then set set 2, and so on. The sorting of the table by its sortkeys will determine which rows are in which blocks and therefore what max and min values for the block will be represented in the blocks metadata. To provide significant value the metadata for all the blocks of column A need to have a fair percentage of these to not possibly contain the value 1. To provide any benefit there needs to be blocks for column A where the metadata indicates that A does not the value 1 for this block. A WHERE clause needs to be included in the query and the comparison has to be against the raw column value and a literal (like WHERE A = 1). This can only happen if the table metadata is valid for the blocks in question AND if the metadata shows that the data is not needed. If a large percentage of the blocks can be eliminated from the scan the query may be significantly sped up. If the table metadata indicates that the block is not needed for the query then this block will not be scanned (read) from disk. While many benefits can be achieved with sort keys (like merge joins) there is one dominating benefit that will make the biggest change (for large tables) and that is table scan optimization.

I can give you my quick answers to these questions but first let's understand what is trying to be achieved with sort keys.

0 Comments

Sort key in redshift

Leave a Reply.

Author

Archives

Categories