Chapter 10 Data Storage Structures¶
Abstract
- File Organization
- Fixed-Length Records
- Variable-Length Records
- Organization of Records in Files
- Heap File Organization
- Sequential File Organization
- Multitable Clustering File Organization
- Table Partition
- Data-Dictionary Storage
- Storage Access & Buffer manager
- LRU strategy
- Clock algorithm
- Columnar Representation
File Organization¶
- The database is stored as a collection of files.
- Each file is a sequence of records.
- A record is a sequence of fields.
- One approach:
- assume record size is fixed
- each file has records of one particular type only
- different files are used for different relations
- This case is easiest to implement; will consider variable length records later.
- We assume that records are smaller than a disk block
Fixed-Length Records¶
- Simple approach:
- Store record i starting from byte n * (i – 1), where n is the size of each record.
- Record access is simple but records may cross blocks
- Modification: do not allow records to cross block boundaries(硬件限制)
Deleltion¶
- Deletion of record i: alternatives:
- move records \(i + 1, . . ., n\) to \(i, . . . , n – 1\)
- move record n to i
- do not move records, but link all free records on a free list
Record 3 deleted
Record 3 deleted and replaced by record 11
List
Variable-Length Records¶
- Variable-length records arise in database systems in several ways:
- Storage of multiple record types in a file.
- Record types that allow variable lengths for one or more fields such as strings (varchar)
- Record types that allow repeating fields (used in some older data models).
- Attributes are stored in order
- Variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes
- Null values represented by null-value bitmap(空位图)
- Instructor(id,name, dept_name, salary)
这个图说明了空位图的使用,它用来表示记录的哪个属性是空值。在这个特定的记录中,如果salary是空值,该位图的第四位将置1,存储在12-19字节的salary值将被忽略。
在一些表示中,空位图存储在记录的开头,并且对于空属性不存储数据(值或偏移量/长度)。这种表示以抽取记录属性的额外工作为代价来节省存储空间,对于拥有大量字段,并且大多数都是空的特定应用,这样的表示特别有用
Variable-Length Records: Slotted Page Structure(分槽的页结构)¶
- Slotted page(分槽页) header contains:
- number of record entries(块头中记录条目的个数)
- end of free space in the block(块中空闲空间的末尾处)
- location and size of each record(一个由包含记录位置和大小的记录条目组成的数组)
- Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated.
- Record pointers should not point directly to record — instead they should point to the entry for the record in header.
Organization of Records in Files¶
- Heap – record can be placed anywhere in the file where there is space(堆文件组织)
- Sequential – store records in sequential order, based on the value of the search key of each record(顺序文件组织)
- In a multitable clustering file organization records of several different relations can be stored in the same file(多表聚簇文件组织)
- Motivation: store related records on the same block to minimize I/O
- B+-tree file organization
- Ordered storage even with inserts/deletes
- Hashing – a hash function computed on search key; the result specifies in which block of the file the record should be placed(散列文件组织)
Heap File Organization¶
- Records can be placed anywhere in the file where there is free space
- Records usually do not move once allocated
- Important to be able to efficiently find free space within file
- Free-space map
- Array with 1 entry per block. Each entry is a few bits to a byte, and records fraction of block that is free
- In example below, 3 bits per block, value divided by 8 indicates fraction of block that is free
- Can have second-level free-space map
- In example above, each entry stores maximum from 4 entries of first-level free-space map
- Free space map written to disk periodically, OK to have wrong (old) values for some entries (will be detected and fixed)
Sequential File Organization¶
顺序文件是为了高效处理按某个搜索码的顺序排序的记录而设计的
(search key)搜索码是任何一个属性或者属性的集合(甚至不一定是超码)
- Suitable for applications that require sequential processing of the entire file
- The records in the file are ordered by a search-key
维护物理顺序是困难的,所以我们选择使用链表结构
- Deletion – use pointer chains
- Insertion –locate the position where the record is to be inserted
- if there is free space insert there
- if no free space, insert the record in an overflow block
- In either case, pointer chain must be updated
- Need to reorganize the file from time to time to restore sequential order(重组,消耗很大,一般在系统负载很低的时候进行)
Multitable Clustering File Organization¶
很多关系数据库系统将每个关系存储在单独的文件里,以便它们可以利用操作系统所提供的文件系统的所有好处。
- Store several relations in one file using a multitable clustering file organization(在一块中存储两个或更多个关系的相关记录的文件结构,允许对连接的高效处理,但也会导致其他类型的查询变慢)
Example
department | instructor |
- good for queries involving join department \(\Join\) instructor, and for queries involving one single department and its instructors
- bad for queries involving only department
- results in variable size records
- Can add pointer chains to link records of a particular relation
Data Dictionary Storage(数据字典存储)¶
存储数据的数据(或者说是元数据),
The Data dictionary (also called system catalog) stores metadata; that is, data about data, such as
- Information about relations
- names of relations
- names, types and lengths of attributes of each relation
- names and definitions of views
- integrity constraints
- User and accounting information, including passwords
- Statistical and descriptive data
- number of tuples in each relation
- Physical file organization information
- How relation is stored (sequential/hash/…)
- Physical location of relation
- Information about indices
Relational Representation of System Metadata¶
- Relational representation on disk
- Specialized data structures designed for efficient access, in memory
Storage Access¶
Abstract
- Blocks are units of both storage allocation and data transfer.
- Database system seeks to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
- Buffer– portion of main memory available to store copies of disk blocks.
- Buffer manager – subsystem responsible for allocating buffer space in main memory.
Buffer Manager in DBMS¶
数据库系统一般希望最大化要访问的块在主存储器中的几率,这样就不需要访问磁盘了。缓冲区(buffer)是主存储器用于存储磁盘块的拷贝的那一部分。每个块总有一个拷贝存放在磁盘上,但是磁盘上的可能比缓冲区里的旧,负责缓冲区分配的子系统成为缓冲区管理器(buffer manager)
-
Programs call on the buffer manager when they need a block from disk.
- If the block is already in the buffer, buffer manager returns the address of the block in main memory
- If the block is not in the buffer, the buffer manager
- Allocates space in the buffer for the block
- Replacing (throwing out) some other block, if required, to make space for the new block.
- Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
- Reads the block from the disk to the buffer, and returns the address of the block in main memory to requester.
- Allocates space in the buffer for the block
-
Buffer replacement strategy (details coming up!)
- Pinned block: memory block that is not allowed to be written back to disk(被钉住的块,不允许写回,帮助修复系统)
- Pin done before reading/writing data from a block
- Unpin done when read /write is complete
- Multiple concurrent pin/unpin operations possible
- Keep a pin count, buffer block can be evicted only if pin count = 0
- Shared and exclusive locks on buffer
- Needed to prevent concurrent operations from reading page contents as they are moved/reorganized, and to ensure only one move/reorganize at a time
- Readers get shared lock, updates to a block require exclusive lock
- Locking rules:
- Only one process can get exclusive lock at a time
- Shared lock cannot be concurrently with exclusive lock
- Multiple processes may be given shared lock concurrently
LRU Example
- Supposing there are 4 buffer pages.
- Initial is empty occupied。
- The access sequence is 1, 4, 8, 1, 5, 2, 3, 2, 4.
- According to the LRU policy , there are 3 times replacement occurred.
- 最近最少使用
Buffer-Replacement Policies¶
- LRU strategy - replace the block least recently used .
- Idea behind LRU – use past pattern of block references as a predictor of future references
- Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user’s query to predict future references
- LRU can be a bad strategy for certain access patterns involving repeated scans of data
- For example: when computing the join of 2 relations r and s by a nested loops
-
Mixed strategy with hints on replacement strategy provided by the query optimizer is preferable
-
Toss-immediate strategy – frees the space occupied by a block as soon as the final tuple of that block has been processed(立即丢弃策略)
- Most recently used (MRU) strategy – system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block.(最近最常使用)
- Buffer managers also support forced output of blocks for the purpose of recovery(块的强制写出)
Clock: An approximation of LRU¶
- Arrange block into a cycle, store one reference_bit per block
- When pin_count reduces to 0, set reference _bit =1
- reference _bit as the 2nd chance bit
When replacement necessary
do for each block in cycle{
if (reference_bit ==1)
set reference_bit=0;
else if (reference_bit ==0)
choose this block for replacement;
} until a page is chosen;
时钟置换算法其实是LRU的变种,看起来更加“优雅”,我们给每一个节点增加一个bool数据类型的REF属性,当节点对应页面被使用的时候,将REF置为true。我们还会创建一个clock的指针hand来周期性的指向缓存中的每一个节点。需要换出时,换出的是指针hand遇见的第一个REF=false的节点。当hand遇见一个REF=true的节点时,将REF置为false后,指向下一个节点。
关于关系数据存储有三种模型,按行存放(Row Store),按列存放(Column Store),以及论文提出的PAX模型在讨论区有探讨
Column-Oriented Storage(列式存储)¶
- Also known as columnar representation
- Store each attribute of a relation separately
Example
Columnar Representation(列状表示)¶
- Benefits:
- Reduced IO if only some attributes are accessed
- Improved CPU cache performance
- Improved compression
- Vector processing on modern CPU architectures
- Drawbacks
- Cost of tuple reconstruction from columnar representation
- Cost of tuple deletion and update
- Cost of decompression
- Columnar representation found to be more efficient for decision support than row-oriented representation
- Traditional row-oriented representation preferable for transaction processing
- Some databases support both representations
- Called hybrid row/column stores
Columnar File Representation¶
- ORC(Optimized Row Columnar) and Parquet
- File formats of Hadoop HDFS
- Apache ORC is an open source, aself-describing type-aware columnar file format designed for Hadoop workloads.
- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
- File formats with columnar storage inside file
- Very popular for big-data applications
- Orc file format shown on right:
Storage Organization in Main-Memory Databases¶
- Can store records directly in memory without a buffer manager
- Column-oriented storage can be used in-memory for decision support applications
- Compression reduces memory requirement