Distributed database system

A distributed database is a database that is not confined to a single system; it is spread over different sites, i.e., over multiple computers or a network of computers. A distributed database system is located on various sites that do not share physical components. This may be required when a particular database needs to be accessed by various users globally. It must be managed in such a way that, to its users, it looks like one single database.

A Distributed Database Management System (DDBMS) contains a single logical database that is divided into a number of fragments.

Every fragment gets stored on one or more computers under the control of a separate DBMS, with the computers connected by a communications network.

Each site is capable of independently processing user requests that require access to local data (i.e., each site of the distributed system has some basic degree of local autonomy) and is also able to process data stored on other computers within the network.

Users access the distributed database via applications, which are classified as those that do not require data from other sites (local applications) and those that do require data from other sites (global applications).

A distributed database, therefore, has the following characteristics:

  • A collection of logically related shared data
  • The data is split into a number of fragments
  • Fragments may be replicated
  • Fragments/replicas are allocated to sites
  • The sites are linked by a communications network
  • The data at each site is under the control of a DBMS
  • The DBMS at each site can handle local applications autonomously
  • Each DBMS participates in at least one global application

Distributed Database Architectures

DDBMS architectures are generally developed depending on three parameters −

  • Distribution − It states the physical distribution of data across the different sites.
  • Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
  • Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and databases.

Architectural Models

Some of the common architectural models are −

  • Client-Server Architecture for DDBMS
  • Peer-to-Peer Architecture for DDBMS
  • Multi-DBMS Architecture

Client-Server Architecture for Distributed Databases

This is a two-level architecture where the functionality is divided into servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management.

Client functions mainly include the user interface. However, clients also perform some functions such as consistency checking and transaction management.

The two main client-server architectures are −

  • Single Server Multiple Client
  • Multiple Server Multiple Client

 


 

Peer-to-Peer Architecture for Distributed Database Systems

In these systems, each peer acts both as a client and as a server for providing distributed database services. The peers share their resources with other peers and coordinate their activities.

This architecture generally has four levels of schemas −

  • Global Conceptual Schema − Depicts the global logical view of the distributed database system.
  • Local Conceptual Schema − Depicts the logical data organization at each site in the distributed database system.
  • Local Internal Schema − Depicts the physical data organization at each site in the distributed database system.
  • External Schema − Depicts a user's view of the distributed database.


 

Multi-DBMS Architecture of Distributed Database Systems

This is an integrated database system formed by a collection of two or more autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas −

  • Multi-database View Level − Depicts multiple user views, each comprising a subset of the integrated distributed database.
  • Multi-database Conceptual Level − Depicts the integrated multi-database, comprising the global logical multi-database structure definitions.
  • Multi-database Internal Level − Depicts the data distribution across different sites and the multi-database to local data mapping.
  • Local database View Level − Depicts the public view of local data in the distributed database system.
  • Local database Conceptual Level − Depicts the local data organization at each site in the distributed database system.
  • Local database Internal Level − Depicts the physical data organization at each site in the distributed database system.

There are two design alternatives for multi-DBMS −

  • Model with multi-database conceptual level.
  • Model without multi-database conceptual level.

 


Levels of Distribution Transparency in distributed database

Distribution transparency is the property of distributed databases by which the internal details of the distribution are hidden from the users.

The three dimensions of distribution transparency are −

  • Location transparency
  • Fragmentation transparency
  • Replication transparency

Location Transparency in distributed database

Location transparency ensures that the user can query any table(s) or fragment(s) of a table as if they were stored locally at the user's site.

The end user should be completely unaware of the fact that the table or its fragments are stored at a remote site in the distributed database system.

The address of the remote site(s) and the access mechanisms are completely hidden.

In order to provide location transparency, the DDBMS should have access to an updated and accurate data dictionary and DDBMS directory that contain the details of the locations of data.
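
As a rough illustration of this idea, the following sketch (not tied to any particular DDBMS; the table, fragment and site names are made up) shows how a catalog that maps tables and fragments to sites lets a query be routed without the user ever naming a site.

    # Minimal sketch: a catalog maps each table or fragment to the site holding
    # it, so a query can be routed while the location stays hidden from the user.
    # All table and site names below are hypothetical.
    CATALOG = {
        "EMPLOYEE_F1": "site_A",   # fragment of EMPLOYEE stored at site A
        "EMPLOYEE_F2": "site_B",   # fragment of EMPLOYEE stored at site B
        "DEPARTMENT":  "site_C",
    }

    def route_query(table_or_fragment, query):
        """Look up the owning site in the catalog and 'send' the query there."""
        site = CATALOG[table_or_fragment]   # the location lookup is invisible to the user
        return "executing at " + site + ": " + query

    # The user writes the query as if the data were local:
    print(route_query("EMPLOYEE_F2", "SELECT * FROM EMPLOYEE WHERE dept = 10"))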

 

Fragmentation Transparency in distributed database

Fragmentation transparency enables users to query any table as if it were unfragmented.

Thus, it hides the fact that the table the user is querying is actually a fragment or a union of some fragments.

It also conceals the fact that the fragments are located at diverse sites.

This is somewhat similar to SQL views, where the user may not know that they are using a view of a table instead of the table itself.

 

Replication Transparency in distributed database

Replication transparency ensures that the replication of databases is hidden from the users.

It enables users to query a table as if only a single copy of the table exists.

Replication transparency is associated with concurrency transparency and failure transparency (a small sketch follows the points below).

  • Whenever a user updates a data item, the update is reflected in all the copies of the table. However, this operation should not be visible to the user. This is concurrency transparency.
  • Also, in case of failure of a site, the user can still proceed with queries using replicated copies, without any knowledge of the failure. This is failure transparency.
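
The sketch below is purely illustrative (the class, site names and data are invented). It shows one logical table backed by several copies: an update is written to every copy, while a read uses a single copy, so the caller never deals with the replicas directly.

    # Illustrative sketch only: one logical table backed by several replicas.
    # update()/read() are called on the logical table; the propagation to all
    # copies and the choice of which copy to read are hidden from the caller.
    class ReplicatedTable:
        def __init__(self, replica_sites):
            # one dict per site stands in for that site's copy of the table
            self.replicas = {site: {} for site in replica_sites}

        def update(self, key, value):
            for copy in self.replicas.values():   # the update is reflected in all copies
                copy[key] = value

        def read(self, key, preferred_site):
            return self.replicas[preferred_site].get(key)   # e.g. read the local copy

    t = ReplicatedTable(["site_A", "site_B", "site_C"])
    t.update("emp_42", {"name": "Jane", "dept": 10})
    print(t.read("emp_42", preferred_site="site_B"))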

 

Combination of Transparencies

In any distributed database system, the designer should ensure that all the stated transparencies are maintained to a considerable extent.

The designer may choose to fragment tables, replicate them and store them at different sites, all hidden from the end user.

However, complete distribution transparency is a tough task and requires considerable design effort.

 

Replication

Data Replication

Data replication is the process of storing separate copies of the database at two or more sites.

It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication

  • Reliability − In case of failure of any site, the database system continues to work, since a copy is available at another site(s).
  • Reduction in Network Load − Since local copies of the data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
  • Quicker Response − Availability of local copies of data ensures quick query processing and consequently quick response time.
  • Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication

  • Increased Storage Requirements − Maintaining multiple copies of data is associated with increased storage costs. The storage space required is a multiple of the storage required for a centralized system.
  • Increased Cost and Complexity of Data Updating − Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
  • Undesirable Application-Database Coupling − If complex update mechanisms are not used, removing data inconsistency requires complex coordination at the application level. This results in undesirable coupling between the applications and the database.

 

Distributed database design – Fragmentation

Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments.

Fragmentation can be of three types:

  • horizontal,
  • vertical, and
  • hybrid (combination of horizontal and vertical).

 

Horizontal fragmentation can further be classified into two techniques:

primary horizontal fragmentation and derived horizontal fragmentation.

Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called reconstructiveness.

Advantages of Fragmentation

  • Since data is stored close to the site of usage, efficiency of the database system is increased.
  • Local query optimization techniques are sufficient for most queries since data is locally available.
  • Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.

Disadvantages of Fragmentation

  • When data from different fragments are required, the access speeds may be very low.
  • In case of recursive fragmentation, the job of reconstruction will need expensive techniques.
  • Lack of back-up copies of data in different sites may render the database ineffective in case of failure of a site.

Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments.

In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table.
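
As a small illustration, the sketch below (using a made-up EMPLOYEE table in plain Python, not any real DDBMS API) splits the columns into two vertical fragments that both keep the primary key, and then reconstructs the original rows by joining the fragments on that key.

    # Sketch of vertical fragmentation on a hypothetical EMPLOYEE table:
    # each fragment keeps the primary key (emp_id) so the original rows can
    # be reconstructed by joining the fragments on that key.
    employees = [
        {"emp_id": 1, "name": "Asha", "dept": "HR",  "salary": 52000},
        {"emp_id": 2, "name": "Ravi", "dept": "ENG", "salary": 78000},
    ]

    def project(rows, columns):
        return [{c: r[c] for c in columns} for r in rows]

    frag_personal = project(employees, ["emp_id", "name", "dept"])   # stored at site A
    frag_payroll  = project(employees, ["emp_id", "salary"])         # stored at site B

    def join_on_key(left, right, key="emp_id"):
        right_by_key = {r[key]: r for r in right}
        return [{**l, **right_by_key[l[key]]} for l in left]

    assert join_on_key(frag_personal, frag_payroll) == employees     # reconstructiveness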

Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table according to the values of one or more fields.

Horizontal fragmentation should also conform to the rule of reconstructiveness.

Each horizontal fragment must have all the columns of the original base table.
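
The sketch below, again using an invented EMPLOYEE table, shows primary horizontal fragmentation by a predicate on the dept field and reconstruction of the table as the union of the fragments.

    # Sketch of (primary) horizontal fragmentation of a hypothetical EMPLOYEE
    # table: rows are split by a predicate on the dept field, each fragment
    # keeps all columns, and the table is rebuilt as the union of the fragments.
    employees = [
        {"emp_id": 1, "name": "Asha",  "dept": "HR",  "salary": 52000},
        {"emp_id": 2, "name": "Ravi",  "dept": "ENG", "salary": 78000},
        {"emp_id": 3, "name": "Meera", "dept": "ENG", "salary": 81000},
    ]

    frag_hr  = [r for r in employees if r["dept"] == "HR"]    # allocated to the HR site
    frag_eng = [r for r in employees if r["dept"] == "ENG"]   # allocated to the ENG site

    reconstructed = frag_hr + frag_eng                        # union of the fragments
    assert sorted(reconstructed, key=lambda r: r["emp_id"]) == employees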

 

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used.

This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information.

However, reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways −

  • First, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
  • First, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.

 

 

Allocation Criteria and Database Control

Database control refers to the task of enforcing regulations so as to provide correct data to authentic users and applications of a database.

In order that correct data is available to users, all data should conform to the integrity constraints defined in the database.

Besides, data should be screened from unauthorized users so as to maintain the security and privacy of the database. Database control is one of the primary tasks of the database administrator (DBA).

The three dimensions of database control are −

  • Authentication
  • Access rights
  • Integrity constraints

Authentication

In a distributed database system, authentication is the process through which only legitimate users can gain access to the data resources.

Authentication can be enforced in two levels −

  • Controlling Access to the Client Computer − At this level, user access is restricted while logging into the client computer that provides the user interface to the database server. The most common method is a username/password combination. However, more sophisticated methods like biometric authentication may be used for high-security data.
  • Controlling Access to the Database Software − At this level, the database software/administrator assigns credentials to the user. The user gains access to the database using these credentials. One method is to create a login account within the database server.

Access Rights

A user's access rights refer to the privileges that the user is given regarding DBMS operations, such as the right to create a table, drop a table, add/delete/update tuples in a table, or query the table.

In distributed environments, since there is a large number of tables and an even larger number of users, it is not feasible to assign individual access rights to users. So, a DDBMS defines certain roles. A role is a construct with certain privileges within a database system. Once the different roles are defined, individual users are assigned one of these roles. Often a hierarchy of roles is defined according to the organization's hierarchy of authority and responsibility.

 

Translation of Global Queries / Global Query Optimisation, Query Execution and access plan

When a query is placed, it is first scanned, parsed and validated. An internal representation of the query, such as a query tree or a query graph, is then created.

Then alternative execution strategies are devised for retrieving results from the database tables.

The process of choosing the most appropriate execution strategy for query processing is called query optimization.

 

Query Optimization Issues in distributed database system

In a distributed database, query optimization is a crucial task.

The complexity is high, since the number of alternative strategies may increase exponentially due to the following factors −

  • The presence of a number of fragments in distributed database.
  • Distribution of the fragments or tables across various sites.
  • The speed of communication links.
  • Disparity in local processing capabilities in distributed database.

Hence, in a distributed system, the target is often to find a good execution strategy for query processing rather than the best one.

The time to execute a query is the sum of the following (a toy cost function follows the list) −

  • Time to communicate queries to databases.
  • Time to execute local query fragments.
  • Time to assemble data from different sites.
  • Time to display results to the application.
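
As a toy illustration of this cost model (the numbers are placeholders, not measurements), the total time is simply the sum of the four components:

    # Toy cost model matching the list above; all values are illustrative.
    def query_time(communicate_s, local_exec_s, assemble_s, display_s):
        return communicate_s + local_exec_s + assemble_s + display_s

    # e.g. 0.05 s to ship sub-queries, 0.40 s of local execution at the sites,
    # 0.20 s to assemble partial results, 0.01 s to display them:
    print(round(query_time(0.05, 0.40, 0.20, 0.01), 2))   # -> 0.66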

Query Processing

Query processing is a set of all activities starting from query placement to displaying the results of the query.

The main steps are scanning, parsing and validating the query, building an internal representation, choosing an execution plan through query optimization, and executing the plan to produce the result.

 

Concurrency control in distributed database system

Concurrency control – 2 phase locks

Concurrency controlling techniques ensure that multiple transactions are executed simultaneously while maintaining the ACID properties of the transactions and serializability in the schedules.

There are various approaches to concurrency control, as discussed below −

Locking Based Concurrency Control Protocols

Locking-based concurrency control protocols use the concept of locking data items.

A lock is a variable associated with a data item that determines whether read/write operations can be performed on that data item.

Generally, a lock compatibility matrix is used which states whether a data item can be locked by two transactions at the same time.
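
A minimal sketch of such a compatibility matrix, assuming only shared (S) and exclusive (X) lock modes, might look like this:

    # Sketch of a lock compatibility matrix for shared (S) and exclusive (X)
    # locks: True means a second transaction may acquire the requested lock
    # while another transaction already holds the existing one.
    COMPATIBLE = {
        ("S", "S"): True,    # two readers can share the item
        ("S", "X"): False,   # a writer must wait for readers
        ("X", "S"): False,   # a reader must wait for a writer
        ("X", "X"): False,   # two writers can never share the item
    }

    def can_grant(held_mode, requested_mode):
        return COMPATIBLE[(held_mode, requested_mode)]

    print(can_grant("S", "S"))   # True
    print(can_grant("S", "X"))   # False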

Locking-based concurrency control systems can use either one-phase or two-phase locking protocols.

 

  • One-phase Locking Protocol

In this method, each transaction locks an item before use and releases the lock as soon as it has finished using it.

This locking method provides for maximum concurrency but does not always enforce serializability.

  • Two-phase Locking Protocol

In this method, all locking operations precede the first lock-release or unlock operation.

The transaction comprises two phases −

  • Growing phase − In the first phase, the transaction only acquires the locks it needs and does not release any lock. This is also called the expanding phase.
  • Shrinking phase − In the second phase, the transaction releases its locks and cannot request any new locks.

Every transaction that follows the two-phase locking protocol is guaranteed to be serializable. However, this approach provides low parallelism between two conflicting transactions.
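
The following sketch (an illustrative toy class, not a real lock manager) enforces just the two-phase rule: locks may be acquired only until the first release, after which any further lock request is rejected.

    # Minimal sketch of the two-phase rule: a transaction may acquire locks
    # only while it has not yet released any (growing phase); after the first
    # release (shrinking phase) any further lock request is rejected.
    class TwoPhaseTransaction:
        def __init__(self):
            self.held = set()
            self.shrinking = False

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violation: lock requested after an unlock")
            self.held.add(item)           # growing phase

        def unlock(self, item):
            self.shrinking = True         # the first unlock ends the growing phase
            self.held.discard(item)

    t = TwoPhaseTransaction()
    t.lock("A")
    t.lock("B")
    t.unlock("A")                          # shrinking phase begins here
    # t.lock("C") would now raise: no new locks after the first release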

 

Distributed Two-phase Locking Algorithm in distributed database system

The basic principle of distributed two-phase locking is the same as that of the basic two-phase locking protocol. However, in a distributed system there are sites designated as lock managers.

A lock manager controls lock acquisition requests from transaction monitors.

In order to enforce coordination between the lock managers at the various sites, at least one site is given the authority to see all transactions and detect lock conflicts.

Depending upon the number of sites that can detect lock conflicts, distributed two-phase locking approaches can be of three types −

  • Centralized two-phase locking − In this approach, one site is designated as the central lock manager. All the sites in the environment know the location of the central lock manager and obtain locks from it during transactions.
  • Primary copy two-phase locking − In this approach, a number of sites are designated as lock control centers. Each of these sites has the responsibility of managing a defined set of locks. All the sites know which lock control center is responsible for managing the locks of which data table/fragment.
  • Distributed two-phase locking − In this approach, there are a number of lock managers, where each lock manager controls the locks of data items stored at its local site. The location of the lock manager is based upon data distribution and replication.

 

Timestamp Concurrency Control Algorithms

Timestamp-based concurrency control algorithms use a transaction's timestamp to coordinate concurrent access to a data item and to ensure serializability.

A timestamp is a unique identifier given by the DBMS to a transaction, representing the transaction's start time.

These algorithms ensure that transactions commit in the order dictated by their timestamps.

An older transaction should commit before a younger transaction, since the older transaction enters the system before the younger one.

Timestamp-based concurrency control techniques generate serializable schedules such that the equivalent serial schedule is arranged in order of the age of the participating transactions.

Some of the timestamp-based concurrency control algorithms are −

  • Basic timestamp ordering algorithm.
  • Conservative timestamp ordering algorithm.
  • Multiversion algorithm based upon timestamp ordering.

Timestamp-based ordering follows three rules to enforce serializability (a small sketch follows the list) −

  • Access Rule − When two transactions try to access the same data item simultaneously, for conflicting operations, priority is given to the older transaction. This causes the younger transaction to wait for the older transaction to commit first.
  • Late Transaction Rule − If a younger transaction has written a data item, then an older transaction is not allowed to read or write that data item. This rule prevents the older transaction from committing after the younger transaction has already committed.
  • Younger Transaction Rule − A younger transaction can read or write a data item that has already been written by an older transaction.
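
A much-simplified sketch of these rules, in the style of basic timestamp ordering (it ignores restart handling and assumes a single site), is shown below; each item remembers the timestamps of the youngest transactions that have read and written it.

    # Simplified sketch of basic timestamp ordering: a conflicting operation
    # from an older transaction that arrives after a younger transaction has
    # already accessed the item is rejected, so the equivalent serial order
    # follows the transactions' timestamps.
    class Item:
        def __init__(self):
            self.read_ts = 0    # timestamp of the youngest transaction that read it
            self.write_ts = 0   # timestamp of the youngest transaction that wrote it

    def read(item, ts):
        if ts < item.write_ts:               # a younger transaction already wrote it
            return "reject (restart transaction)"
        item.read_ts = max(item.read_ts, ts)
        return "ok"

    def write(item, ts):
        if ts < item.read_ts or ts < item.write_ts:
            return "reject (restart transaction)"
        item.write_ts = ts
        return "ok"

    x = Item()
    print(write(x, ts=5))   # ok
    print(read(x, ts=3))    # reject: an older transaction arrives after a younger write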

 

Distributed Timestamp Concurrency Control in distributed database system

In a centralized system, the timestamp of any transaction is determined by the physical clock reading. But in a distributed system, any site's local physical/logical clock readings cannot be used as global timestamps, since they are not globally unique. So, a timestamp is made up of a combination of the site ID and that site's clock reading.
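
A minimal sketch of such a globally unique timestamp, assuming a simple counter stands in for each site's clock, is shown below; comparing the (clock reading, site ID) pairs orders transactions by clock first and breaks ties by site ID.

    # Sketch: a globally unique timestamp built from (local clock reading, site id).
    # No two sites ever produce the same timestamp, because the site id breaks ties.
    import itertools

    class SiteClock:
        def __init__(self, site_id):
            self.site_id = site_id
            self.counter = itertools.count(1)   # stands in for the local clock

        def new_timestamp(self):
            return (next(self.counter), self.site_id)

    s1, s2 = SiteClock(site_id=1), SiteClock(site_id=2)
    t_a = s1.new_timestamp()    # (1, 1)
    t_b = s2.new_timestamp()    # (1, 2)
    print(t_a < t_b)            # True: equal clock readings, tie broken by site id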

For implementing timestamp ordering algorithms, each site has a scheduler that maintains a separate queue for each transaction manager. During a transaction, a transaction manager sends a lock request to the site's scheduler. The scheduler puts the request into the corresponding queue in increasing timestamp order. Requests are processed from the front of the queues in the order of their timestamps, i.e. the oldest first.

 

 

Distributed deadlocks in distributed database system

What are Deadlocks?

Deadlock is a state of a database system with two or more transactions, in which each transaction is waiting for a data item that is locked by some other transaction.

Deadlock Handling in Distributed Systems

Transaction processing in a distributed database system is also distributed, i.e. the same transaction may be processed at more than one site.

The two main deadlock handling concerns in a distributed database system that are not present in a centralized system are transaction location and transaction control.

Once these concerns are addressed, deadlocks are handled through any of deadlock prevention, deadlock avoidance or deadlock detection and removal methods.

 

Transaction Location in  distributed database system

Transactions in a distributed database system are processed in multiple sites and use data items in multiple sites.

The amount of data processing is not uniformly distributed among these sites.

The time period of processing also varies.

Thus the same transaction may be active at some sites and inactive at others.

When two conflicting transactions are located at a site, it may happen that one of them is in an inactive state.

This condition does not arise in a centralized system. This concern is called the transaction location issue.

This concern may be addressed by the daisy chain model.

 

Daisy Chain model in distributed database system

In this model, a transaction carries certain details when it moves from one site to another.

Some of the details it carries are −

  • the list of tables required,
  • the list of sites required,
  • the list of visited tables and sites,
  • the list of tables and sites that are yet to be visited
  • and the list of acquired locks with types.

After a transaction terminates by either commit or abort, the information should be sent to all the concerned sites.
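
Purely as an illustration, the context a transaction might carry from site to site could be modelled as below; all field names are hypothetical and not taken from any particular system.

    # Illustrative sketch of the information a transaction could carry from
    # site to site in the daisy chain model; the field names are made up.
    from dataclasses import dataclass, field

    @dataclass
    class DaisyChainContext:
        tables_required: list = field(default_factory=list)
        sites_required: list = field(default_factory=list)
        tables_and_sites_visited: list = field(default_factory=list)
        tables_and_sites_pending: list = field(default_factory=list)
        locks_acquired: dict = field(default_factory=dict)   # item -> lock type

    ctx = DaisyChainContext(
        tables_required=["EMPLOYEE", "DEPARTMENT"],
        sites_required=["site_A", "site_B"],
        tables_and_sites_visited=[("EMPLOYEE", "site_A")],
        tables_and_sites_pending=[("DEPARTMENT", "site_B")],
        locks_acquired={"EMPLOYEE": "S"},
    )
    print(ctx)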

 

Transaction Control in distributed database system

Transaction control is concerned with designating and controlling the sites required for processing a transaction in a distributed database system.

There are many options regarding the choice of where to process the transaction and how to designate the center of control, like −

  • One server may be selected as the center of control.
  • The center of control may travel from one server to another.
  • The responsibility of controlling may be shared by a number of servers.

 

“non-blocking” commitment protocol

A “non-blocking” commitment protocol is one that ensures that the sites participating in a multi-site transaction do not block (i.e. they do not have to wait for the recovery of a failed site before terminating the transaction), in spite of any single failure.

 

Partitioned networks, Check points and Cold starts

Network partition

It refers to the decomposition of a network into relatively independent subnets, either deliberately for their separate optimization or as a result of the failure of network devices.

In both cases, partition-tolerant behavior of the subnets is expected.

This means that even after the network is partitioned into multiple sub-systems, it still works correctly.

For example, in a network with multiple subnets where nodes A and B are located in one subnet and nodes C and D are in another, a partition occurs if the network switch device between the two subnets fails.

In that case nodes A and B can no longer communicate with nodes C and D, but all nodes A-D work the same as before.

In some cases, detecting and handling network partitions is not straightforward, because these kinds of partitions do not produce clearly separated, distinct groups. Instead, they create overlapping partitioned groups or, worse, asymmetric communication failures between members. This can be called partial network partitioning.

The solution to this problem is to determine the largest set of fully connected members and artificially separate these members from the rest.

This way, we end up with a completely separated set of members without any intersection.

 

Checkpoint in distributed database system

  • A checkpoint is a mechanism by which all the previous logs are removed from the system and permanently stored on the storage disk.
  • The checkpoint is like a bookmark. During the execution of transactions, such checkpoints are marked, and as the transactions execute, log files are created from their steps.
  • When a checkpoint is reached, the transactions logged so far are applied to the database, and the log entries up to that point are removed from the log file. The log file is then updated with the steps of new transactions until the next checkpoint, and so on.
  • The checkpoint is used to declare a point before which the DBMS was in a consistent state and all transactions were committed.

In a fault-tolerant distributed system, backward error recovery calls for the system to routinely save its state onto stable storage. In particular, we need to take a distributed snapshot, also known as a consistent global state.

If a process P has recorded the receipt of a message in a distributed snapshot, then there should also be a process Q that has recorded the sending of that message. It has to originate somewhere, after all.

In backward error recovery techniques, each process periodically saves its state to locally accessible stable storage. To recover from a process or system failure, we must construct a consistent global state from these local states. In particular, recovery to the most recent distributed snapshot, also known as a recovery line, is recommended. In other words, a recovery line represents the most recent consistent collection of checkpoints.

Coordinated Checkpointing:

As the name suggests, coordinated checkpointing synchronises all processes so that they write their state to local stable storage at the same time. The key benefit of coordinated checkpointing is that the saved state is automatically globally consistent, preventing the cascading rollbacks that could cause a domino effect.
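
A very simplified sketch of the idea (no failures, messages or real stable storage are modelled) is shown below: a coordinator asks every process to take a tentative checkpoint and, once all have done so, the set of saved states forms the recovery line.

    # Simplified sketch of coordinated checkpointing: phase 1 asks every
    # process to save its local state; phase 2 makes the checkpoints permanent
    # once all processes have answered, so the saved states are consistent.
    class Process:
        def __init__(self, name):
            self.name = name
            self.state = 0
            self.checkpoint = None

        def take_tentative_checkpoint(self):
            self.checkpoint = self.state        # save local state to "stable storage"
            return True

    def coordinated_checkpoint(processes):
        acks = [p.take_tentative_checkpoint() for p in processes]   # phase 1
        if all(acks):                                               # phase 2
            return {p.name: p.checkpoint for p in processes}        # recovery line
        return None

    procs = [Process("P"), Process("Q"), Process("R")]
    print(coordinated_checkpoint(procs))    # {'P': 0, 'Q': 0, 'R': 0}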

Message Logging:

The core principle of message logging is that, if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state from stable storage. Instead, any messages that have been sent since the last checkpoint are simply retransmitted and handled appropriately.

As the system executes, messages are recorded on stable storage. A message is called logged if its data and the index of the stable interval that stored it are both recorded on stable storage.

Since the transmission of messages can be replayed, we can recover the logged messages and continue the execution from a globally consistent state.

 

Cold start

Cold start in computing refers to a problem where a system or a part of it has just been created or restarted and is not yet working at its normal level of operation.

The problem can be related to initializing internal objects, populating caches or starting up subsystems.

 

Quorum-based voting in commit protocols

A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system.

A quorum-based technique is implemented to enforce consistent operation in a distributed system.

Quorum-based voting can be used as a replica control method, as well as a commit method to ensure transaction atomicity in the presence of network partitioning.

In a distributed database system, a transaction could execute its operations at multiple sites.

Since atomicity requires every distributed transaction to be atomic, the transaction must have the same fate (commit or abort) at every site.

In case of network partitioning, sites are partitioned and the partitions may not be able to communicate with each other.

This is where a quorum-based technique comes in. The fundamental idea is that a transaction is executed if the majority of sites vote to execute it.

Every site in the system is assigned a vote Vi.

Let us assume that the total number of votes in the system is V and that the abort and commit quorums are Va and Vc, respectively.

Then the following rules must be obeyed in the implementation of the commit protocol:

  1. Va + Vc > V, where 0 < Va, Vc ≤ V.
  2. Before a transaction commits, it must obtain a commit quorum Vc:
    the sum of the votes of at least one site that is prepared to commit plus zero or more waiting sites must be ≥ Vc.
  3. Before a transaction aborts, it must obtain an abort quorum Va:
    the sum of the votes of the sites that are prepared to abort plus any waiting sites must be ≥ Va.

The first rule ensures that a transaction cannot be committed and aborted at the same time.

The next two rules indicate the votes that a transaction has to obtain before it can terminate one way or the other.
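
The sketch below illustrates the three rules with made-up vote values V = 5, Vc = 3 and Va = 3, chosen so that rule 1 holds and the commit and abort quorums can never be formed by disjoint groups of sites.

    # Sketch of the quorum rules above, with illustrative values (one vote per site).
    V, Vc, Va = 5, 3, 3
    assert Va + Vc > V and 0 < Vc <= V and 0 < Va <= V       # rule 1

    def can_commit(votes_prepared_to_commit, votes_waiting):
        # rule 2: at least one site prepared to commit, plus waiting sites, >= Vc
        return votes_prepared_to_commit >= 1 and \
               votes_prepared_to_commit + votes_waiting >= Vc

    def can_abort(votes_prepared_to_abort, votes_waiting):
        # rule 3: sites prepared to abort plus waiting sites >= Va
        return votes_prepared_to_abort + votes_waiting >= Va

    print(can_commit(votes_prepared_to_commit=2, votes_waiting=1))   # True  (3 >= Vc)
    print(can_abort(votes_prepared_to_abort=1, votes_waiting=1))     # False (2 < Va)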

Node and link failure recoveries

Network Failure

Network failures are prevalent in distributed or network databases.

These comprise the errors induced in the database system due to the distributed nature of the data and the transfer of data over the network.

The causes of network failure are as follows −

  • Communication link failure.
  • Network congestion.
  • Information corruption during transfer.
  • Site failures.
  • Network partitioning.

Commit Protocols

Any database system should guarantee that the desirable properties of a transaction are maintained even after failures. If a failure occurs during the execution of a transaction, it may happen that some of the changes brought about by the transaction are not committed.

This leaves the database in an inconsistent state. Commit protocols prevent this scenario using either transaction undo (rollback) or transaction redo (roll forward).

 

What is a data dictionary in a DBMS? / How do we use a data dictionary? / Data dictionary, system catalog, metadata / Define data dictionary

 

Data Dictionary

A data dictionary is a centralized repository that stores definitions of the data elements and their relationships.

The DBMS uses the data dictionary to look up the required data component structures and relationships.

The data dictionary is referred to by different names, such as ‘system catalogue’ and ‘metadata’ (data about data).

 

Any changes made in a database structure are automatically recorded in the data dictionary, thereby freeing you from having to modify all of the programs that access the changed structure.

For example, Oracle modifies the data dictionary every time a DDL (Data Definition Language) statement is executed.

 

What does it store?

 

It stores information about:

  • various data elements (database objects), like
    • tables [table names, attributes, attribute sizes, data types etc.],
    • views [view definitions],
    • indexes [all indexes and the tables on which they are defined],
    • procedures,
    • functions,
    • triggers etc.
  • integrity constraint information
  • users and privileges
  • space management data etc.

 

Example:

 

If you create a table, you are using a DDL statement. This statement creates the table as per the definition and updates the data dictionary by storing the following information:

  • table name, column names, data types, sizes, constraints etc.
  • user privileges as per the user who created that table

Need of Data Dictionary

  • A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. Often a data dictionary is a centralized metadata repository.
  • A first step in analyzing a system of interactive objects is to identify each one and its relationship to other objects. This process is called data modeling and results in a picture of object relationships.
  • After each data object or item is given a descriptive name, its relationship is described, or it becomes part of some structure that implicitly describes the relationship. The type of data, such as text, image or binary value, is described; possible predefined default values are listed; and a brief textual description is provided. This data collection can be organized for reference into a book called a data dictionary.
  • When developing programs that use the data model, a data dictionary can be consulted to understand where a data item fits in the structure, what values it may contain and what the data item means in real-world terms. For example, a bank or group of banks could model the data objects involved in consumer banking. They could then provide a data dictionary for a bank’s programmers. The data dictionary would describe each of the data items in its data model for consumer banking, such as “Account holder” and “Available credit.”

Types of data dictionaries

There are two types of data dictionaries. Active and passive data dictionaries differ in level of automatic synchronization.

  • Active data dictionaries. These are data dictionaries created within the databases they describe; they automatically reflect any updates or changes in their host databases. This avoids any discrepancies between the data dictionaries and their database structures.
  • Passive data dictionaries. These are data dictionaries created as new databases — separate from the databases they describe — for the purpose of storing data dictionary information. Passive data dictionaries require an additional step to stay in sync with the databases they describe and must be handled with care to ensure there are no discrepancies.

Data dictionary components

Specific contents in a data dictionary can vary. In general, these components are various types of metadata, providing information about data.

  • Data object listings (names and definitions)
  • Data element properties (such as data type, unique identifiers, size, nullability, indexes and optionality)
  • Entity-relationship diagrams (ERD)
  • System-level diagrams
  • Reference data
  • Missing data and quality-indicator codes
  • Business rules (such as for validation of data quality and schema objects)

How to create a data dictionary

When planning to create a data dictionary, it is important to consider all available data management resources, including databases and spreadsheets.

Most database management systems (DBMSes), as well as information systems created by computer-aided software engineering (CASE) tools, contain integrated active data dictionaries.

For example, the Analyzer tool for Microsoft Access — which analyzes and documents databases — can be used to create a data dictionary from Access-based or Access-connected data.

If a machine-readable data dictionary cannot be automatically generated, it is suggested to submit a data dictionary from a single source as a spreadsheet.

Within Excel, .XLS or .XLSX spreadsheets can be made into data dictionaries. Online templates are useful for creating this type of data dictionary.

Pros and cons of data dictionaries

Data dictionaries can be a valuable tool for the organization and management of large data listings. Other pros include:

  • Provides organized, comprehensive list of data
  • Easily searchable
  • Can provide reporting and documentation for data across multiple programs
  • Simplifies the structure for system data requirements
  • No data redundancy
  • Maintains data integrity across multiple databases
  • Provides relationship information between different database tables
  • Useful in the software design process and test cases

Though they provide thorough listings of data attributes, data dictionaries may be difficult to use for some users. Other cons include:

  • Functional details not provided
  • Not visually appealing
  • Difficult to understand for non-technical users

 

Distributed Database Administration

Some of the important responsibilities of a distributed database administrator are as follows −

  • Site Autonomy
  • Distributed Database Security
  • Auditing Database Links
  • Administration Tools

 

Site Autonomy

Site autonomy means that each server participating in a distributed database is administered independently from all other databases.

Although several databases can work together, each database is a separate repository of data that is managed individually.

Some of the benefits of site autonomy in an Oracle Database distributed database include:

  • Nodes of the system can mirror the logical organization of companies or groups that need to maintain independence.
  • Local administrators control corresponding local data. Therefore, each database administrator’s domain of responsibility is smaller and more manageable.
  • Independent failures are less likely to disrupt other nodes of the distributed database. No single database failure need halt all distributed operations or be a performance bottleneck.
  • Administrators can recover from isolated system failures independently from other nodes in the system.
  • A data dictionary exists for each local database. A global catalog is not necessary to access local data.
  • Nodes can upgrade software independently.

Although Oracle Database permits you to manage each database in a distributed database system independently, you should not ignore the global requirements of the system. For example, you may need to:

  • Create additional user accounts in each database to support the links that you create to facilitate server-to-server connections.
  • Set additional initialization parameters, etc. in order to work on the distributed database.

 

Distributed Database Security

The database supports all of the security features as discussed below:

  • Password authentication for users and roles
  • Some types of external authentication for users and roles including:
    • Kerberos version 5 for connected user links
    • DCE for connected user links
  • Login packet encryption for client-to-server and server-to-server connections

The following sections explain some additional topics to consider when configuring an Oracle Database distributed database system:

  • Authentication Through Database Links
  • Authentication Without Passwords
  • Supporting User Accounts and Roles
  • Centralized User and Privilege Management
  • Data Encryption

 

Authentication Through Database Links

Database links are either private or public, and either authenticated or non-authenticated. You create public links by specifying the PUBLIC keyword in the link creation statement.

 

Authentication Without Passwords

When using a connected user or current user database link, you can use an external authentication source such as Kerberos to obtain end-to-end security.

In end-to-end authentication, credentials are passed from server to server and can be authenticated by a database server belonging to the same domain.

For example, if jane is authenticated externally on a local database, and wants to use a connected user link to connect as herself to a remote database, the local server passes the security ticket to the remote database.

Supporting User Accounts and Roles

In a distributed database system, you must carefully plan the user accounts and roles that are necessary to support applications using the system. Note that:

  • The user accounts necessary to establish server-to-server connections must be available in all databases of the distributed database system.
  • The roles necessary to make available application privileges to distributed database application users must be present in all databases of the distributed database system.

As you create the database links for the nodes in a distributed database system, determine which user accounts and roles each site needs to support server-to-server connections that use the links.

In a distributed environment, users typically require access to many network services. When you must configure separate authentications for each user to access each network service, security administration can become unwieldy, especially for large systems.

 

Centralized User and Privilege Management

The database provides different ways for you to manage the users and privileges involved in a distributed system. For example, you have these options:

  • Enterprise user management. You can create global users who are authenticated through SSL or by using passwords, then manage these users and their privileges in a directory through an independent enterprise directory service.
  • Network authentication service. This common technique simplifies security management for distributed environments. You can use the Oracle Advanced Security option to enhance Oracle Net and the security of an Oracle Database distributed database system. Windows NT native authentication is an example of a non-Oracle authentication solution.

Schema-Dependent Global Users

One option for centralizing user and privilege management is to create the following:

  • A global user in a centralized directory
  • A user in every database that the global user must connect to

For example, you can create a global user called fred with the following SQL statement:

CREATE USER fred IDENTIFIED GLOBALLY AS 'CN=fred adams,O=Oracle,C=England';

This solution allows a single global user to be authenticated by a centralized directory.

The schema-dependent global user solution has the consequence that you must create a user called fred on every database that this user must access. Because most users need permission to access an application schema but do not need their own schemas, the creation of a separate account in each database for every global user creates significant overhead. Because of this problem, the database also supports schema-independent users, which are global users that can access a single, generic schema in every database.

Schema-Independent Global Users

The database supports functionality that allows a global user to be centrally managed by an enterprise directory service. Users who are managed in the directory are called enterprise users. This directory contains information about:

  • Which databases in a distributed system an enterprise user can access
  • Which role on each database an enterprise user can use
  • Which schema on each database an enterprise user can connect to

The administrator of each database is not required to create a global user account for each enterprise user on each database to which the enterprise user needs to connect. Instead, multiple enterprise users can connect to the same database schema, called a shared schema.

Data Encryption

The Oracle Advanced Security option also enables Oracle Net and related products to use network data encryption and checksumming so that data cannot be read or altered. It protects data from unauthorized viewing by using the RSA Data Security RC4 or the Data Encryption Standard (DES) encryption algorithm.

To ensure that data has not been modified, deleted, or replayed during transmission, the security services of the Oracle Advanced Security option can generate a cryptographically secure message digest and include it with each packet sent across the network.

Auditing Database Links

You must always perform auditing operations locally. That is, if a user acts in a local database and accesses a remote database through a database link, the local actions are audited in the local database, and the remote actions are audited in the remote database, provided appropriate audit options are set in the respective databases.

The remote database cannot determine whether a successful connect request and subsequent SQL statements come from another server or from a locally connected client. For example, assume the following:

  • Fixed user link hq.acme.com connects local user jane to the remote hq database as remote user scott.
  • User scott is audited on the remote database.

Actions performed during the remote database session are audited as if scott were connected locally to hq and performing the same actions there. You must set audit options in the remote database to capture the actions of the username embedded in the link (in this case, scott on the hq database) if the desired effect is to audit what jane is doing in the remote database.

Administration Tools

The database administrator has several choices for tools to use when managing an Oracle Database distributed database system:

  • Enterprise Manager
  • Third-Party Administration Tools
  • SNMP Support

Enterprise Manager

Enterprise Manager is the Oracle Database administration tool that provides a graphical user interface (GUI). Enterprise Manager provides administrative functionality for distributed databases through an easy-to-use interface. You can use Enterprise Manager to:

  • Administer multiple databases. You can use Enterprise Manager to administer a single database or to simultaneously administer multiple databases.
  • Centralize database administration tasks. You can administer both local and remote databases running on any Oracle Database platform in any location worldwide. In addition, these Oracle Database platforms can be connected by any network protocols supported by Oracle Net.
  • Dynamically execute SQL, PL/SQL, and Enterprise Manager commands. You can use Enterprise Manager to enter, edit, and execute statements. Enterprise Manager also maintains a history of statements executed.

Thus, you can reexecute statements without retyping them, a particularly useful feature if you need to execute lengthy statements repeatedly in a distributed database system.

  • Manage security features such as global users, global roles, and the enterprise directory service.

Third-Party Administration Tools

Currently more than 60 companies produce more than 150 products that help manage Oracle Databases and networks, providing a truly open environment.

SNMP Support

Besides its network administration capabilities, Oracle Simple Network Management Protocol (SNMP) support allows an Oracle Database server to be located and queried by any SNMP-based network management system. SNMP is the accepted standard underlying many popular network management systems such as:

  • HP OpenView
  • Digital POLYCENTER Manager on NetView
  • IBM NetView/6000
  • Novell NetWare Management System
  • SunSoft SunNet Manager

 
