Distributed DBMS – Features, Needs and Architecture

Distributed DBMS – The DBMS that manages a distributed database is called a distributed DBMS (DDBMS).

Distributed database

Distributed database definition

A distributed database is a collection of logically related information that is spread over the sites of a computer network.

Distributed DBMS definition

A Distributed Database Management System contains a single logical database that is divided into a number of fragments.

Each fragment of the database is stored on one or more computers, each under the control of a separate DBMS, and the computers are connected by a communications network.

Each site can independently process user requests that require access only to local data, which means that each site of the distributed system has some degree of local autonomy. Each site can also take part in processing data that is stored on other computers within the network.

Users access the distributed data through software applications.

These applications are of two types −

Applications that do not need data from other sites (local applications);
Applications that need data from other sites (global applications).
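
The distinction between local and global applications can be illustrated with a minimal Python sketch. The catalog, the site names, and the function `classify_application` are hypothetical names introduced only for this illustration; they do not belong to any particular DDBMS.

```python
# Hypothetical catalog: the site at which each table is stored.
CATALOG = {
    "customers": "site_a",
    "orders": "site_a",
    "suppliers": "site_b",
}

def classify_application(home_site, tables_needed):
    """Return 'local' if every table lives at home_site, else 'global'."""
    sites = {CATALOG[table] for table in tables_needed}
    return "local" if sites <= {home_site} else "global"

print(classify_application("site_a", ["customers", "orders"]))   # all data at site_a
print(classify_application("site_a", ["orders", "suppliers"]))   # needs data from site_b
```

An application that runs at site_a and touches only tables stored there is local; as soon as it needs a table stored at another site, it is global.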

Characteristics of Distributed DBMS

A distributed DBMS has the following characteristics:

It manages a collection of logically related shared data.
The data is split into a number of fragments, which may be horizontal or vertical.
Fragments may be replicated at different sites.
Fragments or replicas are allocated to sites.
The sites are linked by a communications network.
The data at each site is under the control of a DBMS.
The DBMS at each site can handle local applications autonomously.
Each DBMS participates in at least one global application.

Needs of Distributed Database

There are several reasons for developing distributed databases. Some of the common reasons are listed below −

Organizational needs
Economic reasons (storing and managing all data centrally is costly, particularly when the organization's operations are widely distributed)
Incremental growth
Reduced communication load
Performance consideration
Reliability and availability

Distributed DBMS Architectures

Distributed DBMS architectures are generally developed depending on three parameters −

Distribution − It states the physical distribution of data across different sites.
Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components, and databases.

Architectural Models of Distributed DBMS

Some of the common architectural models of Distributed DBMS are −

Client-Server Architecture for Distributed DBMS
Peer-to-Peer Architecture for Distributed DBMS
Multi-DBMS Architecture for Distributed DBMS

Client-Server Architecture for Distributed DBMS

This is a two-level architecture in which the functionality is divided between servers and clients.

The server functions primarily encompass data management, query processing, optimization, and transaction management. The client functions mainly comprise the user interface.

However, clients also perform some functions such as consistency checking and transaction management.

The two client-server architectures are −

Single Server Multiple Client
Multiple Server Multiple Client

Peer-to-Peer Architecture for Distributed DBMS

In these systems, each peer acts as both a client and a server for providing database services. The peers share their resources with other peers and coordinate their activities.

This architecture generally has four levels of schemas −

Global Conceptual Schema − Depicts the global logical view of data.
Local Conceptual Schema − Depicts logical data organization at each site.
Local Internal Schema − Depicts physical data organization at each site.
External Schema − Depicts the user’s view of data.

Multi-DBMS Architecture for Distributed DBMS

This is an integrated database system formed by a collection of two or more autonomous database systems.

A multi-DBMS can be expressed through six levels of schemas −

Multi-database View Level − Depicts multiple user views, each comprising a subset of the integrated distributed database.
Multi-database Conceptual Level − Depicts the integrated multi-database, which comprises global logical multi-database structure definitions.
Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to local data mapping.
Local database View Level − Depicts a public view of local data.
Local database Conceptual Level − Depicts local data organization at each site.
Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for a multi-DBMS −

A model with a multi-database conceptual level.
A model without a multi-database conceptual level.

Levels of Distribution Transparency in Distributed Database

Distribution transparency is an important property of a distributed database.

Due to this property, the internal details of the distribution are hidden from the users.

The three dimensions of distribution transparency are −

Location transparency
Fragmentation transparency
Replication transparency

Location Transparency

Location transparency ensures that the user can query any table or fragment of a table as if it were stored locally at the user’s site.

The end user should be completely unaware that the table or its fragments are stored at remote sites in the distributed database system.

The address of the remote site(s) and the access mechanisms are completely hidden.

To provide location transparency, the DDBMS should have access to an up-to-date and accurate data dictionary and DDBMS directory that contain the locations of the data.
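
A minimal Python sketch of this directory lookup follows. The names `DIRECTORY`, `SITES`, and `query_table`, and the sample rows, are assumptions made for illustration; the dictionary access merely stands in for a network request to a remote DBMS.

```python
# Hypothetical DDBMS directory: table name -> site that stores it.
DIRECTORY = {"employee": "site_mumbai", "project": "site_delhi"}

# Hypothetical per-site storage, standing in for remote DBMS instances.
SITES = {
    "site_mumbai": {"employee": [{"id": 1, "name": "Asha"}]},
    "site_delhi": {"project": [{"id": 10, "title": "Billing"}]},
}

def query_table(table):
    """Return all rows of `table`; the caller never names a site."""
    site = DIRECTORY[table]      # location is resolved by the system
    return SITES[site][table]    # stands in for a request to that site

print(query_table("project"))
```

The caller supplies only the table name. If the table moves to another site, only the directory entry changes, which is why the directory has to be kept accurate.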

Fragmentation Transparency

Fragmentation transparency enables users to query any table as if it were unfragmented.

Thus, it hides the fact that the table the user is querying is actually a fragment or a union of several fragments.

It also conceals the fact that the fragments are located at different sites.

This is somewhat similar to SQL views, where the user may not know that they are using a view of a table instead of the table itself.
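
The analogy with SQL views can be made concrete with Python's built-in sqlite3 module. This is only a sketch under simplifying assumptions: both fragments live in one in-memory database rather than at different sites, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student_north (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE student_south (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    INSERT INTO student_north VALUES (1, 'Asha', 'north');
    INSERT INTO student_south VALUES (2, 'Ravi', 'south');
    -- The view presents the two fragments as one unfragmented table.
    CREATE VIEW student AS
        SELECT * FROM student_north
        UNION ALL
        SELECT * FROM student_south;
""")

# The user queries `student` without knowing that it is built from fragments.
rows = conn.execute("SELECT id, name FROM student ORDER BY id").fetchall()
print(rows)
```

In a real distributed database the system itself would map the global table onto its fragments; the view here only imitates that mapping.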

Replication Transparency

Replication transparency ensures that the replication of databases remains hidden from the users.

It enables users to query upon a table as if only a single copy of the table exists.

Replication transparency is associated with concurrency transparency and failure transparency.

Whenever a user updates a data item, the update is reflected in all the copies of the table. However, the user should not be aware of this operation. This is concurrency transparency.
Also, if a site fails, the user can still proceed with their queries using the replicated copies, without any knowledge of the failure. This is failure transparency.
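
Both behaviours can be sketched with a toy replica manager in Python. The class `ReplicatedTable` and its write-all, read-any-available policy are assumptions chosen for illustration, not the protocol of any specific DDBMS. For simplicity, the sketch models a failure only on the read path and omits the recovery a real system needs when a failed site misses updates.

```python
class ReplicatedTable:
    """Toy write-all / read-any-available replica manager (illustrative only)."""

    def __init__(self, site_names):
        self.replicas = {name: {} for name in site_names}  # site -> its copy
        self.down = set()                                   # sites marked as failed

    def update(self, key, value):
        # One call from the user; the change is applied to every copy.
        for copy in self.replicas.values():
            copy[key] = value

    def read(self, key):
        # Serve the read from the first site that is up; failures stay hidden.
        for name, copy in self.replicas.items():
            if name not in self.down:
                return copy[key]
        raise RuntimeError("no replica available")


table = ReplicatedTable(["site_1", "site_2"])
table.update("balance", 250)
table.down.add("site_1")          # simulate failure of the first site
print(table.read("balance"))      # answered by site_2
```

The caller issues a single `update` and a single `read`; which copies are written and which site answers is decided inside the class.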

Combination of Transparencies

In any distributed database system, the designer should ensure that all the stated transparencies are maintained to a considerable extent.

The designer may choose to fragment tables, replicate them, and store them at different sites, with all of this hidden from the end user.

However, achieving complete distribution transparency is difficult and requires considerable design effort.

Replication – Data Replication

Data replication is the process of storing separate copies of the database at two or more sites.

It is a popular fault tolerance technique in distributed databases.

Advantages of Data Replication

Reliability − In case of failure of any site, the database system continues to work since a copy is available at another site(s).
Reduction in Network Load − Since local copies of data are available, query processing can be done with reduced network usage, particularly during peak hours. Data updating can be done during off-peak hours.
Quicker Response − Availability of local copies of data ensures quick query processing and consequently quick response time.
Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication

Increased Storage Requirements − Maintaining multiple copies of data is associated with increased storage costs. The storage space required is a multiple of the storage required for a centralized system.
Increased Cost and Complexity of Data Updating − Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
Undesirable Application-Database Coupling − If complex update mechanisms are not used, removing data inconsistency requires complex coordination at the application level. This results in undesirable coupling between the applications and the database.

Distributed database design – Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. These smaller tables are called fragments.

Fragmentation can be of three types:

Horizontal
Vertical
Hybrid (a combination of horizontal and vertical)

Horizontal fragmentation can further be classified into two techniques: primary horizontal fragmentation and derived horizontal fragmentation.
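
The difference between the two techniques can be sketched in Python. In primary horizontal fragmentation a table is split by a condition on its own fields; in derived horizontal fragmentation a table is split according to the fragments of a related table. The department and employee tables below are hypothetical examples.

```python
# Primary horizontal fragmentation: departments are split by their own `city`.
dept_fragments = {
    "pune": [{"dept_id": 1, "city": "Pune"}],
    "delhi": [{"dept_id": 2, "city": "Delhi"}],
}

employee = [
    {"emp_id": 10, "dept_id": 1},
    {"emp_id": 11, "dept_id": 2},
    {"emp_id": 12, "dept_id": 1},
]

# Derived horizontal fragmentation: each employee row follows the fragment
# of the department it references through `dept_id`.
employee_fragments = {
    name: [e for e in employee if e["dept_id"] in {d["dept_id"] for d in depts}]
    for name, depts in dept_fragments.items()
}

print(employee_fragments["pune"])
print(employee_fragments["delhi"])
```

With this layout, an employee row and the department it belongs to end up in matching fragments, so a join between them can be evaluated fragment by fragment.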

Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called reconstructiveness.

Advantages of Fragmentation

Since data is stored close to the site of usage, the efficiency of the database system is increased.
Local query optimization techniques are sufficient for most queries since data is locally available.
Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.

Disadvantages of Fragmentation

When data from different fragments is required, access may be very slow.
In the case of recursive fragmentation, reconstruction requires expensive techniques.
Lack of back-up copies of data in different sites may render the database ineffective in case of failure of a site.

Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments.

To maintain reconstructiveness, each fragment should contain the primary key field(s) of the table, so that the original table can be rebuilt by joining the fragments on the key.
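
A minimal Python sketch of vertical fragmentation and its reconstruction follows. The functions `vertical_fragment` and `reconstruct` and the sample `student` table are hypothetical, and rows are modelled as plain dictionaries.

```python
def vertical_fragment(rows, key, column_groups):
    """Project rows onto each column group, always keeping the key column."""
    return [
        [{c: row[c] for c in [key, *group]} for row in rows]
        for group in column_groups
    ]

def reconstruct(fragments, key):
    """Rebuild the original rows by joining the fragments on the key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

student = [
    {"id": 1, "name": "Asha", "fees": 500},
    {"id": 2, "name": "Ravi", "fees": 700},
]
fragments = vertical_fragment(student, "id", [["name"], ["fees"]])
print(fragments[0])   # id and name only
print(fragments[1])   # id and fees only
print(reconstruct(fragments, "id") == student)
```

Because every fragment carries `id`, the join in `reconstruct` can match the pieces of each row. Without the key there would be no reliable way to pair a name with its fees.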

Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields.

Horizontal fragmentation should also conform to the rule of reconstructiveness: it must be possible to rebuild the original table from the fragments.

Each horizontal fragment must have all columns of the original base table.
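
Horizontal fragmentation can be sketched in Python as follows. The function `horizontal_fragment`, the predicates, and the sample `student` rows are assumptions made for the example.

```python
def horizontal_fragment(rows, predicates):
    """Assign each row to the first fragment whose predicate it satisfies."""
    fragments = {name: [] for name in predicates}
    for row in rows:
        for name, predicate in predicates.items():
            if predicate(row):
                fragments[name].append(row)
                break
        else:
            # A row that fits no fragment would be lost on reconstruction.
            raise ValueError(f"row matches no fragment: {row}")
    return fragments

student = [
    {"id": 1, "name": "Asha", "course": "CS"},
    {"id": 2, "name": "Ravi", "course": "EE"},
    {"id": 3, "name": "Meera", "course": "CS"},
]
fragments = horizontal_fragment(student, {
    "cs": lambda r: r["course"] == "CS",
    "other": lambda r: r["course"] != "CS",
})

# Reconstruction is the union of the fragments (sorted here for comparison).
rebuilt = sorted((r for f in fragments.values() for r in f), key=lambda r: r["id"])
print(rebuilt == student)
```

Every row keeps all of its columns, and each row lands in exactly one fragment, so the union of the fragments gives back the original table.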

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used.

This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information.

However, reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways −

At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.
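
The first alternative (horizontal fragments first, then vertical fragments of one of them) can be sketched in Python. The sample `student` table and the choice of which fragment to split further are illustrative assumptions.

```python
student = [
    {"id": 1, "name": "Asha", "course": "CS", "fees": 500},
    {"id": 2, "name": "Ravi", "course": "EE", "fees": 700},
    {"id": 3, "name": "Meera", "course": "CS", "fees": 650},
]

# Step 1: horizontal fragments, selected by the value of `course`.
cs_rows = [r for r in student if r["course"] == "CS"]
other_rows = [r for r in student if r["course"] != "CS"]

# Step 2: vertical fragments of the CS fragment only; both keep the key `id`.
cs_profile = [{"id": r["id"], "name": r["name"], "course": r["course"]} for r in cs_rows]
cs_fees = [{"id": r["id"], "fees": r["fees"]} for r in cs_rows]

# Reconstruction: join the vertical fragments on `id`, then union with the rest.
fees_by_id = {r["id"]: r["fees"] for r in cs_fees}
cs_rebuilt = [{**r, "fees": fees_by_id[r["id"]]} for r in cs_profile]
rebuilt = sorted(cs_rebuilt + other_rows, key=lambda r: r["id"])
print(rebuilt == student)
```

Reconstruction needs a join followed by a union, which hints at why rebuilding the original table is more expensive here than with a single kind of fragmentation.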

Database Control in Distributed DBMS

Database control refers to the task of enforcing regulations so as to provide correct data to authentic users and applications of a database.

To ensure that correct data is available to users, all data should conform to the integrity constraints defined in the database.

In addition, data should be shielded from unauthorized users so as to maintain the security and privacy of the database. Database control is one of the primary tasks of the database administrator (DBA).

The three dimensions of database control are −

Authentication
Access rights
Integrity constraints

Authentication

In a distributed database system, authentication is the process through which only legitimate users can gain access to the data resources.

Authentication can be enforced at two levels −

Controlling Access to the Client Computer − At this level, user access is restricted at login to the client computer that provides the user interface to the database server. The most common method is a username/password combination. However, more sophisticated methods such as biometric authentication may be used for high-security data.
Controlling Access to the Database Software − At this level, the database software or administrator assigns credentials to the user. The user gains access to the database using these credentials. One method is to create a login account within the database server.
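
A username/password login account can be sketched in Python with the standard hashlib, hmac, and os modules. The account store `ACCOUNTS`, the function names, and the iteration count are assumptions made for the example; real database servers manage login accounts with their own mechanisms.

```python
import hashlib
import hmac
import os

ACCOUNTS = {}  # hypothetical account store: username -> (salt, password hash)

def _hash(password, salt):
    # Derive a hash from the password so that the password itself is not stored.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def create_login(username, password):
    salt = os.urandom(16)
    ACCOUNTS[username] = (salt, _hash(password, salt))

def authenticate(username, password):
    if username not in ACCOUNTS:
        return False
    salt, stored = ACCOUNTS[username]
    # Constant-time comparison of the stored and the recomputed hash.
    return hmac.compare_digest(stored, _hash(password, salt))

create_login("asha", "s3cret")
print(authenticate("asha", "s3cret"))
print(authenticate("asha", "wrong"))
```

Only a user who presents the credentials assigned at account creation is let through; unknown usernames and wrong passwords are both rejected.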

Access Rights

A user’s access rights refer to the privileges that the user is given regarding DBMS operations such as the rights to create a table, drop a table, add/delete/update tuples in a table or query upon the table.

In distributed environments, there are a large number of tables and users, so it is not feasible to assign access rights to users individually.

Instead, the DDBMS defines certain roles. A role is a construct with certain privileges within a database system.

Once the different roles are defined, each individual user is assigned one of these roles.

Often a hierarchy of roles is defined according to the organization’s hierarchy of authority and responsibility.
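
Role-based access rights can be sketched in Python as two lookups. The role names, privileges, and users below are hypothetical.

```python
# Hypothetical roles and the privileges each one carries.
ROLE_PRIVILEGES = {
    "clerk": {"select"},
    "manager": {"select", "insert", "update"},
    "dba": {"select", "insert", "update", "delete", "create", "drop"},
}

# Each user is assigned a role rather than individual privileges.
USER_ROLES = {"asha": "manager", "ravi": "clerk"}

def is_allowed(user, operation):
    """Check an operation against the privileges of the user's role."""
    role = USER_ROLES.get(user)
    return operation in ROLE_PRIVILEGES.get(role, set())

print(is_allowed("asha", "update"))
print(is_allowed("ravi", "update"))
```

Changing what managers may do means editing one entry in `ROLE_PRIVILEGES` instead of updating every manager's account, which is the practical benefit of roles when there are many users and tables.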

Translation of Global Queries / Global Query Optimisation, Query Execution and access plan

When a query is submitted, it is first scanned, parsed, and validated. An internal representation of the query, such as a query tree or a query graph, is then created.

Then alternative execution strategies are devised for retrieving results from the database tables.

The process of choosing the most appropriate execution strategy for query processing is called query optimization.

Query Optimization Issues in DDBMS

In DDBMS, query optimization is a crucial task.

The complexity is high since the number of alternative strategies may increase exponentially due to the following factors −

The presence of a number of fragments.
Distribution of the fragments or tables across various sites.
The speed of communication links.
The disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query processing rather than the best one.

The time to execute a query is the sum of the following −

Time to communicate queries to databases.
Time to execute local query fragments.
Time to assemble data from different sites.
Time to display results to the application.
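
The four components above can be added up to compare candidate execution strategies, as in the following Python sketch. The `Strategy` class, the two candidate strategies, and all timing figures are invented for illustration and are not measurements.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    send_query: float       # time to communicate queries to databases
    local_execution: float  # time to execute local query fragments
    assemble: float         # time to assemble data from different sites
    display: float          # time to display results to the application

    def total_time(self):
        return self.send_query + self.local_execution + self.assemble + self.display

# Hypothetical estimates, in seconds, for two ways of answering the same query.
candidates = [
    Strategy("ship whole tables", 0.25, 0.5, 4.0, 0.25),
    Strategy("filter at each site first", 0.25, 0.75, 1.0, 0.25),
]
best = min(candidates, key=Strategy.total_time)
print(best.name, best.total_time())
```

In this made-up comparison, filtering at each site costs a little more local execution time but far less assembly time, so its total is lower. An optimizer typically evaluates only some of the possible strategies in this way, which is why it settles for a good strategy rather than the best one.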

Query Processing

Query processing is the set of all activities from query placement to the display of the query results.