MySQL Replication is widely and we need to have strong replication monitoring for these things
1) To monitor if the replication is working.
2) To check the latency between the master and slave .
3) TO check the consistency between the master and slave , as sometime due to manual or master server crash , master and slave may go out of sync.
lets see how we can set up monitoring for each of the scenarios.
MySQL replication status on the can be checked via
mysql> show slave status \G;
output would be like
Slave_IO_State: Waiting for master to send event
Master_Host: master
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000542
Read_Master_Log_Pos: 231260599
Relay_Log_File: relaylog.000496
Relay_Log_Pos: 231260744
Relay_Master_Log_File: mysql-bin.000542
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 231260599
Relay_Log_Space: 231260935
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2013
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
more details on this can be found onhttp://dev.mysql.com/doc/refman/5.1/en/show-slave-status.html
Case 1) MySQL Replication is working
As we know that MySQL Replication uses two threads Slave_IO and Slave_SQL , slave_IO is responsible for reading the master binary log and writing events from master to the relay log on the slave and slave_SQL is responsible for executing the events from the relay log on the slave .so for MySQL Replication to work both the threads must be running , to check that from the show slave status \G the following values must be Slave_IO_Running: Yes and Slave_SQL_Running: Yes
we can check the variables using the shell or script or Nagios to monitor.
Case 2) To check the latency between the master and slave
MySQL Replication is real time most of the time , but because of many reasons( Slave heavily loaded , IO issue on slave) slave may not be able to catch up with the master.
I have seen many people using the “Seconds_Behind_Master” from the output of the show slave status , it works most of the time but not always .That value is comparing the slave’s current time with the time on the master when it executed the currently replicating SQL statement. If the slave is low-volume (or not properly slaving) you can end up with misleading information or even a false sense of security.Consider another case when you have more level of replication example A is master server B is slave of A and Server C which is slave of B , if there is some issue in replication between A and B , the “Seconds_Behind_Master” on C will still be showing as 0 but in actual the replication is broken in the sense that C is not getting the latest data from A.
The correct way of monitoring the replication is using
http://www.maatkit.org/doc/mk-heartbeat.html
how to use the mk-heartbeat
on master download the script from
wget http://www.maatkit.org/get/mk-heartbeat
make it executable
on master(192.168.2.80)
create table heartbeat on Database heart
CREATE TABLE heartbeat (
id int NOT NULL PRIMARY KEY,
ts datetime NOT NULL
);
It needs to have at least one row
INSERT INTO heartbeat (id) VALUES (1);
now run the script and make it a daemon
./mk-heartbeat -D heart –table heartbeat -u heartbeat -p XXXXXXXXX –update -h 192.168.2.80
on slave(192.168.2.82) download the script from
wget http://www.maatkit.org/get/mk-heartbeat
make it executable
./mk-heartbeat -D heart –table heartbeat -u heartbeat_slave -p XXXXXXXXX –monitor -h 192.168.2.82
the output would be something like
1s [ 0.02s, 0.00s, 0.00s ]
1s [ 0.03s, 0.01s, 0.00s ]
1s [ 0.05s, 0.01s, 0.00s ]
0s [ 0.05s, 0.01s, 0.00s ]
0s [ 0.05s, 0.01s, 0.00s ]
0s [ 0.05s, 0.01s, 0.00s ]
0s [ 0.05s, 0.01s, 0.00s ]
1s [ 0.07s, 0.01s, 0.00s ]
the output will tell if mysql replication is having any lag or not.
case 3) Inconsistency between Mater and Slave
Some time because of manual error or master crash or unclean shutdown the master and slave may be out of sync, its very important to check that and take corrective action .
if not detected on time slave might go completely out of sync and might need to be set completely.
how do we check that master and slave are completely sync.
we can gr8 tool mk-table-checksum from maatkit http://www.maatkit.org/doc/mk-table-checksum.html
lets see how we can use the mk-table-checksum
download
wget http://www.maatkit.org/get/mk-table-checksum
make it executable
create table checksum on the master 192.168.2.10
on database Test
CREATE TABLE checksum (
db char(64) NOT NULL,
tbl char(64) NOT NULL,
chunk int NOT NULL,
boundaries char(100) NOT NULL,
this_crc char(40) NOT NULL,
this_cnt int NOT NULL,
master_crc char(40) NULL,
master_cnt int NULL,
ts timestamp NOT NULL,
PRIMARY KEY (db, tbl, chunk)
);
for checking we will create a test table
create table testreplication( i int , b varchar(100))
we will insert a row
insert into testreplication
select 1,’name’
–replicate=test.checksum will replicate the checksum run on master(192.168.2.10) to slave (192.168.2.12)
–tables test.testreplication will check only the table test.testreplication
./mk-table-checksum -u test -p XXXXXXX –replicate=test.checksum –tables test.testreplication 192.168.2.10
When we run it we get the output on command line
DATABASE TABLE CHUNK HOST ENGINE COUNT CHECKSUM TIME WAIT STAT LAG
test testreplication 0 192.168.2.10 MyISAM 1 1e5504e6 0 NULL NULL NULL
we now log in to the slave (192.168.2.12) and on the database test we run this query
SELECT db, tbl, chunk, this_cnt-master_cnt AS cnt_diff,
this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc)
AS crc_diff
FROM checksum
WHERE master_cnt <> this_cnt OR master_crc <> this_crc
OR ISNULL(master_crc) <> ISNULL(this_crc);
out put will be blank as the replication is working properly
db tbl chunk cnt_diff crc_diff
now we remove the row directly from the slave (192.168.2.12)
delete from testreplication ;
we again run the
./mk-table-checksum -u test -p XXXXXXX –replicate=test.checksum –tables test.testreplication app146
output on command line
DATABASE TABLE CHUNK HOST ENGINE COUNT CHECKSUM TIME WAIT STAT LAG
test testreplication 0 192.168.2.10 MyISAM 1 1e5504e6 0 NULL NULL NULL
when we run the query to check the result
SELECT db, tbl, chunk, this_cnt-master_cnt AS cnt_diff,
this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc)
AS crc_diff
FROM checksum
WHERE master_cnt <> this_cnt OR master_crc <> this_crc
OR ISNULL(master_crc) <> ISNULL(this_crc);
now we get
db tbl chunk cnt_diff crc_diff
test testreplication 0 -1 1
as now the master and slave data differs , the output is giving table name and the crc_diff , one very important thing to understand that this finds the tables which are not in sync even when the new data is getting logged into the master DB as the checksum calculating command on the master is replicated exactly on the slave and we know that the mysql replication is syncrnous so on the slave when this command to calculate the checksum is executed its considering the data as it was on master.using this we can find out all the tables which are not in sync in master and slave and take corrective action.
in next post I would be writing in detail how to find the difference on the individual tables between master and slave and how to rectify that and how to automate that process .
Thanks
Pankaj Joshi