Tuesday, November 4, 2008

Removing Duplicate Records from a Database with Very Long Fields

I got a case today: far off in Germany there is a 1.1 GB MySQL database with a single table holding a number of very large text and blob fields, and the task was to delete the rows where a certain text field is duplicated. (The table has over a million records; the field to deduplicate averages about 1 KB and carries an index.)
The table structure is as follows:

DROP TABLE IF EXISTS `cert_db`.`certs`;
CREATE TABLE `cert_db`.`certs` (
`id` int(10) unsigned NOT NULL auto_increment,
`ServerName` varchar(128) default NULL,
`SerialNumber` blob NOT NULL,
`Version` int(10) unsigned NOT NULL,
`NotBefore` datetime NOT NULL,
`NotAfter` datetime NOT NULL,
`CertData` blob NOT NULL,
`IPK_DUMP` blob,
`IPK_N` blob,
`IPK_E` blob,
`Signature` blob NOT NULL,
`SigAlgName` varchar(50) default NULL,
`SigAlgOID` varchar(50) default NULL,
`SigAlgParams` blob,
`SubjectDN` text,
`IssuerDN` text,
`CertDump` text,
`SPK_DUMP` blob,
`SPK_N` blob,
`SPK_E` blob,
`ReallyBAD` char(1) default NULL,
`X509Dump` blob NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_CERTDN` USING HASH (`SubjectDN`(383))
) ENGINE=InnoDB AUTO_INCREMENT=4515 DEFAULT CHARSET=gbk;
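
Before touching anything, a quick read-only check gives a feel for the scale of the problem; this is just a sketch against the table above, and the difference between the two counts is the number of rows that would have to go:

select count(*) as total_rows,
       count(distinct SubjectDN) as distinct_dns
from certs;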

My first attempt was the following statement:

delete from certs
where subjectdn in (
select subjectdn from certs
group by
subjectdn
having
count(subjectdn) > 1
)
and id not in (
select min(id) from certs
group by
subjectdn
having
count(subjectdn) > 1
);

The result: it spectacularly brought MySQL down when tested on just 4,000 records locally, presumably because grouping and IN-matching a ~1 KB text column through nested subqueries against the same table is far too expensive.
So I switched to hashing the column with MD5 and deleting in several steps, which worked:

#hash all
DROP TABLE IF EXISTS hash_all;
create table hash_all(
select id,md5(subjectdn) as dnhash from certs
);
#find duplicates (keep the smallest id in each group)
DROP TABLE IF EXISTS hash_dup;
create table hash_dup(
select min(id) as id,dnhash from hash_all
group by
dnhash
having
count(dnhash)>1
);
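#optional tuning (a sketch): with millions of rows, indexing dnhash on both
#helper tables keeps the join in the next step off full table scans;
#the index names here are arbitrary
ALTER TABLE hash_all ADD INDEX idx_hash_all_dnhash (dnhash);
ALTER TABLE hash_dup ADD INDEX idx_hash_dup_dnhash (dnhash);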
#list rows to delete
DROP TABLE IF EXISTS hash_del;
create table hash_del(
select h.id,h.dnhash from hash_all h,hash_dup d
where
h.id!=d.id
and
h.dnhash=d.dnhash
);
#delete
delete c from certs c,hash_del d where c.id=d.id;
#then clean up
DROP TABLE IF EXISTS hash_all;
DROP TABLE IF EXISTS hash_dup;
DROP TABLE IF EXISTS hash_del;
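
As a final sanity check (again only a sketch, run against certs directly), the following should return 0 once the duplicates are gone:

select count(*) as remaining_dup_groups
from (
select count(*) as cnt
from certs
group by md5(SubjectDN)
having count(*) > 1
) t;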
