Tuesday, November 4, 2008

Removing Duplicate Records from a Database with Very Long Fields

I got a case today: far off in Germany there is a 1.1 GB MySQL database with a single table holding a number of very large text and blob fields, and the task was to delete the rows where a certain text field is duplicated. (The table has over a million records; the field to deduplicate averages about 1 KB and carries an index.)
The table structure is as follows:

DROP TABLE IF EXISTS `cert_db`.`certs`;
CREATE TABLE `cert_db`.`certs` (
`id` int(10) unsigned NOT NULL auto_increment,
`ServerName` varchar(128) default NULL,
`SerialNumber` blob NOT NULL,
`Version` int(10) unsigned NOT NULL,
`NotBefore` datetime NOT NULL,
`NotAfter` datetime NOT NULL,
`CertData` blob NOT NULL,
`IPK_DUMP` blob,
`IPK_N` blob,
`IPK_E` blob,
`Signature` blob NOT NULL,
`SigAlgName` varchar(50) default NULL,
`SigAlgOID` varchar(50) default NULL,
`SigAlgParams` blob,
`SubjectDN` text,
`IssuerDN` text,
`CertDump` text,
`SPK_DUMP` blob,
`SPK_N` blob,
`SPK_E` blob,
`ReallyBAD` char(1) default NULL,
`X509Dump` blob NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_CERTDN` USING HASH (`SubjectDN`(383))
) ENGINE=InnoDB AUTO_INCREMENT=4515 DEFAULT CHARSET=gbk;
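
Before touching anything, a quick read-only check gives a feel for the scale of the problem; this is just a sketch against the table above, and the difference between the two counts is the number of rows that would have to go:

select count(*) as total_rows,
       count(distinct SubjectDN) as distinct_dns
from certs;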

My first attempt was the following statement:

delete from certs
where subjectdn in (
select subjectdn from certs
group by
subjectdn
having
count(subjectdn) > 1
)
and id not in (
select min(id) from certs
group by
subjectdn
having
count(subjectdn) > 1
);

The result: it spectacularly brought MySQL down when tested on just 4,000 records locally, presumably because grouping and IN-matching a ~1 KB text column through nested subqueries against the same table is far too expensive.
So I switched to hashing the column with MD5 and deleting in several steps, which worked:

#hash all
DROP TABLE IF EXISTS hash_all;
create table hash_all(
select id,md5(subjectdn) as dnhash from certs
);
#find duplicates (keep the smallest id in each group)
DROP TABLE IF EXISTS hash_dup;
create table hash_dup(
select min(id) as id,dnhash from hash_all
group by
dnhash
having
count(dnhash)>1
);
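#optional tuning (a sketch): with millions of rows, indexing dnhash on both
#helper tables keeps the join in the next step off full table scans;
#the index names here are arbitrary
ALTER TABLE hash_all ADD INDEX idx_hash_all_dnhash (dnhash);
ALTER TABLE hash_dup ADD INDEX idx_hash_dup_dnhash (dnhash);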
#list rows to delete
DROP TABLE IF EXISTS hash_del;
create table hash_del(
select h.id,h.dnhash from hash_all h,hash_dup d
where
h.id!=d.id
and
h.dnhash=d.dnhash
);
#delete
delete c from certs c,hash_del d where c.id=d.id;
#then clean up
DROP TABLE IF EXISTS hash_all;
DROP TABLE IF EXISTS hash_dup;
DROP TABLE IF EXISTS hash_del;
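
As a final sanity check (again only a sketch, run against certs directly), the following should return 0 once the duplicates are gone:

select count(*) as remaining_dup_groups
from (
select count(*) as cnt
from certs
group by md5(SubjectDN)
having count(*) > 1
) t;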
