The issue of avoiding duplicate records in relational databases is one of the most common issues or problems. The primary mechanism to prevent that are the primary keys, however, in many databases or tables, primary keys are not defined. We are not going to discuss if not having a primary key is unacceptable or acceptable under the specific circumstance. We know that occurs, especially on data warehouses schemas. Then, in this environment where we don’t have primary keys or unique constraints when we populate tables, we need a way to insert data and be sure that we are not entering duplicate records.
I have seen many times, in different database engines (like PostgreSQL or Oracle) the following kind of insert, with a NOT EXISTS predicate to avoid duplicates. However, as you can see in the following TeamSQL print-screen, it doesn’t work.
As you can see at the bottom of the image, the problem has inserted two times the same record. This issue happened because the subquery to detect duplicates, works against the original table, and not consider the records being inserted by the current insert. This is a definite limitation, that occurs in many database engines. So, if the insert sends two times the same record, the duplication control, can’t detect it.
How to solve it? We need to avoid duplicates in the SELECT query because if our select try to insert two times the same record, our duplication control can’t detect it. By Fortune, avoiding duplicates in the query is not complex, we can do it by using the DISTINCT or the GROUP BY clauses. Let’s see:
Let’s execute the first insert (with the DISTINCT) on PostgreSQL database engine. As you can see we don’t have duplicates because we added the DISTINCT clause to the query.
Let’s execute the second insert (with the GROUP BY) on an Oracle database engine and see that we don’t produce duplicates too.
In this article, we first see a common technique to avoid duplication, which consists on a subquery to detect (and reject it) if the row to be inserted is a previously existing row in the target table. This technique works, however, if the select used to populate the table returns duplicated records, these rows will be inserted, because the subquery doesn’t detect duplication produced in the select. Then we added a “group by” or a “distinct” clause in the select used to populate rows, to avoid duplication of records on the select. With this addition, the technique is completed, and we can be sure we will not insert duplicate rows on the target table.