Kate text editor (EOS KDE), can I find duplicate lines, or do I need to download another program?

Kresimir · July 28, 2023, 7:35pm

You can! Kate has a very powerful advanced feature to filter the selection through a shell pipe. So you can use any combination of command line utilities from within Kate, without actually using a terminal.

TL;DR: use method 3.

Method 1.

If the order of lines is not important, you can just select the whole document and select “Filter Through Command…” in the “Tools” menu (default keybinding is Ctrl+\). There, in the input box, you type in this command:

sort | uniq

This pipes your selection through the sort command, and its output through the uniq command: everything is sorted first, then all duplicates are removed, and you’re done. The catch is that the lines are now sorted, so the original order is lost.

Example

Suppose you start with this file

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Select everything, filter through command sort | uniq, you end up with

Bill
Bob
Dinkelstein
Jebediah
Linus
Mortimer
Steve
Valentina

They are all unique, but sorted alphabetically (notice, e.g. Dinkelstein now comes before Jebediah). Often, this is good enough, but sometimes you want to preserve the order, so this method will not work…
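If you’d rather try the pipeline on a scratch copy before running it on a real document, the same command works in any terminal. This is only a sketch, assuming a POSIX shell; the names and result variables are made up for the illustration and are not something Kate provides.

```shell
# Stand-in for the text you would select in Kate (hypothetical variable).
names='Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina'

# Same pipeline as typed into "Filter Through Command...".
result=$(printf '%s\n' "$names" | sort | uniq)

# Print the deduplicated (and sorted) lines.
printf '%s\n' "$result"
```

The exact sort order can depend on your locale settings, but the duplicates are removed either way.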

Method 2.

If you just want to find and count the duplicate lines, do not select anything, and just use filter through command with this:

sort | uniq -d -c

This will append the duplicated lines, each prefixed with its count, to the end of the file, so you can see which lines are duplicated and how many times each one occurs.

Example

Suppose you start with the same input:

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Do not select anything, and just filter through command sort | uniq -d -c, you’ll end up with this appended to the end of the file:

      2 Bill
      2 Bob
      2 Valentina

This tells you what the duplicates are.
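Combining -d and -c works with GNU uniq, which is the usual one on Linux. As far as I know, POSIX lists these options as alternatives rather than a combination, so if your uniq refuses them together, a roughly equivalent sketch is to count every line and let awk keep only the rows whose count is above 1. The input and dupes variables here are just placeholders for the example.

```shell
# Small stand-in for the document contents (hypothetical variable).
input='Bill
Bob
Bill
Steve
Bob'

# uniq -c prefixes every distinct line with its count;
# awk then keeps only rows whose first field (the count) exceeds 1.
dupes=$(printf '%s\n' "$input" | sort | uniq -c | awk '$1 > 1')

printf '%s\n' "$dupes"
```

The width of the padding before each count may differ between implementations, so don’t rely on the exact spacing.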

Method 3.

There is a much better way, which you should probably use. I’m mentioning it last because understanding it requires learning the AWK programming language, which, although not very difficult, is certainly not trivial. Of course, you can use it without understanding it, but that’s fairly limiting.

Using it is straightforward: select everything, go to “Tools” → “Filter Through Command…”, and use this command:

awk '!seen[$0]++'

Example

Again same input:

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Select everything, filter through command awk '!seen[$0]++', you end up with

Bill
Bob
Jebediah
Valentina
Mortimer
Linus
Dinkelstein
Steve

They are all unique, and the original order is preserved: the first occurrence of each line is kept and later repeats are dropped.

I’m not going to explain how it works; that’s beyond the scope of this answer. A good book on the subject is The AWK Programming Language, written by the creators of the language. If you were so inclined, I’m sure you could find a PDF of it online somewhere, though I am legally obliged to say that I strongly disavow doing anything illegal and would never suggest it…
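Still, for a rough idea of what the one-liner is doing before you pick up the book, here is a longer sketch that should behave the same way. It is not a proper explanation of AWK, just the same logic spelled out. The array name seen is arbitrary, $0 is the whole current line, and dedupe is a hypothetical shell function name I made up for this example.

```shell
# Spelled-out form of: awk '!seen[$0]++'
dedupe() {
    awk '{
        if (seen[$0] == 0) {   # this exact line has not appeared before
            print $0
        }
        seen[$0]++             # count this occurrence of the line
    }'
}

# Try it on a few lines with repeats.
printf 'Bill\nBob\nBill\nLinus\nBob\n' | dedupe
```

The short version relies on two AWK habits: an array entry that was never set counts as 0, and a pattern with no action prints the line when the pattern is true.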

If you take the time to learn the command line utilities, you will find tons of great uses for Kate’s filter through command feature.

If you do not know what pipes are, you can watch this video: https://www.youtube.com/watch?v=XvDZLjaCJuw. Here the legendary Brian Kernighan explains pipes (and later, Lorinda Cherry demonstrates the same thing). Also, notice how almost everyone had an awesome Unix beard. Your skill at the command line is roughly proportional to the length of your beard, that’s just a fact.

Here is another video, of the same man, explaining the same thing, some 40 years later: https://www.youtube.com/watch?v=bKzonnwoR2I. (No demonstration, unfortunately).

Some less important info:

Fun fact: the name of the AWK programming language stands for “Aho - Weinberger - Kernighan”, the original authors of the language. If you’ve watched the first video above, you’ve already met Alfred Aho and Brian Kernighan.

Why does the input need to be sorted before piping it into the uniq utility? It’s a limitation of uniq: it only detects duplicates in adjacent lines. This allows it to be very fast and memory efficient. For more info, read man uniq.
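You can see the adjacent-lines limitation for yourself with a tiny test in a terminal. A minimal sketch; the unsorted and sorted variable names are only for the illustration.

```shell
# The two "Bill" lines are separated by "Bob", so uniq alone keeps both.
unsorted=$(printf 'Bill\nBob\nBill\n' | uniq)

# After sorting, the two "Bill" lines are adjacent and collapse into one.
sorted=$(printf 'Bill\nBob\nBill\n' | sort | uniq)

printf 'uniq only:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
```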

Also, instead of sort | uniq you can write sort -u, which does the same thing. Here it doesn’t matter, because you’re only running it once, but if you’re writing a shell script, you should spawn as few additional processes as possible (ideally zero).
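If you want to convince yourself that the two forms agree on plain input like the examples above, a quick comparison in a terminal will do. This is just a sketch with made-up variable names.

```shell
# Two processes: sort, then uniq.
two_step=$(printf 'Bob\nBill\nBob\nSteve\n' | sort | uniq)

# One process: sort removes duplicates itself.
one_step=$(printf 'Bob\nBill\nBob\nSteve\n' | sort -u)

if [ "$two_step" = "$one_step" ]; then
    echo "same output"
else
    echo "outputs differ"
fi
```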