Kate text editor (EOS KDE), can I find duplicate lines, or do I need to download another program?

nicknick · July 28, 2023, 7:25pm

Hello friends.

I would like to know if it is possible, in EOS KDE, with Kate text editor, to find lines that are duplicate in the same text file.

For example, these 2 links, in a text file opened with Kate, are 2 full lines exactly the same:

https://forum.endeavouros.com/c/arch-based-related-questions/software/15
https://forum.endeavouros.com/c/arch-based-related-questions/software/15

So I need to know when a complete line is repeated.

I am collecting addresses in a text file and need to make sure none of them is duplicated by mistake, since I am copying them by hand with Ctrl+C and Ctrl+V.

If it’s not possible, is there a program that can do this?

Thanks in advance friends!

Kresimir · July 28, 2023, 7:35pm

You can! Kate has a very powerful advanced feature to filter the selection through a shell pipe. So you can use any combination of command line utilities from within Kate, without actually using a terminal.

TL;DR: use method 3.

Method 1.

If the order of lines is not important, you can just select the whole document and select “Filter Through Command…” in the “Tools” menu (default keybinding is Ctrl+\). There, in the input box, you type in this command:

sort | uniq

This pipes your selection through the sort command, and its output through the uniq command, so everything is sorted first and then the duplicates are removed. The catch is that the result comes back sorted, so the original line order is lost.

Example

Suppose you start with this file

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Select everything, filter through command sort | uniq, you end up with

Bill
Bob
Dinkelstein
Jebediah
Linus
Mortimer
Steve
Valentina

They are all unique, but sorted alphabetically (notice, e.g. Dinkelstein now comes before Jebediah). Often, this is good enough, but sometimes you want to preserve the order, so this method will not work…

Method 2.

If you just want to find and count the duplicate lines, do not select anything, and just use filter through command with this:

sort | uniq -d -c

This will append the duplicate lines, each with its count, to the end of the file, so you can see which lines are duplicated and how many times each one occurs.

Example

Suppose you start with the same input:

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Do not select anything, and just filter through command sort | uniq -d -c, you’ll end up with this appended to the end of the file:

      2 Bill
      2 Bob
      2 Valentina

This tells you what the duplicates are.
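If you also want to know where the repeats are, a small AWK variation can print the line number of every repeated occurrence. This is only a sketch run on the sample names from above in a plain shell; the variable names are made up for illustration, and I have not checked how Kate places the output when this is used through "Filter Through Command…".

```shell
# Sample input from the examples above, one name per line.
input=$(printf '%s\n' Bill Bob Jebediah Valentina Mortimer Bill \
    Linus Dinkelstein Bob Steve Valentina)

# seen[$0]++ is 0 (false) the first time a line appears and
# non-zero (true) on every later appearance, so only repeats print.
# NR is the number of the line currently being read.
repeats=$(printf '%s\n' "$input" | awk 'seen[$0]++ { print NR ": " $0 }')

printf '%s\n' "$repeats"
```

For this input it prints 6: Bill, 9: Bob and 11: Valentina, one per line, which are the positions of the second copies.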

Method 3

There is a much better way, which you should probably use. I’m mentioning it last because understanding it requires learning the AWK programming language, which is not very difficult but certainly not trivial. Of course, you can use it without understanding it, but that’s fairly limiting.

It’s very straightforward, just select everything, “Tools” → “Filter Through Command…” and use this command:

awk '!seen[$0]++'

Example

Again same input:

Bill
Bob
Jebediah
Valentina
Mortimer
Bill
Linus
Dinkelstein
Bob
Steve
Valentina

Select everything, filter through command awk '!seen[$0]++', you end up with

Bill
Bob
Jebediah
Valentina
Mortimer
Linus
Dinkelstein
Steve

They are all unique and the order is good.

I’m not going to explain how it works, it’s beyond the scope of this answer. A good book on the subject is called The AWK programming language, written by the creators of the language. If you were so inclined, I’m sure you could find a PDF of it online somewhere, though I am legally obliged to say that I strongly disavow doing anything illegal and would never suggest it…
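That said, for a rough idea of what the one-liner does, it can be spelled out in a longer form. The sketch below is my own unpacking, not taken from the book, and the sample input and variable names are made up; it runs both versions in a shell and keeps their output for comparison.

```shell
# Hypothetical sample input with two repeated names.
input=$(printf '%s\n' Bill Bob Jebediah Bill Bob Steve)

# The one-liner from Method 3.
short_form=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# The same logic written out step by step.
long_form=$(printf '%s\n' "$input" | awk '
{
    # $0 is the whole current line; seen[] is an associative array
    # whose entries start out empty, which AWK treats as 0.
    if (seen[$0] == 0) {
        print $0      # first time this line shows up: keep it
    }
    seen[$0]++        # count this occurrence for later lines
}')

printf '%s\n' "$long_form"
```

Both forms print Bill, Bob, Jebediah, Steve. The short form works because a pattern with no action prints the line whenever the pattern is true, and !seen[$0]++ is true only while the count is still zero.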

If you take the time to learn the command line utilities, you will find tons of great uses for Kate’s filter through command feature.

If you do not know what pipes are, you can watch this video: UNIX: Making Computers Easier To Use -- AT&T Archives film from 1982, Bell Laboratories - YouTube. Here the legendary Brian Kernighan explains pipes (and later, Lorinda Cherry demonstrates the same thing). Also, notice how almost everyone had an awesome Unix beard. Your skill at the command line is roughly proportional to the length of your beard, that’s just a fact.

Here is another video, of the same man, explaining the same thing, some 40 years later: Unix Pipeline (Brian Kernighan) - Computerphile - YouTube. (No demonstration, unfortunately).

Some less important info:

Fun fact: the name of the AWK programming language stands for “Aho - Weinberger - Kernighan”, the surnames of its creators: Alfred Aho, Peter Weinberger and Brian Kernighan. If you’ve watched the first video above, you’ve already met Alfred Aho and Brian Kernighan.

Why does the input need to be sorted first before piping it into the uniq utility? It’s a limitation of uniq: it only detects duplicates in adjacent lines. This allows it to be very fast and memory efficient. For more info, read man uniq.
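A quick way to see the adjacency limitation is to run uniq with and without sort on a tiny input in a shell. The names and variables here are made up for illustration.

```shell
# Hypothetical input: "Bill" appears three times, but the first copy
# is separated from the other two by "Bob".
names=$(printf '%s\n' Bill Bob Bill Bill)

# uniq alone only collapses the adjacent pair at the end.
unsorted_result=$(printf '%s\n' "$names" | uniq)

# Sorting first puts identical lines next to each other,
# so uniq catches every copy.
sorted_result=$(printf '%s\n' "$names" | sort | uniq)

printf 'uniq alone:\n%s\n\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n' "$sorted_result"
```

With uniq alone, Bill still shows up twice (Bill, Bob, Bill); after sorting, only Bill and Bob remain.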

Also, instead of sort | uniq you can write sort -u, it does the same thing (here it doesn’t matter, because you’re just doing it one time, but if you’re writing a shell script, you should be spawning the fewest number of additional processes – ideally zero).
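To convince yourself that the two spellings agree, you can compare them on a small input in a shell. This is just a sketch with made-up sample data and variable names.

```shell
# Hypothetical unsorted input with repeats.
input=$(printf '%s\n' Bob Bill Bob Steve Bill)

# Two processes: sort, then uniq.
two_processes=$(printf '%s\n' "$input" | sort | uniq)

# One process: sort removes the duplicates itself.
one_process=$(printf '%s\n' "$input" | sort -u)

if [ "$two_processes" = "$one_process" ]; then
    printf 'same result:\n%s\n' "$one_process"
fi
```

Both give Bill, Bob, Steve for this input.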

d-air1 · July 29, 2023, 4:56am

If you search for the line in question, wouldn’t it show you how many times that line occurs? Then you would know it is a duplicate. Or is the goal just to automate the process?

Kresimir · July 29, 2023, 5:27am

That’s true, and it works for a couple of lines, but it’s probably the worst way to do it if, for example, you have a thousand lines or so…

d-air1 · July 29, 2023, 5:33am

I understand that. It certainly falls apart if you have a lot of duplicates, but I’m unsure whether that is the case with OP. If he is only working with a few dozen lines, or if there aren’t many different duplicates, this way would certainly be less work.

vazicebon · July 29, 2023, 7:19am

Hello,

Geany has a plugin for this:
https://plugins.geany.org/lineoperations.html

kainonergon · July 29, 2023, 1:20pm

Hi. I don’t know if Kate can find duplicate lines, but it can remove them. First select all lines you want to check, press F7 (switch to the command line), type ‘uniq’ and press ‘Enter’.

Source: https://narkive.com/JtFZWhQe:2.780.60

PS. See also:
https://docs.kde.org/stable5/en/kate/katepart/advanced-editing-tools-commandline.html

Kresimir · July 29, 2023, 2:34pm

Have you tried it? As someone who uses Kate at least 3 hours a day, I can tell you, this will not work.

EDIT: I might be wrong about this.

kainonergon · July 29, 2023, 2:36pm

Yes, I’ve tried it just before posting here. Works just fine.

nicknick · July 29, 2023, 8:19pm

Omg, thanks so much for looking up all those different methods to do it!!!

Methods 1 and 3 really work perfectly, especially since I thought I would have to remove the duplicates manually. Removing them automatically will help me a lot and make the job much easier (because I need to copy about 800-1000 links).

You really seem to know a lot about programming, that’s great! I just recently quit Windows and it scares me just looking at Terminal. Thank you very much for all these resources, I’ll take a look at it, although I don’t think I’ll learn programming, since that’s quite difficult for me and unfortunately I don’t have much time/patience, but it’s interesting to know!

By the way, from the Linux world, I only know Linus Torvalds.

Method 1 sorts all these links, so this is also very interesting, I’ll use both anyway, again, thank you very much for all your effort, this will really help me a lot, thank you!

By the way, were you the one who told me a while ago that there was a download manager similar to JDownloader that could download links in a personalized way, with programming or something like that?

Yes friend, unfortunately there are approximately 800-1000 links, and I could not know if I have copied the same link several times by mistake.

Thank you very much friend, but luckily I was able to fix it with our friend’s command.

Anyway, thanks, maybe that can help me another time!

Thanks friend, it worked for me but only the first time, the second time it didn’t work, I don’t know if it’s a bug or something.

Anyway, thanks for the help!

Kresimir · July 30, 2023, 6:22am

Thanks for asking a question I enjoyed answering. Not many of those around. But questions about Kate are almost always great.

By the way, I didn’t have to look up any of that, I’ve been using it for years.

Yeah, I think so. I told you to look into wget, which is a great little command line utility that downloads stuff from the web. I use it all the time.

nicknick · July 30, 2023, 12:22pm

That’s really cool! I think I’m going to have to learn some programming in the future too!

I see, thanks!

So, I’m going to open a thread about what I’m trying to download with JDownloader 2, maybe wget is better for it, you can take a look at it and tell me if I need to learn some programming language or something, for what I want to achieve!

I’ve been watching video tutorials on wget though, and it’s 100% Terminal-based, which makes it a bit tricky to sort/remove files, before starting the download.
