Web Scraping and Character Encoding

I made a little web scraping script. I scrape in two languages, English and Arabic.

I had the script on a previous installation and it worked fine for both languages, but I recently did a fresh install. (I am on KDE Plasma.)

On the current installation scraping works: when I scrape for an English search term, the results are written correctly to the text file I specified.
But if I try an Arabic search term, it runs and appears to actually be scraping, yet the text file ends up containing strange and funny characters like “منظمة شقيقة للبنك الدويل Ù” where there should be Arabic letters.

FYI, I already have Arabic fonts installed; I can type and read Arabic text normally in any app without issues.

So, since the script worked before without a problem, I guess the issue is with how the system handles the scraped text.

What can I do to get it to write the Arabic characters properly, in readable form?
I would highly appreciate any help.

Do you honor the encoding that the website reports the content is in while you are scraping?

I just scrape. I believe the script assumes UTF-8.
What makes me think it is only a system setting or configuration issue is that the script used to work and gave me properly encoded Arabic text in the output file. So I believe I should not need to touch the script itself, as it already worked perfectly before.

I’ve worked with encodings my whole life, and what always helped me figure these things out is remembering that there are no letters in computers, there are just numbers. What you are seeing when you say “strange and funny characters” is just numbers that are not interpreted in the way you expected.
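
For example, here is a quick Python demonstration of the effect (cp1252 is just one guess at the wrong encoding; the exact garbage you see depends on which wrong encoding is actually applied):

text = "منظمة"                                  # Arabic for "organization"
raw = text.encode("utf-8")                       # the actual numbers (bytes)
print(raw.decode("cp1252", errors="replace"))    # wrong interpretation, prints Ù…Ù†Ø¸Ù…Ø©
print(raw.decode("utf-8"))                       # correct interpretation, prints منظمة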

There are 2 possible reasons:

  • While scraping, the numbers were interpreted as something other than the numbers the website uses to represent them. This is the question I asked before.
  • While displaying or writing them after scraping, the numbers in the scraped data are not interpreted the way you expect. That is a question of how you output the data.

If you can conclusively rule out number one, then please check number two: see how you output the data and which encoding is used to interpret the numbers in the scraped data. Also check the encoding of the shell, if you output to a shell.
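
To make that concrete, here is a rough Python sketch of both checks. I am assuming a requests-based script that writes to a text file; that is a guess on my part (the URL and file name are placeholders), so adapt it to whatever your script actually does:

import requests

url = "https://example.com/search"              # placeholder, use your real URL
resp = requests.get(url)

# Check 1: which encoding is used to interpret the bytes from the site.
# requests guesses from the HTTP headers; compare it with what the content suggests.
print("encoding from headers:", resp.encoding)
print("encoding guessed from content:", resp.apparent_encoding)
resp.encoding = resp.apparent_encoding or "utf-8"   # honor the real encoding
text = resp.text

# Check 2: which encoding is used when writing the result.
# Passing encoding= explicitly means the output no longer depends on the
# system locale, which may have changed between your two installations.
with open("results.txt", "w", encoding="utf-8") as f:
    f.write(text)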

I will say it again: I am not that expert technically, but this does sound like it is to the point.
This is my locale:

limo@debian:~$ locale
LANG=en_US.UTF-8
LANGUAGE=C
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
limo@debian:~$ 

I believe it is supposed to be correct!
In my “Settings” → Regional & Language I set the language to the first option, “C”; it was set to American English before, but both gave me these funny characters.

I think there might be some setting, something to install or configure to get it right.

This looks to me more like a developer/development problem than an OS/installation issue. From a development perspective, I can only tell you how I would start debugging the issue:

Get a dump of the website you’re scraping, using either curl or wget. Then, using the scraping framework, I would try to look into memory and see whether the in-memory representation is the same. If that is not possible, dump the binary representation from the scraper into a file and binary-diff it against the original wget/curl version.
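
If it helps, this is roughly how I would do that comparison in Python rather than with a separate diff tool (again assuming a requests-based scraper; the URL and file name are just examples):

import requests

url = "https://example.com/page"        # the same page you fetched with wget/curl
resp = requests.get(url)

# Raw bytes as received by the scraper, before any decoding.
scraper_bytes = resp.content

# The wget/curl dump of the same page.
with open("dump.html", "rb") as f:
    reference_bytes = f.read()

# If the raw bytes match, the scraper receives the same data as wget/curl,
# and the problem is in how those bytes are decoded or written afterwards.
print("identical:", scraper_bytes == reference_bytes)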

Good luck.

Thank you.
I will consider your points.
Though I still can’t understand how the same code worked before but is not working now.
I will double-check the code.

Thank you.