I just scrape. I believe the script assumes UTF-8.
What makes me think it is only a system setting or configuration issue is that the script actually worked before and gave me properly encoded Arabic text in the output file. So I believe I should not touch the script itself.
I’ve worked with encodings my whole life, and what always helped me figure these things out is remembering that there are no letters in computers, there are just numbers. What you are seeing when you say “strange and funny characters” is just numbers that are not interpreted in the way you expected.
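To make the "just numbers" point concrete, here is a minimal sketch. It assumes Python, which this thread does not confirm is what the script uses, and the sample word is only an illustration. The same ten numbers give readable Arabic under one interpretation and "funny characters" under another.

```python
# Illustrative sample text (an assumption, not taken from the scraped site).
text = "مرحبا"

# Encoding turns the letters into numbers (bytes).
raw = text.encode("utf-8")
numbers = list(raw)

# Decoding with the encoding that produced the bytes restores the text.
right = raw.decode("utf-8")

# Decoding the very same bytes as Latin-1 yields unrelated characters.
wrong = raw.decode("latin-1")

if __name__ == "__main__":
    print(numbers)
    print(right)
    print(wrong)
```

Nothing is lost in the wrong case: the numbers are unchanged, only the interpretation differs.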
There are two possible reasons:
1. While scraping, the numbers were interpreted differently from how the website encodes them. This is the question I asked before.
2. While displaying the data after scraping, the numbers in the scraped data are not interpreted in the way you expect. That is a question of how you output the data.
If you can conclusively rule out number one, then please check number two. Look at how you output the data and which encoding is used to interpret the numbers in the scraped data. Also check the encoding of the shell, if you print to a shell.
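If the script happens to be written in Python (an assumption on my part), a rough way to check the output side is to ask the interpreter which encodings it picked up from the environment, and to name the encoding explicitly when writing the file. The function names `describe_output_encodings` and `write_text` below are hypothetical helpers for illustration.

```python
import locale
import sys


def describe_output_encodings():
    """Report the encodings Python derived from the current environment."""
    return {
        # Encoding used when printing to the terminal or a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


def write_text(path, text):
    """Write text as UTF-8 regardless of the system locale."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    for name, value in describe_output_encodings().items():
        print(name, "=", value)
```

If the reported values change between a machine where the output looked right and one where it does not, that would point at the environment rather than the script. Passing `encoding="utf-8"` explicitly removes the dependency on the locale for file output.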
I believe it is supposed to be correct!
In my “settings” → Regional & Language, I set the language to the first option, “C”. It was previously set to American English, but both settings gave me these funny characters.
I think there might be some setting, something to install or configure to get it right.
This looks to me more like a developer/development problem than an OS/installation issue. From a development perspective, I can only tell you how I would start debugging the issue:
Get a dump of the website you’re scraping, using either curl or wget. Then, using the scraping framework, I would try to inspect the data in memory to see whether its representation is the same. If that is not possible, dump the binary representation from the scraper into a file, then binary diff the original wget/curl version against this version.
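The binary comparison could be sketched roughly like this, again assuming Python. The helper names and any file names you pass in are hypothetical; the files would be the curl/wget dump and whatever raw bytes you manage to save from the scraper.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_files(path_a, path_b, context=16):
    """Compare two files byte for byte and show the bytes around a mismatch."""
    # Read in binary mode so that no decoding happens at all.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    offset = first_difference(a, b)
    if offset is None:
        return "identical"
    start = max(0, offset - context)
    end = offset + context
    return "differ at byte {}: {} vs {}".format(
        offset, a[start:end].hex(" "), b[start:end].hex(" ")
    )
```

If the two dumps are identical, the scraper received the right numbers and the problem lies in how they are decoded or displayed afterwards. If they differ, the hex around the first mismatch usually shows whether the bytes were re-encoded along the way.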
Thank you.
I will consider your points.
Though I still can’t understand how the same code worked before and is not working now.
I will double check the code.