Web Scraping and Character Encoding

I’ve worked with encodings my whole life, and what has always helped me figure these things out is remembering that there are no letters in computers, only numbers. The “strange and funny characters” you are seeing are just numbers that were not interpreted the way you expected.
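
To make the "only numbers" point concrete, here is a minimal sketch. It assumes Python, which the question does not specify, and it picks UTF-8 and Latin-1 purely as an illustrative pair of encodings. The same two numbers become either one correct character or two wrong ones, depending on how they are interpreted.

```python
# Text is stored as numbers (bytes); an encoding maps those numbers to characters.
text = "\u00e9"                    # one character: e with an acute accent
raw = text.encode("utf-8")         # the numbers UTF-8 uses to store it

numbers = list(raw)                # the raw byte values
as_utf8 = raw.decode("utf-8")      # interpreted as intended: the original character
as_latin1 = raw.decode("latin-1")  # interpreted wrongly: two unrelated characters

print(numbers)                            # [195, 169]
print(ascii(as_utf8), ascii(as_latin1))   # '\xe9' '\xc3\xa9'
```

`ascii()` is used for printing so that the output itself does not depend on the encoding of your terminal.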

There are two possible reasons:

  • While scraping, the numbers were interpreted with a different encoding than the one the website uses to represent its text. This is the question I asked before.
  • While displaying them after scraping, the numbers you have in the scraped data are not interpreted in the way you expect. That is a question of how you output the data.
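
One way to tell the two cases apart is to look at the scraped string without relying on any display encoding. The sketch below again assumes Python; `code_points` is a hypothetical helper name, and the sample strings are made up for illustration. If the code points are already wrong, the problem happened while scraping. If they are right, the problem is in the output step.

```python
def code_points(s):
    """Return the Unicode code points of s as text, independent of any display encoding."""
    return [f"U+{ord(ch):04X}" for ch in s]

good = "\u00e9"        # decoded correctly while scraping
bad = "\u00c3\u00a9"   # the same bytes, decoded as Latin-1 instead of UTF-8

print(code_points(good))  # ['U+00E9']
print(code_points(bad))   # ['U+00C3', 'U+00A9']

# For this specific mix-up (UTF-8 bytes read as Latin-1), undoing the wrong
# interpretation recovers the original text.
repaired = bad.encode("latin-1").decode("utf-8")
```

The repair step only works when you know which two encodings were mixed up; it is better to decode correctly in the first place.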

If you can conclusively rule out number one, then please check number two. Look at how you output the data and which encoding is used to interpret the numbers in the scraped data. If you print to a shell, also check the encoding of the shell.
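
As a rough sketch of how to check the output side, assuming Python again, the snippet below reports which encodings the interpreter will use when you print or open files without saying otherwise. `output_encodings` is a hypothetical helper name, not part of any library.

```python
import locale
import sys

def output_encodings():
    """Report the encodings Python uses for output by default."""
    return {
        # Encoding used by print(); may be None if stdout has been replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale-based default encoding reported by the operating system.
        "locale": locale.getpreferredencoding(False),
    }

print(output_encodings())
```

If either value is not what you expect, pass the encoding explicitly when writing, for example `open(path, "w", encoding="utf-8")`, and make sure the shell or editor you view the result in is set to the same encoding.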