[lug] Python and Unicode - Good Grief! (A Rant)

Steve Sullivan steve.sullivan at mathcom.com
Tue Mar 7 14:52:39 MST 2017


Hi Jed,

It might be worth trying Python 3, if you haven't already.
In Python 2 a lot of the str vs. unicode handling was a pain.
In Python 3 the 'str' type is unicode internally, and often
the unicode stuff "just works".
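
For what it's worth, here is a minimal sketch of what I think is going
on, assuming Python 3. In Python 2, when stdout is redirected to a file,
sys.stdout.encoding is None, so print falls back to the 'ascii' codec and
chokes on u'\xae'. The names `text` and `write_utf8` below are made up
for illustration; they are not from your script or from BeautifulSoup.

```python
import io

# The character from the traceback: U+00AE, the registered sign.
# (The surrounding string is an invented example.)
text = "Acme\u00ae widgets"


def write_utf8(binary_stream, s):
    """Write s to a binary stream as UTF-8, independent of the locale."""
    wrapper = io.TextIOWrapper(binary_stream, encoding="utf-8")
    wrapper.write(s)
    wrapper.flush()
    wrapper.detach()  # leave the underlying stream open for the caller


# Encoding to ASCII fails the same way the redirected print() did.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 works, whatever the output is attached to.
buf = io.BytesIO()
write_utf8(buf, text)
utf8_bytes = buf.getvalue()
```

In a real script you would pass sys.stdout.buffer instead of the BytesIO
object. If you need to stay on Python 2, setting PYTHONIOENCODING=utf-8
in the environment before running the script should also get the
redirect working.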

Steve

On Tue, Mar 07, 2017 at 12:08:16PM -0700, Jed S. Baer wrote:
> Hi Folks.
> 
> Yes, this is a rant. Maybe Python doesn't follow Larry Wall's "simple
> things should be easy" dictum. Well, it sure isn't helping me 1) get
> something done, and 2) learn the language. Behold!
> 
> I'm trying to write a web page scraper to produce RSS that I can feed
> into Liferea (a news aggregator). Yeah, it'd be nice if the site provided
> RSS, but it doesn't, and to call the HTML malformed would be a
> compliment. But hey, this should be do-able. So, I grab BeautifulSoup to
> see if I can come up with a nice link extractor. The HTML is really bad,
> so I think I'll use the "prettify" method to see what it looks like
> nested, so I can, I hope, visually identify the child/parent structure.
> 
> So, I run my script with output to the terminal. Fine, except that the
> indentation ends up wrapping. No biggie, I'll just redirect to a file.
> 
> $ ./scrape2rss.py > foo.html
> Traceback (most recent call last):
>   File "./scrape2rss.py", line 34, in <module>
>     print(soup.prettify())
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in
> position 27: ordinal not in range(128)
> 
> Oh, good grief. I guess I must be thinking backwards. If there's going to
> be a problem with character set conversion, wouldn't it be when sending
> output to a terminal, with all the TERMCAP and LC_LOCALE stuff being
> used? A file? Who cares? Just send the bits through the redirect.
> 
> Yes, I know that someplace, in all the classes and methods underlying all
> the cool stuff that BeautifulSoup is trying to "help" me with, there's
> some stuff I'm sure I can do to tell it to use a codepage, or UTF8, or
> something. But really. Aaaargh.
> 
> (Yes, I know this is really not Python, as such, but something else in
> the bowels of some module being pulled in by BeautifulSoup.)
> 
> cf. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

-- 

========================================
Steve Sullivan      steve.sullivan at mathcom.com
720-587-7498        http://www.mathcom.com
========================================

