Ignore BOM on CSV when sending bulk invitations

barryvan · May 9, 2017, 4:30am

When you create a CSV in Excel, it includes by default what I think is a BOM (byte order mark) at the top of the file, to ensure that it’s parsed as UTF-8. Unfortunately Discourse doesn’t handle this (and doesn’t read the file as UTF-8!), so the first email address always fails with an error like this:

(Note that the “j” is the first letter of the actual email address.)

I think Discourse should strip/ignore the BOM on CSVs – it’s probably pretty common that people would put these together in Excel, after all.

codinghorror · May 9, 2017, 6:03am

That might make sense, @techapj. I can confirm via hex editor that when an Excel 2016 file is saved as CSV

a,b,c,d
e,f,g,h

there is a BOM by default:

UTF-8: EF BB BF
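
To make the effect of those three bytes concrete, here is a minimal Python sketch (an illustration only, not Discourse code). It assumes a hypothetical export consisting of the BOM followed by the two rows above, and the helper name `first_field` is made up for the example. It shows what the first cell looks like when the BOM is not stripped, both when decoding as plain UTF-8 and when decoding with a single-byte encoding such as Latin-1.

```python
import codecs
import csv
import io

# Hypothetical export: the UTF-8 BOM followed by the two rows shown above.
raw = codecs.BOM_UTF8 + b"a,b,c,d\r\ne,f,g,h\r\n"

# The first three bytes are the BOM seen in the hex editor.
bom_hex = raw[:3].hex(" ").upper()


def first_field(data: bytes, encoding: str) -> str:
    """Decode CSV bytes with the given encoding and return the first cell."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline="")))[0]


# Plain UTF-8 keeps the BOM as an invisible U+FEFF in front of the first cell.
as_utf8 = first_field(raw, "utf-8")

# A single-byte encoding such as Latin-1 turns it into three visible characters.
as_latin1 = first_field(raw, "latin-1")

print(bom_hex, repr(as_utf8), repr(as_latin1))
```

Either way the first cell is no longer just "a", which is consistent with only the first email address in a file being rejected.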

Mittineague · May 9, 2017, 6:28am

I reviewed the W3C page to double-check my understanding that BOMs are only needed for UTF-16 and UTF-32, because UTF-8 has no big-endian / little-endian variants.
With HTML5 things have changed a bit since I last visited the page.
Older browsers could use …BE and …LE HTTP headers, but now the only way is by reading the BOM.
tl;dr: for UTF-8, yes, strip the BOM.

https://www.w3.org/International/questions/qa-byte-order-mark
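
As a rough sketch of what "strip the BOM" can look like in practice, the Python below uses the standard `utf-8-sig` codec, which decodes UTF-8 and drops a single leading BOM if one is present. This illustrates the general technique and is not the change made in Discourse; the function name `read_invite_rows` and the example addresses are invented for the illustration.

```python
import codecs
import csv
import io


def read_invite_rows(data: bytes) -> list:
    """Parse CSV bytes into rows, ignoring a leading UTF-8 BOM if present."""
    # "utf-8-sig" behaves like "utf-8" but skips one BOM at the very start.
    text = data.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical uploads: the same two addresses, with and without a BOM.
without_bom = b"jane@example.com\r\njohn@example.com\r\n"
with_bom = codecs.BOM_UTF8 + without_bom

rows_with_bom = read_invite_rows(with_bom)
rows_without_bom = read_invite_rows(without_bom)
print(rows_with_bom == rows_without_bom)
```

Because the codec only removes a BOM at the start of the input, files saved without one parse exactly as before.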

techAPJ · May 9, 2017, 5:19pm

Fixed via:

https://github.com/discourse/discourse/commit/e6e00253267008b17c760befbd852b5fdf998060
