This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

"csv" files with UTF-8 with BOM encoding are returning the first column with quotes #47

Closed
andrealexandre opened this issue Jul 8, 2014 · 11 comments

Comments

@andrealexandre

Take this CSV header example:

"ID", "Name", "Age"
(...)

When the file is encoded as UTF-8 with BOM, the CsvReader returns the first column name as the following string: ""ID"" (quotes included).

@cowtowncoder
Member

Just to make sure: you mean that CSV content that has a UTF-8 BOM (3 bytes) will cause the first header name to be reported incorrectly? Could you share a bit of code to show how you are accessing the field name?

@andrealexandre
Author

I'm actually using direct mapping to a POJO object.

import java.io.File;

import com.fasterxml.jackson.annotation.JsonInclude.Include;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.PropertyNamingStrategy;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

CsvMapper mapper = new CsvMapper();

mapper.setSerializationInclusion(Include.NON_NULL)
            .enable(MapperFeature.AUTO_DETECT_GETTERS)
            .enable(MapperFeature.AUTO_DETECT_IS_GETTERS)
            .enable(MapperFeature.AUTO_DETECT_SETTERS)
            .disable(MapperFeature.AUTO_DETECT_FIELDS)
            .disable(SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS)
            .disable(SerializationFeature.FAIL_ON_EMPTY_BEANS)
            .setPropertyNamingStrategy(PropertyNamingStrategy.PASCAL_CASE_TO_CAMEL_CASE)
            .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

// Use the first line of the file as the header row:
CsvSchema schema = CsvSchema.emptySchema().withHeader();

File file = new File("{CSV file UTF-8 with BOM}");

if (!file.exists()) {
    System.out.println("File doesn't exist.");
}

MappingIterator<MyObject> objects = mapper.reader(MyObject.class)
                                      .with(schema)
                                      .readValues(file);

I know this because I debugged my application, went through most of your source code, and found the BOM character in the CsvReader.java file, around line 559.
You can also find the character inside "_inputBuffer", right at position 0.

(I was also wondering if there is an option to specify the file encoding; I think that could avoid this problem.)

@cowtowncoder
Member

I don't offer a way to specify encoding because it's safer to just require use of java.io.Reader, if the encoding is already known. With CSV things are more difficult wrt auto-detection (since there's no well-known start sequence), but it should be relatively easy to fix BOM handling. It's just not properly tested, I think.

So: I just want to know the exact BOM bytes in use -- there are kinds of broken content where what looks like a BOM is not a valid one.

Is it possible to share the file, or at least the first couple of bytes? It should be easy enough to figure out the problem with that.

Thanks again for reporting the problem.
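
A caller who already knows the encoding can also strip the BOM themselves before handing a Reader to the parser. A minimal stdlib sketch of that idea (the BomSkipper class and skipBom helper are hypothetical names, not part of the library):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class BomSkipper {
    // Wraps a Reader and consumes a single leading U+FEFF, if present.
    public static Reader skipBom(Reader in) throws IOException {
        PushbackReader pb = new PushbackReader(in, 1);
        int first = pb.read();
        if (first != -1 && first != '\uFEFF') {
            pb.unread(first); // not a BOM: push it back
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        // Simulate UTF-8-with-BOM content already decoded to chars:
        Reader r = skipBom(new StringReader("\uFEFF\"ID\",\"Name\",\"Age\""));
        BufferedReader br = new BufferedReader(r);
        System.out.println(br.readLine()); // prints "ID","Name","Age"
    }
}
```

The resulting Reader can then be passed to the mapper in place of the raw file.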

@andrealexandre
Author

andrealexandre commented Jul 8, 2014

I can only share an example file, but I tested it and found the same problem with this file.

I found that you can replicate this by editing a .csv file with Notepad++ and converting its encoding to UTF-8 (with BOM).

I hope it was helpful.


@cowtowncoder
Member

If you could just list the first couple of bytes of the file -- the BOM, and a couple of bytes of the CSV itself. I just want to make 100% sure I use the exact same setup, and it is quite easy to get different files, as different tools have different capabilities wrt detection and handling of BOMs.

@andrealexandre
Author

Very well, I understand.
I sent the first couple of bytes from the file I used.


cowtowncoder added a commit that referenced this issue Jul 17, 2014
@cowtowncoder
Member

Looks like I can reproduce this easily; a single char is prepended. It can come from anywhere between 1 and 3 bytes, and the resulting char, 0xFEFF, is the "illegal character" marker.
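
For reference, the UTF-8 BOM is the three-byte sequence EF BB BF, which decodes to the single char U+FEFF -- a quick stdlib check:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // The UTF-8 encoding of the byte order mark:
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        String decoded = new String(bom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());        // prints 1
        System.out.println((int) decoded.charAt(0)); // prints 65279 (0xFEFF)
    }
}
```

This is the char that ends up at position 0 of _inputBuffer when the BOM is not skipped.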

@cowtowncoder
Member

Interesting. So, CsvParserBootstrapper seems like it should work. But I hadn't connected that to CsvFactory... which is why BOM is simply ignored, it seems.

@andrealexandre
Author

I know the issue is already closed, but I found quite a neat solution for this problem: the Apache Commons IO library has a decorator class named BOMInputStream that wraps an InputStream and skips the BOM bytes. I tested it, and it works fine with the CsvMapper.
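
For anyone hitting the same problem, a minimal sketch of that approach (assuming Commons IO is on the classpath; here rows are bound to Map rather than the POJO from earlier in the thread, to keep the example self-contained):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Map;

import org.apache.commons.io.input.BOMInputStream;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class CsvWithBom {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = CsvSchema.emptySchema().withHeader();

        // BOMInputStream detects and skips a leading UTF-8 BOM, if one is
        // present, so the CSV parser never sees the 0xFEFF char.
        try (InputStream in = new BOMInputStream(new FileInputStream(args[0]))) {
            MappingIterator<Map<String, String>> rows =
                    mapper.reader(Map.class).with(schema).readValues(in);
            while (rows.hasNext()) {
                System.out.println(rows.next());
            }
        }
    }
}
```

With this wrapper in place the first header column comes back as ID, not ""ID"".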

@cowtowncoder
Member

Thanks, that should be useful for the general problem, and good to know of.

@tom999

tom999 commented Jan 16, 2018

Thanks, this helped me out; the first column was always coming back null.
