Hadoop Connector - Binary Attachments

I’ve used the Hadoop Connector to import data into HDFS with Sqoop. When I import a database of text documents, the data arrives as CSV. I also have a database of text documents with image (JPG) attachments. When I Sqoop that database into HDFS, the data comes through as one huge blob. The documentation does not explain what the Hadoop Connector is doing. Can someone explain what it’s doing to the non-text data? Is it just dumping a byte array as a field in the CSV?

I’m trying to figure out how to write a MapReduce job in Hadoop to analyze the data.

Thank you!

I believe it’s mentioned in the Hadoop Connector documentation that we normalize values to strings. That’s done by calling toString() on the object. If you want to control that and get the data into some kind of normalized format, you can probably wrap it in a custom transcoder.
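To illustrate why toString() matters for binary data, here is a minimal Java sketch using only the standard library. It assumes the attachment reaches the connector as a raw byte[], which may not match what the connector actually sees. The class BinaryFieldSketch and the helpers encodeForCsv and decodeFromCsv are hypothetical names invented for this example and are not part of the connector or the client library. The sketch shows that toString() on a byte array yields a type tag and hash code rather than the contents, and that a Base64 encoding (one possible normalized format) gives text that can be decoded again inside a map task.

```java
import java.util.Arrays;
import java.util.Base64;

public class BinaryFieldSketch {

    // Hypothetical helper: turn raw attachment bytes into text that contains
    // no commas, quotes, or line breaks, so it fits in a single CSV field.
    static String encodeForCsv(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Hypothetical helper: recover the original bytes from the CSV field,
    // for example inside a mapper.
    static byte[] decodeFromCsv(String field) {
        return Base64.getDecoder().decode(field);
    }

    public static void main(String[] args) {
        // First four bytes of a typical JPEG file, used as stand-in data.
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};

        // Arrays inherit Object.toString(), so the result is a type tag plus
        // an identity hash (something like "[B@1b6d3586"), not the contents.
        String viaToString = jpegHeader.toString();
        System.out.println(viaToString.startsWith("[B@")); // prints "true"

        // Base64 keeps the contents and is reversible.
        String encoded = encodeForCsv(jpegHeader);
        System.out.println(encoded); // prints "/9j/4A=="
        System.out.println(Arrays.equals(jpegHeader, decodeFromCsv(encoded))); // prints "true"
    }
}
```

If the stored value is some other object type, its own toString() decides what ends up in the CSV, so it is worth inspecting one imported record before writing the job.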

Are you looking to process the images in Hadoop?