by Rahul


Spark: processing multiline CSV with EOLs in a text column

Multiline support for CSV will be added in Spark version 2.2 (see the JIRA); until then, you can try the steps below if you are facing issues while processing CSV files with embedded end-of-line characters:
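The failure mode is easy to reproduce in plain Java: any reader that treats `\n` as a record separator will break a quoted field that contains an EOL into two records. A minimal demo (the class name is hypothetical, not part of Spark):

```java
public class NaiveSplitDemo {
    public static void main(String[] args) {
        // One logical CSV record whose quoted field contains an EOL
        String csv = "1,\"first line\nsecond line\",done";
        // Naive line-based splitting, as a plain-text reader would do it
        String[] naive = csv.split("\n");
        System.out.println(naive.length); // 2 -- one logical record became two physical lines
    }
}
```

This is exactly what happens when Spark's default line-oriented CSV source reads such a file: the second half of the field shows up as a broken extra row.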

Get the InputFormat and reader classes from the Git repository into your code base and use them:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// implementation
JavaPairRDD<LongWritable, Text> rdd = context.newAPIHadoopFile(
        <CSV file path>, FileCleaningInputFormat.class, null, null, new Configuration());
JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());
```
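The record-splitting logic such an InputFormat implements can be sketched in plain Java: treat a newline as a record boundary only when it falls outside double quotes. This is a simplified standalone sketch (the class name is hypothetical, and it ignores edge cases like custom quote characters), not the actual FileCleaningInputFormat:

```java
import java.util.ArrayList;
import java.util.List;

public class MultilineCsvSplitter {
    // Splits raw CSV text into logical records, keeping newlines that
    // appear inside double-quoted fields as part of the field.
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // Naive toggle; an escaped quote ("") toggles twice, a net no-op
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain";
        List<String> recs = splitRecords(csv);
        System.out.println(recs.size()); // 3 -- the embedded EOL did not start a new record
    }
}
```

An InputFormat does the same thing at the split/record-reader level, so each element of the resulting RDD is one complete logical CSV row.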

Another solution to this problem is the Apache Crunch CSV reader, which can be plugged in in the same way as the FileCleaningInputFormat implementation above.