All you need to know about Avro schema

Reading Time: 4 minutes

In this post, we are going to dive into the basics of the Avro schema. We will create a sample Avro schema, serialize a record to an output file, and then read that file back according to the schema.

Intro to Avro

Apache Avro is a data serialization system created by Doug Cutting, the father of Hadoop. It helps with data exchange between systems, programming languages, and processing frameworks. Avro serializes data, together with its built-in schema, into a compact binary format that can be deserialized by any other application. This results in fast serialization and a smaller payload size. To avoid confusion between Avro and the Avro schema: Avro is the serialization system, and it uses an Avro schema to serialize the data.

Avro Schema

From a top-level view, it is a data structure declared in JSON format, with the schema attached to the data. That's it.

Advantages

  • The schema is defined along with the data, making it fully typed
  • The schema can evolve over time in a safe manner (also known as schema evolution)
  • Data is compressed with less CPU usage and can be read across any language

I don't see any major disadvantages; if you find one, please leave it in the comments section.

Similar with Avro

Avro is similar to Thrift, Parquet, Protobuf, ORC, etc. All of these data formats have similar intentions, but we are not going to compare their performance today, unless you want to or there is some performance roadblock in your program. Coming back to it: Avro has good support for Hadoop-based technologies like Hive, and it is the primary data format supported by the Confluent Schema Registry and for Kafka, as we care about messages being explicit and fully described (so no ORC, Parquet, etc.).

What does it look like?

Here is a basic example of an Avro schema, user.avsc, with the fields first_name, last_name, age, and automated_email.

{
	"type": "record",
	"namespace": "com.example",
	"name": "User",
	"fields": [{
			"name": "first_name",
			"type": "string",
			"doc": "First Name of the User"
		},
		{
			"name": "last_name",
			"type": "string",
			"doc": "Last Name of the User"
		},
		{
			"name": "age",
			"type": "int",
			"doc": "Age of the User"
		},
		{
			"name": "automated_email",
			"type": "boolean",
			"default": true,
			"doc": "Indicaton if the user is subscribe to the newsletter"
		}
	]
}

What just happened above?

It uses some of the common fields often seen when defining an Avro schema:

  • name : the name of your schema; here it was User
  • namespace : effectively the package name
  • doc : yes, you guessed it right, documentation to explain your schema
  • aliases : if you have alternative names for your schema, you can list them in aliases
  • fields
    • name : the name of the field
    • doc : documentation again
    • type : the data type of the field
    • default : you can provide a default value for the field using default

Avro Types

You can get the details of all the supported data types in the documentation. In short, Avro offers primitive types (null, boolean, int, long, float, double, bytes, string) and complex types (record, enum, array, map, union, fixed).
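As a quick illustration of the complex types, here is a hypothetical employee.avsc (not part of the running example) that nests an enum, an array, and a union inside a record:

```json
{
	"type": "record",
	"namespace": "com.example",
	"name": "Employee",
	"fields": [
		{"name": "id", "type": "string"},
		{"name": "role", "type": {"type": "enum", "name": "Role", "symbols": ["DEV", "OPS", "QA"]}},
		{"name": "skills", "type": {"type": "array", "items": "string"}},
		{"name": "manager_id", "type": ["null", "string"], "default": null}
	]
}
```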

Default Values and Logical Types

Some cases which you may encounter :

Default values are one of the use cases of a union, which lets a field take values of several different types. By default, every field in an Avro schema is non-nullable; to make a field nullable, declare it as a union with null and give it a default of null.

Example : Making middle_name as nullable

{
	"name": "middle_name",
	"type": ["null", "string"],
	"default": null
}

Note : write null without quotation marks, not as "null"
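One more union rule worth knowing: the default value must match the first type listed in the union. That is why null comes first above. If you instead want a non-null default, put the non-null type first, as in this hypothetical field:

```json
{
	"name": "country",
	"type": ["string", "null"],
	"default": "unknown"
}
```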

A logical type is just used to give a more meaningful name to an already existing primitive type. The most commonly used ones are:

  • decimals (bytes)
  • date (int) – number of days since unix epoch (Jan 1st 1970)
  • time-millis (long) – number of milliseconds after midnight, 00:00:00.000
  • timestamp-millis (long) – number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC
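To see what these underlying primitive values look like, here is a small standalone Scala sketch (using only java.time, no Avro dependency) that computes the raw values you would store for the date, time-millis, and timestamp-millis logical types:

```scala
import java.time.{Instant, LocalDate, LocalTime}

object LogicalTypeValues extends App {
  // date: number of days since the unix epoch (Jan 1st 1970), stored as an int
  val days: Long = LocalDate.of(1970, 1, 11).toEpochDay
  println(s"date value: $days") // 10

  // time-millis: number of milliseconds after midnight, 00:00:00.000
  val millisAfterMidnight: Long = LocalTime.of(0, 0, 1).toNanoOfDay / 1000000
  println(s"time-millis value: $millisAfterMidnight") // 1000

  // timestamp-millis: number of milliseconds since the unix epoch, in UTC
  val ts: Long = Instant.parse("1970-01-01T00:00:01Z").toEpochMilli
  println(s"timestamp-millis value: $ts") // 1000
}
```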

Example : Click timestamp :

We just have to attach "logicalType": "timestamp-millis" to the field's type. Note that the logical type annotation belongs on the type itself, not beside it at the field level:

{
	"name": "click_ts",
	"type": {
		"type": "long",
		"logicalType": "timestamp-millis"
	}
}

Doing something with the schema we defined

So far, we have just defined the Avro schema. Let's make an Avro object out of it.

We can follow either of two approaches for creating Avro objects:

  • Generic Record or
  • Specific Record

Generic Record

Here the record is created with GenericRecordBuilder, from a schema referenced from a file or parsed directly from a string.

Limitation : it isn't typesafe; field names and types are only checked at runtime.

Example (using the same user.avsc schema defined above):

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecordBuilder}

    val userAvroSchemaFilePath = "src/main/resources/user.avsc"
    val schemaFile = new File(userAvroSchemaFilePath)

    // Parse the schema from the .avsc file
    val parser: Schema.Parser = new Schema.Parser()
    val schema: Schema = parser.parse(schemaFile)

    // Build the record field by field; an unknown field name fails at runtime
    val myRecordBuilder: GenericRecordBuilder = new GenericRecordBuilder(schema)
    myRecordBuilder.set("first_name", "Jack")
    myRecordBuilder.set("last_name", "Hill")
    myRecordBuilder.set("age", 100)
    myRecordBuilder.set("automated_email", false)
    val myRecord: GenericData.Record = myRecordBuilder.build()
    println(myRecord)

Output :

{"first_name": "Jack", "last_name": "Hill", "age": "100", "height": "160", "weight": "55", "automated_email": "false"}

Specific Record

Here you generate code from the Avro schema using plugins or tools.
This is an example using the sbt dependency avrohugger.
Add this dependency to your build.sbt:

  val avroHugger = "com.julianpeeters" %% "avrohugger-core" % "1.0.0-RC22"

Then run this function to generate the case class. The code is generated inside target/generated-sources, under the namespace you defined in the Avro schema file.

  def specificRecordCodeGenerate(schemaPath: String): Unit = {
    import java.io.File
    import avrohugger.Generator
    import avrohugger.format.SpecificRecord

    val mySchemaFile = new File(schemaPath)
    val generator = Generator(SpecificRecord, restrictedFieldNumber = true)
    // Generated sources land in target/generated-sources by default
    generator.fileToFile(mySchemaFile)
    logger.info(s"Code generated for schema $schemaPath")
  }

Now you can use this generated case class to create objects and work with them in a typesafe way.
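As a rough sketch of what that looks like (the exact generated code depends on the avrohugger version, and the real class also extends Avro's SpecificRecordBase), the case class generated for the User schema can be used like any other case class:

```scala
// Approximation of what avrohugger generates for user.avsc;
// the real output also implements org.apache.avro.specific.SpecificRecordBase.
case class User(
  first_name: String,
  last_name: String,
  age: Int,
  automated_email: Boolean = true
)

object SpecificRecordDemo extends App {
  // Typesafe: a misspelled field name or wrong type is a compile
  // error here, unlike with GenericRecordBuilder
  val user = User(first_name = "Jack", last_name = "Hill", age = 100, automated_email = false)
  println(user)
}
```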

Example: reading and writing in the Avro format

Writing

Let's create an output path for the Avro file to be written to:

val outputFile = new File("src/main/resources/test-files/person-generic.avro")

Let's use the generic record created in the example above (i.e. myRecord) and write it to the provided file path:

    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter}

    val datumWriter = new GenericDatumWriter[GenericData.Record](myRecord.getSchema)
    try {
      val dataFileWriter = new DataFileWriter[GenericData.Record](datumWriter)
      dataFileWriter.create(myRecord.getSchema, outputFile)
      dataFileWriter.append(myRecord)
      dataFileWriter.close()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
    }

Reading

Pass the output file written above and read it back:

    import org.apache.avro.file.DataFileReader
    import org.apache.avro.generic.{GenericData, GenericDatumReader}

    val datumReader = new GenericDatumReader[GenericData.Record]()
    try {
      val dataFileReader = new DataFileReader[GenericData.Record](outputFile, datumReader)
      while (dataFileReader.hasNext) {
        val readRecord: GenericData.Record = dataFileReader.next()
        println(">>" + readRecord)
      }
      dataFileReader.close()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
    }

Output:

>>{"first_name": "Jack", "last_name": "Hill", "age": 100, "height": 160.0, "weight": 55.0, "automated_email": false}

You can refer to the accompanying repository for the full implementation. For more information regarding Avro schemas, see the references below.

Thanks!

References :
Avro Documentation
