AVRO - Schemas
satya - 7/22/2017, 6:00:12 PM
Here are some of AVRO's purported qualities:
Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.
satya - 7/22/2017, 6:04:24 PM
On AVRO Schemas
Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
satya - 7/22/2017, 6:04:57 PM
An example schema
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
satya - 7/22/2017, 6:07:35 PM
:) why not use an object format
Example.Avro.User {
String name;
int favNumber[null];
String favColor[null];
}
//The quotes take away from clarity in my opinion
//Anyway, this is just a gripe
satya - 7/22/2017, 6:08:18 PM
Java classes can be generated from the schema
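For example, assuming the User schema above is saved as user.avsc, the avro-tools jar can generate the class (the jar path and version here are only illustrative):
java -jar /path/to/avro-tools-1.8.2.jar compile schema user.avsc .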
satya - 7/22/2017, 6:08:36 PM
You can then do:
User user2 = new User("Ben", 7, "red");
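Generated classes can also be populated with setters or with the generated builder; a small sketch along the lines of the Avro getting-started guide (field names come from the User schema above):
// Setters; favorite_color is a union with null, so it can be left unset
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);

// Builder
User user3 = User.newBuilder()
    .setName("Charlie")
    .setFavoriteColor("blue")
    .setFavoriteNumber(null)
    .build();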
satya - 7/22/2017, 6:11:41 PM
A Java serialization example
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter =
new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter =
new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
satya - 7/22/2017, 6:12:39 PM
Deserialization
// Deserialize users from disk with the generic API
// (file names assumed here: the User schema saved as user.avsc, data in users.avro)
Schema schema = new Schema.Parser().parse(new File("user.avsc"));
File file = new File("users.avro");
DatumReader<GenericRecord> datumReader =
    new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    // Reuse the user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
dataFileReader.close();
satya - 7/22/2017, 6:17:10 PM
An example of backward compatibility
//Taken from one of the links above:
//Jon Juanich
//A base record to start with
{
  "type": "record",
  "name": "Employee",
  "fields": [
      {"name": "name", "type": "string"},
      {"name": "age", "type": "int"},
      {"name": "emails", "type": {"type": "array", "items": "string"}},
      {"name": "boss", "type": ["Employee","null"]}
  ]
}
//Redefine some fields
//Or remove or add new fields
{
  "type": "record",
  "name": "Employee",
  "fields": [
      {"name": "name", "type": "string"},
      {"name": "yrs", "type": "int", "aliases": ["age"]},
      {"name": "gender", "type": "string", "default":"unknown"},
      {"name": "emails", "type": {"type": "array", "items": "string"}}
  ]
}
Note how yrs is an alias for age, gender is a new field with a default, and boss has been dropped. The reader's schema is resolved against the writer's, somewhat like a type cast.
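A minimal Java sketch of that resolution, assuming the new schema above is saved as employee_v2.avsc (a hypothetical name) and the old data was written to employees.avro with the original schema:
// The writer's schema comes from the data file itself; the reader's schema
// drives resolution (the alias maps age -> yrs, gender gets its default, boss is ignored)
Schema newSchema = new Schema.Parser().parse(new File("employee_v2.avsc"));
DatumReader<GenericRecord> datumReader =
    new GenericDatumReader<GenericRecord>(newSchema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(new File("employees.avro"), datumReader);
while (dataFileReader.hasNext()) {
    GenericRecord employee = dataFileReader.next();
    System.out.println(employee.get("yrs") + " " + employee.get("gender"));
}
dataFileReader.close();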
satya - 7/22/2017, 6:17:30 PM
What open source solutions are using AVRO?
satya - 7/22/2017, 6:18:40 PM
Here is IBM's take on AVRO for analytics and big data
satya - 7/22/2017, 6:19:02 PM
AVRO becomes a cross language data exchange ...
satya - 7/22/2017, 6:19:27 PM
AVRO is increasingly used for RPC in distributed systems to communicate data
satya - 7/22/2017, 6:20:22 PM
MapReduce connections: Splitting large data sets
Avro files include synchronization markers that can be used to split large data sets into subsets suitable for MapReduce processing.
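A hedged sketch of what processing one split looks like with the plain Java API; start and end are hypothetical byte offsets handed to a map task, and DataFileReader's sync/pastSync are used to align on block boundaries:
// long start, end: byte offsets of this split (assumed to be given by the framework)
DataFileReader<GenericRecord> reader =
    new DataFileReader<GenericRecord>(new File("users.avro"),
        new GenericDatumReader<GenericRecord>());
reader.sync(start);                       // jump to the first sync marker after 'start'
while (reader.hasNext() && !reader.pastSync(end)) {
    GenericRecord record = reader.next(); // only the blocks inside this split are read
    // process record ...
}
reader.close();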
satya - 7/22/2017, 6:20:56 PM
Support for more complex data types
Avro supports a rich set of primitive data types, including numeric types, binary data, and strings, and a number of complex types, including arrays, maps, enumerations, and records. A sort order can also be defined for the data.
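For illustration, a made-up schema (in the same JSON form as above) mixing an array, a map, and an enum:
{"namespace": "example.avro",
 "type": "record",
 "name": "Profile",
 "fields": [
     {"name": "id", "type": "long"},
     {"name": "tags", "type": {"type": "array", "items": "string"}},
     {"name": "scores", "type": {"type": "map", "values": "double"}},
     {"name": "status",
      "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}}
 ]
}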
satya - 7/22/2017, 6:21:56 PM
CORBA, SOAP, Now AVRO RPC!!
satya - 7/22/2017, 6:29:27 PM
What is AVRO framing?
satya - 7/22/2017, 6:34:01 PM
Make sure to read through the AVRO spec. There is a lot of interesting detail.
satya - 7/22/2017, 6:37:07 PM
Interesting: the file structure is quite elaborate, which makes it practical and efficient
Avro includes a simple object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding. Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.
satya - 7/22/2017, 6:38:21 PM
Alright here is a link for Snappy
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
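Tying that back to Avro: a minimal sketch of turning on Snappy compression for a container file (assumes the generated User class from earlier and the Snappy library on the classpath):
// CodecFactory is org.apache.avro.file.CodecFactory
DataFileWriter<User> writer =
    new DataFileWriter<User>(new SpecificDatumWriter<User>(User.class));
writer.setCodec(CodecFactory.snappyCodec()); // must be set before create()
writer.create(user1.getSchema(), new File("users.avro"));
writer.append(user1);
writer.close();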
satya - 8/16/2017, 5:32:15 PM
AVRO RPC, gRPC, Ribbon
satya - 8/16/2017, 5:39:19 PM
Here is a link on Thrift, AVRO, and Protobuffers from Stack Overflow
satya - 8/16/2017, 5:54:21 PM
Example of a .proto file
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
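For comparison, a rough AVRO-schema equivalent of the same Person message (my own sketch, not taken from any of the linked articles):
{"type": "record",
 "name": "Person",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "id", "type": "int"},
     {"name": "email", "type": ["null", "string"], "default": null},
     {"name": "phone", "type": {"type": "array", "items": {
         "type": "record", "name": "PhoneNumber",
         "fields": [
             {"name": "number", "type": "string"},
             {"name": "type",
              "type": {"type": "enum", "name": "PhoneType",
                       "symbols": ["MOBILE", "HOME", "WORK"]},
              "default": "HOME"}
         ]}}}
 ]
}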
satya - 8/16/2017, 5:58:03 PM
Protobuffers and Avro
satya - 8/16/2017, 6:02:35 PM
A good comparison of AVRO, Thrift, and Protobuffers
satya - 8/16/2017, 6:17:07 PM
So far.....
Protobuffers has been around since 2001 and is one of the first serialization schemes for backward-compatible data. Its definitions closely resemble Java classes: very clean and easily understood. The definitions do not travel with the binary data, so a client must generate classes at compile time to read the data back in. This is a bit unlike XML; because an XML serialization contains the tag names, you can map them to objects without an explicit compile step (although compilation is quite possible, as in early XML implementations). It was developed by Google for its internal IPC. A key goal for them is backward compatibility, where one can add new fields and not have to regenerate old clients. Recently Google has also announced gRPC, which works with Protobuffers.
Thrift is a further refinement of a similar idea; I hear it was also developed by an ex-Googler, but under Apache, and it is widely used at Facebook. Thrift extends the Protobuffers idea with full RPC and seems far more suitable for PHP and the like. Thrift offers similar compatibility and efficiency benefits.
AVRO is part of the Hadoop ecosystem and embeds the structure definitions in the transported file itself, making it more dynamic and extending compatibility further. Its structure is defined in JSON. There is no need to generate stubs or clients, although code generation can be done to help. AVRO also has the ability to embed multiple streams, so objects don't have to be read in sequence. AVRO is also being used for RPC in the Hadoop ecosystem and in things like Kafka. It appears to be the more intriguing and more capable of the three.