Category Archives: avro

Apache Flume – Multiplex events onto multiple HDFS files

I'm pretty new to Apache Flume and HDFS. I'm trying to ingest events (contained in a file on the local FS) into HDFS.

The file contains rows of numbers separated by commas. The last number on each row is the one I want to use to create separate files on HDFS, one file per number. I don't know in advance how many distinct numbers appear in the records.
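For example (made-up values), a few rows might look like this, where the trailing 1 and 2 are the numbers I want to split on:

10,4,7,1
23,5,9,2
11,8,3,1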

So I wrote this.

agent.sources = src1
agent.channels = chan1
agent.sinks = sink1

agent.sources.src1.type = spooldir
agent.sources.src1.channels = chan1
agent.sources.src1.spoolDir = /home/me/spooldir

agent.sources.src1.interceptors = idI

agent.sources.src1.interceptors.idI.type = regex_extractor
agent.sources.src1.interceptors.idI.regex = (\d)(\r|\n|\s)
agent.sources.src1.interceptors.idI.serializers = t
agent.sources.src1.interceptors.idI.serializers.t.name = id

agent.sources.src1.selector.type = multiplexing
agent.sources.src1.selector.header = id
agent.sources.src1.selector.default = chan1

agent.sinks.sink1.type = hdfs
agent.sinks.sink1.channel = chan1
agent.sinks.sink1.hdfs.path = hdfs://localhost:9000/ingest/
agent.sinks.sink1.hdfs.filePrefix = id-%{id}
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.rollInterval = 0
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.idleTimeout = 0

agent.channels.chan1.type = memory

The problem is that I get only one file, containing all of the records, named id-.timestamp, so the "id" header is apparently never set!

I want to end up with as many files as there are distinct numbers. How can I do that? Or can you suggest a better framework for this?

Besides, I want to serialize the events in Avro format, but it's not clear to me how to do so. Where do I have to define the schema?
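What I have found so far is the serializer property of the HDFS sink. If I read the Flume docs correctly, something like the lines below (a sketch, not tested) would write the events as Avro data files using Flume's built-in event schema, which still leaves me wondering where a custom schema would go:

agent.sinks.sink1.serializer = avro_event
# optional compression, per the Avro Event Serializer docs (assumption on my part)
agent.sinks.sink1.serializer.compressionCodec = snappy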

Thank you very much in advance.

Apache Avro Python fails when parsing a union schema

Sorry, but I am new to JSON. I have the following schema:

{
    "namespace": "some.namespace.cluster.message",
    "name": "Message",
    "type": ["null", "int"]
}

When parsing this with avro.schema.Parse, I receive the following error message:

TypeError: unhashable type: 'list'

Looking at the code, the union is actually stored as a list object, which of course cannot be used as a key in the if-statement that throws this error. But I cannot see what is wrong with the schema, or how a union is supposed to be detected by the parser. Is this a bug?
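For reference, these are the alternative forms I am experimenting with (a sketch, assuming avro-python3 1.8.x and its Parse function); I would still like to understand why the named form above is rejected:

import avro.schema

# A bare union is written as a JSON array, not as an object carrying a name.
union_schema = avro.schema.Parse('["null", "int"]')

# If the type needs a name, the usual pattern seems to be a record
# with a union-typed field.
record_schema = avro.schema.Parse('''
{
  "namespace": "some.namespace.cluster.message",
  "name": "Message",
  "type": "record",
  "fields": [
    {"name": "value", "type": ["null", "int"]}
  ]
}
''')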

Avro Writer in Python 3.5

I'm attempting to create an Avro file based on an existing schema, but I'm getting an error while using the syntax from the tutorial located here. The console says a 'bytes' object has no attribute 'to_json', but the failure is deep within the Avro library. Are there any good workarounds or ways to fix this error?

Full error:

Traceback (most recent call last):
  File "build_data.py", line 15, in <module>
    al.create_avro(logs)
  File "AppLog.py", line 57, in create_avro
    writer = DataFileWriter(open("new.avro", "wb"), DatumWriter(), schema)
  File "/usr/local/lib/python3.5/dist-packages/avro_python3-1.8.1-py3.5.egg/avro/datafile.py", line 151, in __init__
    self.SetMeta('avro.schema', str(writer_schema).encode('utf-8'))
  File "/usr/local/lib/python3.5/dist-packages/avro_python3-1.8.1-py3.5.egg/avro/schema.py", line 266, in __str__
    return json.dumps(self.to_json())
  File "/usr/local/lib/python3.5/dist-packages/avro_python3-1.8.1-py3.5.egg/avro/schema.py", line 808, in to_json
    to_dump['items'] = item_schema.to_json(names)
AttributeError: 'bytes' object has no attribute 'to_json'

Code:

schema = avro.schema.ArraySchema(open("AppLogs.avsc", "rb").read())
writer = DataFileWriter(open("new.avro", "wb"), DatumWriter(), schema)
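For comparison, the tutorial-style code I was trying to adapt looks roughly like this (a sketch, assuming the .avsc file contains plain JSON text):

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Parse the schema JSON into a Schema object first (avro-python3 names
# the function Parse, with a capital P).
with open("AppLogs.avsc") as f:
    schema = avro.schema.Parse(f.read())

writer = DataFileWriter(open("new.avro", "wb"), DatumWriter(), schema)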

Thanks for your help.

How to use ParquetReader with a path containing other directories?

I have an HDFS path structure like this: _P_ROOT=foo/_P_A=4444/_P_B=1/_P_C=20, where all my Parquet files sit under _P_C. Each _P_* component is a partition.

With Spark I am able to specify a path at any level, e.g. only up to _P_ROOT, _P_A, or _P_B. For example, with val df = sqlContext.read.parquet("_P_ROOT=foo/") my df will have the deeper partition directories as columns.

Now I would like to read the Parquet files without Spark.

I am using AvroParquetReader, following Read Parquet files from Scala without using Spark.

Here is a snippet of my code:

import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader
  .builder[GenericRecord](path)
  .build()
  .asInstanceOf[ParquetReader[GenericRecord]]

val iter: Iterator[GenericRecord] = Iterator
  .continually(reader.read)
  .takeWhile(_ != null)

This works if my path goes all the way down to _P_C. However, if I specify a path that contains other directories beneath it, I get an empty iterator.

So my question is: how can I read Parquet when the path I specify is not a leaf directory? Without Spark, of course.
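The only workaround I have come up with so far is to list the leaf files myself with the Hadoop FileSystem API and chain one reader per file, roughly like this (untested sketch; the .parquet suffix filter is an assumption about how the files are named):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val fs = FileSystem.get(new Configuration())
val statuses = fs.listFiles(new Path("_P_ROOT=foo/"), true) // recursive listing

// Collect the data files under the non-leaf partition directory.
var leafFiles = List.empty[Path]
while (statuses.hasNext) {
  val status = statuses.next()
  if (status.getPath.getName.endsWith(".parquet")) {
    leafFiles ::= status.getPath
  }
}

// Read every file and expose the records as a single iterator.
val records: Iterator[GenericRecord] = leafFiles.iterator.flatMap { p =>
  val reader = AvroParquetReader
    .builder[GenericRecord](p)
    .build()
    .asInstanceOf[ParquetReader[GenericRecord]]
  Iterator.continually(reader.read).takeWhile(_ != null)
}

One obvious drawback: unlike Spark, this does not add the partition values (_P_A, _P_B, ...) as columns, so they would have to be parsed out of each file's path.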

java.math.BigDecimal to Avro .avdl file

I'm having trouble writing an Avro schema for the java.math.BigDecimal type. I tried the following:

  1. Based on the official Avro doc, I know I need logical types to support BigDecimal, but that link gives an example only in avsc; I'm trying to figure it out in avdl.
  2. Based on the Avro doc and this example, I wrote the avdl below:

@namespace("test")
protocol My_Protocol_v1 {

  record BigDecimal {
    @java-class("java.math.BigDecimal") string value;
  }
}

But it's not working: this IDL schema compiles fine and generates a Java class called BigDecimal, but I cannot really use the generated BigDecimal as a java.math.BigDecimal. What's wrong, or how should I do it?
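For reference, the alternative I am considering is the decimal logical type, which (if I read the IDL docs correctly) can be written directly in avdl; whether the generated field actually becomes a java.math.BigDecimal seems to depend on enabling decimal logical type conversion in the compiler, e.g. the enableDecimalLogicalType option of avro-maven-plugin (an assumption on my part):

// Untested sketch, assuming Avro 1.8+ IDL support for decimal(precision, scale).
// Without decimal conversion enabled in the compiler, the field is plain bytes.
@namespace("test")
protocol My_Protocol_v1 {

  record Amount {
    decimal(9, 2) value;
  }
}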

Thanks a lot

Import data from .avro files into a Hive table

I created a Hive table with the following command and the Avro schema I had:

CREATE TABLE table_name
PARTITIONED BY (t string, y string, m string, d string, h string, hh string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://location/schema.avsc');

Now I want to dump the data I have in HDFS into the created table.

I have an HDFS location where the data sits in a directory structure like t/y/m/d/h/hh/data.avro. There are multiple directories, one per partition combination, because those are my partition columns.

I want to dump all of that data into the created table.

I tried using an external table, but it is giving exceptions.
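The only other approach I can think of is registering each directory as a partition by hand, roughly like this (a sketch; the partition values and the path are placeholders for my real layout):

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (t='t1', y='2017', m='01', d='01', h='00', hh='00')
  LOCATION 'hdfs://location/t1/2017/01/01/00/00/';

but doing that for every directory seems tedious, so I hope there is something better.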

Oracle + NiFi => all fields converted to String

I'm using NiFi to transfer Oracle (11g) database tables to HDFS (Avro format).

Something goes wrong with the typing of the Avro columns: they are all defined as String, even when the Oracle table column is of another type, such as Numeric, Timestamp or Datetime. Clearly, this is annoying :-)

After some googling, I found suggestions that the problem exists in the combination of certain Oracle JDBC drivers and the converter to Avro. However, I could not find a proper solution; does anyone know what JDBC (or other?) driver to use in order to have correct typing in the Avro output?

Why doesn't Apache Avro support the same field naming rules as JSON?

For one of my open-source projects I would like to write a plugin for Avro, but my project heavily uses a naming convention with dollar-suffixed variables. According to the Avro specification, naming is much stricter than in JSON. I would like to know why this decision was made, given that Avro schemas themselves are defined in JSON.

Excerpt from the specification:

Names: Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace. Equality of names is defined on the fullname.

The name portion of a fullname, record field names, and enum symbols must:

- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]

For example, a document with a field named like this:

{
  "meta$": 1
}

will fail with Error: invalid field name: meta$