
Test pipeline comparing objects using PAssert containsInAnyOrder()

I'm writing JUnit tests for an Apache Beam pipeline. I started with DoFnTester, but that has been deprecated, and the examples in the Apache Beam "Test Your Pipeline" documentation still refer to methods that have since been deprecated.

I'm now using the recommended TestPipeline and PAssert, but I'm having difficulty with PAssert because all the examples I've seen use Strings.

The pipeline outputs objects, so the test fails because it's comparing objects. My first instinct was to create a derived class and override equals(). I couldn't get that to work (I've only been using Java for a couple of weeks, which probably doesn't help): the test was still calling the original equals() method.
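A minimal sketch of the equals() approach, assuming a hypothetical element class named Reading (the name and fields are illustrative and not taken from the pipeline in question). containsInAnyOrder() compares elements through their own equals() method, so the override probably needs to live in the class the pipeline actually emits; overriding it in a subclass that the pipeline never instantiates would have no effect. The parameter type also has to be Object, because a method declared as equals(Reading) is an overload and the original Object.equals() would still be called.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical element type; the name and fields are assumptions for illustration.
final class Reading implements Serializable {
    private final String sensorId;
    private final double value;

    Reading(String sensorId, double value) {
        this.sensorId = sensorId;
        this.value = value;
    }

    // The parameter must be Object: equals(Reading) would only be an overload.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Reading)) {
            return false;
        }
        Reading that = (Reading) other;
        return sensorId.equals(that.sensorId)
            && Double.compare(value, that.value) == 0;
    }

    // Keep hashCode() consistent with equals().
    @Override
    public int hashCode() {
        return Objects.hash(sensorId, value);
    }
}
```

With value-based equals() and hashCode() on the emitted class, two separately constructed instances holding the same field values compare as equal, which is what an assertion on the pipeline output needs.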

I then thought of iterating through the output PCollection, since with the test pipeline there would only be one or two elements at most, but I couldn't find any examples of how to do this. I'm not sure if this is the right way to test anyway.

I've read through the Apache documentation and found this comment: "Object.equals(Object) is not supported on PAssert objects. If you meant to test object equality, use a variant of containsInAnyOrder(T...) instead.". That sounds like exactly what I need, but I don't know how to create a variant of containsInAnyOrder() to do this, and I couldn't find any examples.

I've run out of things to google! A lot of the examples are out of date (including Apache's own documentation), referencing methods that have been deprecated. All the examples I've found with PAssert assume the output elements are Strings.

All I need to do is override containsInAnyOrder so I can do my own object comparison. Can anyone help?

PAssert.that(output).containsInAnyOrder(expected);

If anyone could point me to some examples that would be much appreciated.
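One way to do a custom, order-insensitive comparison without touching equals() is sketched below in plain Java. The Event class, the AnyOrder helper, and the assertMatches method are hypothetical names introduced for illustration; none of them is part of Beam. The helper pairs every actual element with a distinct expected element using a caller-supplied predicate, and it assumes the predicate behaves like an equivalence (symmetric and transitive).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical output type that does not override equals().
final class Event {
    final String id;
    final long count;

    Event(String id, long count) {
        this.id = id;
        this.count = count;
    }
}

// Hypothetical helper, not a Beam API.
final class AnyOrder {
    private AnyOrder() {}

    // Throws AssertionError unless each actual element matches a distinct
    // expected element and no expected element is left over.
    static <T> void assertMatches(
            List<T> expected, Iterable<T> actual, BiPredicate<T, T> same) {
        List<T> remaining = new ArrayList<>(expected);
        for (T element : actual) {
            int index = -1;
            for (int i = 0; i < remaining.size(); i++) {
                if (same.test(remaining.get(i), element)) {
                    index = i;
                    break;
                }
            }
            if (index < 0) {
                throw new AssertionError("Unexpected element in output");
            }
            remaining.remove(index); // int argument: removes by position
        }
        if (!remaining.isEmpty()) {
            throw new AssertionError(
                remaining.size() + " expected element(s) missing from output");
        }
    }
}
```

In a Beam test, a helper like this could be called from inside the function passed to PAssert.that(output).satisfies(...), which receives the output as an Iterable and returns null. Anything that function captures has to be serializable, so the expected values and the comparison logic are probably best created inside it.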

Read JSON with Beam

Is it possible to read JSON with Apache Beam?

Currently, I'm reading text files with TextIO.read() and storing the result in a PCollection.

I would like to know if it is possible to read JSON files and store them in a PCollection without multiple transfers, which wouldn't be very efficient.

Thanks