Continuous data pipeline – continuous delivery and data engineering

July 4, 2016 bigdatanerd

Data engineering and continuous delivery:

We are witnessing the evolution of the web from Web 2.0, with its social engagement, to self-intelligent, data-driven applications. Whether it is a retail app, a CRM or a healthcare system, every application will be driven by data. The quest to provide a personalized experience increases the adoption of big data.

As the adoption of big data grows, so does the complexity of the data pipeline. A change of strategy increases the consumption of data from various sources, produced internally as well as externally. A reliable and agile data pipeline is the backbone that lets an organization move quickly and win clients. Continuous data pipeline principles are more important than ever in data engineering.

The current challenges with data pipeline:

Cascading system failure:

A data pipeline is a continuous delivery system that follows the principles of workflow patterns. Any faulty behaviour of an upstream component can potentially affect every downstream component. This leads to cascading system failure and a bad user experience.

High Risk releases:

The lack of a continuous delivery system decreases confidence in the release cycle. One needs to put multiple levels of checks in place before pushing a job into production. The manual process increases bureaucracy and reduces agility.

Delayed time to market:

Cascading failures and high-risk releases add more complexity to the data pipeline engine. This ultimately delays time to market, since even a simple change request needs multiple cautious efforts to deliver.

High cost of maintenance:

The lack of a continuous delivery system results in creating experts of the system. This creates technical debt, as knowledge of the system is not spread equally across the team. It also means hiring and retaining specialists, which brings high cost as well.

Continuous data pipeline delivery system:

Build:

Unit testing, integration testing and code coverage give a high level of confidence in the individual pieces of code that we deliver. The MapReduce framework has MRUnit as its unit testing framework. Cloudera has a very good blog post on unit testing Apache Spark with Spark Testing Base. The HBase mini cluster provides a comprehensive integration testing utility for HBase. Kafka supports unit and integration testing via the Kafka server. The Jarvis project has complete code examples covering various integration testing utilities. It is a best practice to follow a “two-vote code review” process. This not only reduces risk but also spreads knowledge across the team, thereby reducing technical debt.

Automated acceptance testing:

Microcosm testing (a known set of inputs checked against a known set of outputs) is a critical backbone of the continuous data pipeline. During its lifetime, a data pipeline either transforms or cleans data between consuming it from a source and sinking it into another data storage engine. Data transformation and cleanup always depend on business rules. It is important to have a solid microcosm testing system after the build to make sure we are not breaking those business rules anywhere in the data pipeline. A microcosm testing system gives a high degree of confidence in the business functionality and in the ability to support the other data applications that depend on it.
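A microcosm check can be sketched in plain Java as follows. The business rule (trim and lower-case an email, drop blank records), the class name `MicrocosmCheck` and its methods are all assumptions made up for illustration; a real pipeline would run its own transformation against its own fixed dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical cleaning step plus a known-input / known-output check.
class MicrocosmCheck {

    // Assumed business rule: trim and lower-case emails, drop null or blank records.
    static List<String> cleanEmails(List<String> raw) {
        List<String> cleaned = new ArrayList<>();
        for (String value : raw) {
            if (value == null) {
                continue;
            }
            String email = value.trim().toLowerCase(Locale.ROOT);
            if (!email.isEmpty()) {
                cleaned.add(email);
            }
        }
        return cleaned;
    }

    // Runs the step on a fixed input and compares the result with the fixed expected output.
    static boolean passes() {
        List<String> knownInput = Arrays.asList(" Alice@Example.com ", "", "BOB@example.com");
        List<String> knownOutput = Arrays.asList("alice@example.com", "bob@example.com");
        return cleanEmails(knownInput).equals(knownOutput);
    }

    public static void main(String[] args) {
        System.out.println(passes() ? "microcosm check passed" : "microcosm check FAILED");
    }
}
```

The point of the design is that the input and expected output are frozen alongside the code, so any change that alters a business rule shows up as a failed comparison right after the build.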

Automated Workflow planning:

It is surprising to see that most of the open source workflow scheduling engines (Oozie, Airflow) are built without any intelligence around them. One common problem in a complex data pipeline is that we need to know the complete lineage of the jobs. We need it because, when we deploy, we have to time the deployment so that all the dependencies are satisfied.

Capacity planning is another major pain point with workflow systems. We do have some very good visualizations that show job start and end times, resource utilization and so on. A continuous data pipeline requires an intelligent workflow scheduling engine that automatically understands the dependency lineage, the capacity of the cluster and the SLA associated with the pipeline, and acts on them.
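The dependency-lineage part of that idea can be illustrated with a small topological sort. This is only a sketch under assumptions: the class `JobLineage`, the job names and the map-based input are hypothetical, it says nothing about how Oozie or Airflow work internally, and it ignores cluster capacity and SLAs entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: orders jobs so that every job runs after its upstream jobs.
class JobLineage {

    // upstream maps each job to the jobs it depends on (Kahn's algorithm).
    static List<String> runOrder(Map<String, List<String>> upstream) {
        Map<String, Integer> pending = new TreeMap<>();          // unmet dependencies per job
        Map<String, List<String>> downstream = new HashMap<>();  // job -> jobs waiting on it
        for (Map.Entry<String, List<String>> e : upstream.entrySet()) {
            pending.merge(e.getKey(), e.getValue().size(), Integer::sum);
            for (String dep : e.getValue()) {
                pending.putIfAbsent(dep, 0);
                downstream.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job);
            for (String next : downstream.getOrDefault(job, Collections.emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cyclic dependency between jobs");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> upstream = new HashMap<>();
        upstream.put("report", Arrays.asList("clean"));
        upstream.put("clean", Arrays.asList("ingest"));
        upstream.put("ingest", Collections.<String>emptyList());
        System.out.println(runOrder(upstream));
    }
}
```

A scheduler that holds the lineage in this form can answer "what must finish before this job is deployed" directly, instead of relying on hand-tuned start times.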

Manual Testing:

A data pipeline requires manual testing to make sure the system we built functions correctly and that we are 100% confident about deploying the job. Manual testing is often most important for new feature development, where we have not yet established comprehensive automated verification, and for important business rule changes in the data pipeline. We should be able to expose any part of the data in the pipeline to an ad hoc query engine. Ad hoc query engines like Presto, Impala and Drill make information consumption a lot easier without disturbing the data pipeline. Manual testing is often carried out against a sample set of data or a random sample.
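For the random-sampling case, one common technique is reservoir sampling, which draws a fixed-size uniform sample from a stream whose length is not known in advance. The sketch below is an illustration only; the class name `ReservoirSample` and its signature are assumptions, and in practice the sample might just as well be drawn with a query in one of the engines above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for pulling a random sample of records for manual inspection.
class ReservoirSample {

    // Keeps a uniform random sample of at most k records from a stream (Algorithm R).
    static <T> List<T> sample(Iterator<T> records, int k, Random random) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(record);                              // fill the reservoir first
            } else {
                long slot = (long) (random.nextDouble() * seen);    // uniform in [0, seen)
                if (slot < k) {
                    reservoir.set((int) slot, record);              // replace with probability k / seen
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6");
        System.out.println(sample(rows.iterator(), 3, new Random()));
    }
}
```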

Deploy Job:

The best practice is to treat the staging and production environments as candidate 2 and candidate 1. After a feature has gone through manual testing, the artifact is promoted from snapshot to the staging (candidate 2) environment. If the job runs successfully and satisfies the performance requirements, the artifact is promoted to the production (candidate 1) environment.

Continuous Monitoring:

Continuous monitoring is an important part of a continuous data pipeline. In a typical organization there are multiple teams producing different data sources, and multiple teams consuming that data to power their services. It is important to provide data lineage, data quality, clear ownership of data and a data dictionary. A new artifact can get deployed multiple times a day, which could potentially change this information. It is important to automate these services so that downstream jobs can easily track the changes. These tools bring greater visibility and transparency to the entire system. Twitter has a very good blog post about their continuous monitoring system.

An Introduction To Enterprise data lake – The myths and miracles

January 1, 2016 bigdatanerd

Data lake : A brief history

The term big data lake was coined by James Dixon, the CTO of Pentaho. Though the term was initially coined as a contrast with the data mart, it soon became very popular in the big data world. PwC subsequently suggested that the data lake could potentially end data silos, which are a major concern for enterprises. Given the maturity of the concept and the technology, very few projects have been deployed successfully as a big data lake. In the rush to get their hands on big data and market themselves as big data companies, many started to dump all their data into HDFS and, over time, forgot about it. The key to success is not dumping all the data, but creating a meaningful data lake that increases the speed of extracting value from it.

Data lake is not just a storage or processing unit, it’s a process to unleash the value of data.

Why do we need a big data lake?

Every industry has a potential big data problem. In the digital era of social media and IoT technologies, customers now interact across a variety of channels. These interactions create what we call big data. Creating a 360-degree view and establishing a single source of truth about their clients is a nightmare for most companies. The importance of the data lake can be summarized by the quote below:

Every product and service will go digital, creating vast quantities of data which may be more valuable than the products themselves. - Steve Prentice (Gartner Fellow)

The life cycle of data lake

The data lake life cycle is iterative in nature. A typical data lake follows a three-step process that keeps getting iterated.

1. Data source integration:

The data lake process starts with data ingestion. Data ingestion is always done at the very granular level of an event, without any assumption about the data. The data ingestion process is often referred to as an “as it happened” mirror of your data source. The nature of big data, with its volume, variety and velocity, increases the complexity of data integration. We no longer have traditional RDBMSs alone as data sources. Data lake creation starts with a handful of identified business-critical data sources, and more data sources are added later. This simplifies the complex data ingestion process.

Complexities of data ingestion process:

When we add a new data source, we may not know the business processes that act on it. Data storage optimization will be a challenge, since we may not know the access pattern upfront. Data sources may include complex data types that are hard to convert to a relational structure upfront without knowing the significance of the data.

Iterative Data ingestion pattern:

The data ingestion process includes data de-duplication and data enrichment as well. Business process identification yields the data access pattern. The findings about the data access pattern are then looped back into the data ingestion strategy to enhance the de-duplication and enrichment processes. The initial output of the data ingestion process is a set of loosely coupled, complex entities, which are enhanced over time into a denormalized, flattened, enriched and easily queryable dataset.

Technologies: Apache Kafka, Apache NiFi, Apache Flume, Apache Sqoop and Druid.
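As a rough illustration of the de-duplication step, the sketch below keeps the first occurrence of each event and preserves arrival order. It assumes every event carries an id assigned by its source; the `Event` type, its fields and the class name `IngestDedup` are hypothetical and not tied to any of the tools listed in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical de-duplication step for ingested events.
class IngestDedup {

    // Assumed raw event: an id assigned by the source plus an opaque payload.
    static final class Event {
        final String id;
        final String payload;

        Event(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Keeps the first occurrence of each event id and preserves arrival order.
    static List<Event> deduplicate(List<Event> events) {
        Map<String, Event> firstSeen = new LinkedHashMap<>();
        for (Event event : events) {
            firstSeen.putIfAbsent(event.id, event);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<Event> raw = Arrays.asList(
                new Event("e1", "login"), new Event("e2", "click"), new Event("e1", "login"));
        System.out.println(deduplicate(raw).size() + " unique events");
    }
}
```

As the access pattern becomes clearer, the choice of key (here a single id) is the part that would typically be revisited.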

2. Business process discovery:

Business process discovery is the most important process of data lake creation. The true value of a data lake can be realized only if business process discovery can be achieved without great effort. The business discovery process starts with exploratory analysis, querying the data to identify the hidden value in it. Data stewards and business analysts also play a vital role in exploring the data, both providing and gaining valuable insights. The exploratory analysis tool is often an MPP query engine with a SQL-like abstraction. Exploratory analysis can be performed to achieve the following objectives:

Validate a business process theory
Discover a new business process
Derive business intelligence via descriptive analysis
Serve as a foundational platform for predictive analytics.

Technologies: Impala, Presto, Drill and Apache Pig

3. Serving data products with data insights store:

Once the business process is identified, we need to create a data store that can easily serve as the data layer of an application. The data insights store is closely related to the data mart. A data insights store tends to be highly normalized and optimized for the access pattern of a particular business process. Though the data insights store is tightly coupled with a business process, it is important to identify “conformed dimensions” across business processes. This significantly reduces the computation each business process needs to derive its insights. It is also recommended to store the roll-up dimension relationships along with the data insights store, in order to avoid duplicate computations.

Technologies: Apache HBase, Elasticsearch and other NoSQL storage engines.

Data warehouse vs Data lake:

No	Data warehouse	Big data lake
1	The process starts with business process identification, often driven by data stewards and business owners, with certain assumptions about the data and the business.	No assumption is made about the data. We start collecting the data at a granular level, as it happened. Business process discovery is based on the data, with input from data stewards and business owners.
2	Database schema evolution is very hard, given the nature of relational data systems.	Complex data types are supported, and rebuilding relationships is much easier.
3	Very static, since the business process drives the design.	Very dynamic, since the business process is identified from the data.
4	Roll-up and drill-down analysis is harder, since the design may need to compromise on the granularity of the data in order to reduce its complexity.	Exploratory analysis is much simpler, since data is collected at a granular level.
5	Serves predefined business needs.	Ignites innovation and new business opportunities.
6	Limited support for complex data types.	Supports structured, semi-structured and unstructured data.

Big data lake: will it replace the traditional data warehouse?

The politically correct answer is that the big data lake is complementary to the data warehouse. This is true to a certain extent, as many companies have well-established data warehouse systems, while big data systems are still very young but growing rapidly. The big data lake will grow hand in hand with the data warehouse for a certain period. Sooner or later, enterprises will mature enough to handle big data lakes, and maintaining two systems will become redundant. One can argue that the data warehouse can be one of the data sources for the big data lake, but that is a totally wrong design, since you already made some assumptions about your data while designing your data warehouse. I believe the big data lake will eventually make the data warehouse redundant, but data warehouse concepts like dimensional modeling will be well adopted by big data lake systems. The big data lake is just another evolution of the data warehouse.

LAMBDA ARCHITECTURE – PART 2 – LAMBDA ARCHITECTURE

April 10, 2014 / January 2, 2016 bigdatanerd

Over the last couple of years, an immense number of innovative tools have emerged around big data technologies. Each tool has its own merits and demerits. Each tool needs a fair amount of expertise and infrastructure management, since it is going to deal with a large amount of data. One architecture philosophy I always like is “Keep it simple”. The primary motive behind this design is to make sure there is only one enterprise data hub management software to fit Lambda Architecture into. This is my thought process on how we can fit Lambda Architecture within the Cloudera enterprise data hub.

For a brief introduction to Lambda Architecture, please see part 1 of Lambda Architecture.

Let’s walk through each layer in Lambda Architecture and examine which tools we can use within the Cloudera distribution.

Data Ingestion Layer:

Though Lambda Architecture doesn’t say much about the data source and data ingestion layer, during my design I found that understanding this layer is very important.

Before choosing the tools for data ingestion, it is important to understand the nature of the data sources. We can broadly classify data sources into four categories.

Batch files:

Batch files are data periodically delivered into a file system. In practice we consume them as large chunks of data periodically (typically once a day). Examples are XML or JSON files from external or internal systems.

DB Data:

Traditional warehouse and transaction data is usually stored in an RDBMS. This will be well-structured data.

Rotating Log files:

Rotating log files are usually machine-generated data that keeps appending immutable records to the file system. In most use cases it will be either structured or semi-structured data.

Streaming Data:

I would call this the modern data source. Streaming data is usually accessed through a firehose API, which keeps delivering data as it arrives. A good example is the Twitter firehose API.

Technology choice:

Apache Flume for Rotating Log files, batch files and streaming data.

Apache Sqoop for getting data from databases.

Speed Layer:

Technology choice: Spark Streaming and the Spark ecosystem

Spark is phenomenal with its in-memory computing engine. One could argue in favor of Apache Storm. Though I’ve not used Apache Storm much, Spark stands out for its concept of “data local” computing. The amount of innovation within the Spark core context and Spark RDD makes Spark a perfect fit for the speed layer (Mahout recently announced that they are going to rewrite Mahout on the Spark ecosystem).

Batch Layer:

Technology Choice:

Master data: Apache Hadoop Yarn – HDFS with Apache Avro & Parquet

Batch view processing: Apache Pig for data analytics, Apache Hive for data warehousing and Cloudera Impala for fast prototyping and ad hoc queries. Apache Mahout for machine learning and predictive analysis.

Apache YARN is a step forward for the Hadoop ecosystem. Its clear separation of the MapReduce programming paradigm from HDFS lets other programming paradigms run on top of it. It is important to move to YARN to keep your big data enterprise data hub open to innovation.

Data serialization is an important aspect of maintaining a big data system. It is important to enforce schema validation before storing data. This reduces surprises when we do analytics on top of it and saves a lot of development time.

Columnar storage with Parquet: Hadoop was designed to read row by row. The master data design is de-normalized, hence there will be N columns in a row. When we do analysis, we don’t want all the data to be loaded into memory; we need only the data we really require. Parquet enables us to load only the data we require into memory, which increases processing speed and makes memory utilization more efficient. Parquet has out-of-the-box integration with Avro as well.

Apache Spark, Apache Pig, Apache Hive and Impala have out-of-the-box integration with Parquet as well.
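The idea of refusing records that do not match a schema can be shown in a few lines of plain Java. This is deliberately not Avro's API: the class `SchemaGate`, the map-based schema and the field names are assumptions for illustration, and with Avro the schema would be declared once and enforced by the serializer instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical gate that checks a record against an expected schema before it is stored.
class SchemaGate {

    // Returns the list of violations; an empty list means the record may be stored.
    static List<String> validate(Map<String, Class<?>> schema, Map<String, ?> record) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Class<?>> field : schema.entrySet()) {
            Object value = record.get(field.getKey());
            if (value == null) {
                problems.add("missing field: " + field.getKey());
            } else if (!field.getValue().isInstance(value)) {
                problems.add("wrong type for field: " + field.getKey());
            }
        }
        for (String name : record.keySet()) {
            if (!schema.containsKey(name)) {
                problems.add("unexpected field: " + name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("userId", Long.class);
        schema.put("country", String.class);

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("userId", "42"); // wrong type on purpose
        System.out.println(validate(schema, record));
    }
}
```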

Servicing Layer:

Technology Choice: Apache HBase

The only NoSQL solution in the Hadoop ecosystem. This is a bit of a tough choice. The servicing layer needs to be highly available, since all the external consumer-facing applications will access it. HBase’s master/slave architecture makes this a little tough, and it needs a lot of monitoring. Region failure, MTTR (mean time to recovery) and high availability of the master node are some of the concerns while maintaining HBase. There is a lot of activity happening to make the HBase master highly available and to improve MTTR.

Lambda architecture – Part 1 – An Introduction to Lambda Architecture

April 9, 2014 bigdatanerd

Over the last couple of years, people have been trying to conceptualize big data and its business impact. Companies like Amazon and Netflix pioneered this space and delivered some of the best products to their customers. We should thank Amazon for bringing data-driven business to the end consumer market. The big data paradigm has now moved from a conceptual understanding to real-world products. All the major retailers, dot-com companies and enterprise products focus on leveraging big data technologies to produce actionable insights and innovative products. These systems have matured to the extent that they could potentially replace traditional data warehousing solutions.

How did this big data shift happen?

It comes from a fundamental change in how we think about storing and analyzing data. The shift happens the moment you start to think of data as:

Immutable in nature
Atomic in nature, in that one event log is independent of other events.

Traditional databases were designed to store the current state of an event (with update semantics and underlying data structures to support it). This made traditional RDBMS systems a poor fit for the big data paradigm. Numerous NoSQL solutions started to appear to address the problem (see my earlier blog post on HDFS vs RDBMS).

Now we need an architectural pattern to address our big data problem. Nathan Marz proposed the Lambda Architecture for big data. In this two-part blog post I’m going to give a brief overview of Lambda Architecture and its layers. In the second post I’m going to walk you through my thought process of designing a Lambda Architecture with the Cloudera Hadoop Distribution (CDH).

“Lambda” in Lambda Architecture:

I’m not sure of the reason behind the name Lambda Architecture. But I feel “Lambda” fits perfectly here, because “Lambda” is the shield pattern used by the Spartans to handle a large volume, variety and velocity of opponents. (Yes, the impact of the movie 300 🙂 )

Picture : Lambda Architecture

Layers in Lambda Architecture:

Lambda Architecture has three main layers.

Batch Layer
1. It is the storage engine that stores immutable, atomic events.
2. The batch layer is a fault-tolerant, replicated storage engine that prevents data loss.
3. The batch layer supports running batch jobs on top of it and produces periodic batch views for the serving layer, for the end services to consume and query.
Speed Layer
1. This is a real-time processing engine.
2. The speed layer won’t persist any data or provide any permanent storage. If raw data processed via the speed layer needs to be persisted, it is persisted in the master data.
3. The speed layer processes data as it comes in, or at specific short time intervals, and produces real-time views for the servicing layer.
Servicing Layer
1. The servicing layer gets updated from the batch layer and the speed layer, either periodically or in real time.
2. The servicing layer should combine results from both the speed layer and the batch layer to provide a unified result.
3. The servicing layer is usually a key/value and in-memory storage engine with high availability.
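
How the servicing layer might combine the two views can be sketched as below. The example assumes the views are per-key counts, so that merging is a simple addition, and that the real-time view only holds events that arrived after the last batch run; the class `ServingLayerQuery` and the page names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical query-time merge of a batch view and a real-time view.
class ServingLayerQuery {

    // Batch view: counts up to the last batch run. Real-time view: counts since then.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        for (Map.Entry<String, Long> e : realtimeView.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("home", 100L);
        batch.put("cart", 40L);

        Map<String, Long> realtime = new HashMap<>();
        realtime.put("home", 3L);
        realtime.put("checkout", 1L);

        System.out.println(merge(batch, realtime));
    }
}
```

Addition only works for views that are additive, such as counts and sums; other aggregates would need their own merge rule.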

Hive, Impala and Presto – The War on SQL over Hadoop

November 19, 2013 / November 20, 2013 bigdatanerd

I feel the logo of an infant elephant is no longer apt for Hadoop. It is well established and growing faster and stronger. Some people keep up with its speed, and some find it hard to keep pace. To bridge that gap, there is enormous activity going on to bring traditional SQL to Hadoop. Facebook started to develop Hive around 2007 and open-sourced it at the end of 2008. Ever since, the popularity of SQL over Hadoop has been growing. In October 2012, Cloudera announced Impala, which claims to be a near-real-time ad hoc big data query processing engine that is faster than Hive. Facebook jumped into the picture again and announced Presto last month. There is also an open source project called Apache Drill that focuses on ad hoc analysis.

Let’s take a look at the bigger picture of how these systems interact with the larger Hadoop ecosystem.

Overall architecture of Hadoop, Hive and Impala

In short, Hive converts a HiveQL query into a sequence of MapReduce jobs to produce the results, while Presto and Impala are distributed query engines inspired by the Google Dremel paper.

HiveQL:

One common thing among all three systems is that they all support a common standard called HiveQL (it needs a better common name soon?). Though HiveQL is based on SQL, it does not strictly follow the SQL-92 specification.

How does Hive work?

Hive maintains its own metadata storage, where it keeps metadata such as schema definitions, table definitions and the name node that contains the respective data.

There is a Hive metastore client that exposes all the metadata as a service. It can be accessed via Thrift, which makes the Hive metastore interoperable with external systems. This gave Impala and Presto the advantage of using the existing infrastructure and building on top of it.

Hive receives the query in HiveQL, parses it and converts it into a series of MapReduce jobs.

How do Impala & Presto work?

Both Presto and Impala leverage the Hive metastore to get the name node information. They then talk directly to the name node and the HDFS file system and execute the queries in parallel. They then merge the results and stream them back to the user. The entire process happens in memory, thereby eliminating the disk IO latency that occurs extensively during a MapReduce job.

The comparison:

Hive

Advantage	Disadvantage
It has been around for 5 years. You could say it is a mature and proven solution.	Since it uses MapReduce, it carries all the drawbacks of MapReduce, such as an expensive shuffle phase and heavy IO operations.
Runs on the proven MapReduce framework.	Hive still does not support multiple reducers, which makes queries like GROUP BY and ORDER BY a lot slower.
Good support for user-defined functions.	A lot slower compared to its competitors.
It can be mapped to HBase and other systems easily.

Cloudera Impala:

Advantage	Disadvantage
Lightning speed, and it promises near-real-time ad hoc query processing.	No fault tolerance for running queries. If a query fails on a node, the query has to be reissued; it can’t resume from where it failed.
The computation happens in memory, which removes an enormous amount of latency and disk IO.	Still no UDF support.
Open source, Apache licensed.	Custom SerDes are not yet supported.

PrestoDB:

Advantage	Disadvantage
Lightning fast, and it promises near-real-time interactive querying.	It’s a newborn baby. We need to wait and watch, since there is some interesting active development going on.
Used extensively at Facebook, so it is proven and stable.	As of now it supports only Hive managed tables. Though the website claims one can also query HBase, the feature is still under development.
Open source, and there has been strong momentum behind it ever since it was open-sourced.	No UDF support yet. This is the most requested feature.
It also uses a distributed query processing engine, so it eliminates the latency and disk IO issues of traditional MapReduce.
Well documented. Perhaps this is the first open source software from Facebook that got a dedicated website from day 1.

What to watch next?

This is the most happening area in big data analytics right now. The contents of this blog post may not be relevant a month from now, given the amount of activity going on across all these platforms. Some of the interesting things to watch are:

1. Hortonworks Stinger project: Hortonworks has put its bet on Hive and started an initiative to make Hive 100x faster. They have already delivered two milestones and are working on the final phase. They aim to integrate Hive with another open source project called Apache Tez, which is again a distributed query engine.

2. Cloudera is also contributing a lot to the Stinger project. It will be interesting to see how this fits with their approach to Impala.

3. What will happen to the Drill project if Presto gets into the Apache Incubator (I’m sure it will be soon)?

4. How popular Presto will become.

Let’s watch and see 🙂

Edit

Thanks, Greg and Justin. Yes, I was wrong about the Impala license. I found it in their blog answer and in the Quora answer as well.

MapReduce – Running MapReduce in Windows file system – Debug MapReduce in Eclipse

November 14, 2013 bigdatanerd

The distributed nature of the Hadoop MapReduce framework makes debugging a little harder. Often we want to test our MR jobs on a small amount of data before deploying them. There are some good tutorials on configuring Hadoop development with Eclipse. The major concern is that, given the nature of the HDFS file system, it is hard to attach the debugger in a Windows environment. This is a little hack that makes Hadoop take its input from the Windows file system and run the MapReduce job locally. This is a faster and more flexible way of developing.

Let’s extend LocalFileSystem and override it for our Windows file system.


package org.ananth.learning.fs;

import java.io.IOException;

import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

/**
 * Local file system that tolerates permission failures on Windows.
 */
public class WindowsLocalFileSystem extends LocalFileSystem {

    public WindowsLocalFileSystem() {
        super();
    }

    /** Creates the directory first, then applies the permission separately. */
    @Override
    public boolean mkdirs(final Path path, final FsPermission permission) throws IOException {
        final boolean result = super.mkdirs(path);
        this.setPermission(path, permission);
        return result;
    }

    /** Logs and ignores the IOException raised when setting the permission fails. */
    @Override
    public void setPermission(final Path path, final FsPermission permission) throws IOException {
        try {
            super.setPermission(path, permission);
        } catch (final IOException e) {
            System.err.println("Can't help it, hence ignoring IOException setting permission for path \""
                    + path + "\": " + e.getMessage());
        }
    }
}

Then all you need to do in your driver class is:


package org.ananth.learning.mapper;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MutualfriendsDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // Pass the command-line arguments through and use the job status as the exit code
        System.exit(ToolRunner.run(new MutualfriendsDriver(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Use the local file system and the local job runner instead of a cluster
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        // Route file:// paths through the WindowsLocalFileSystem class shown earlier
        conf.set("fs.file.impl", "org.ananth.learning.fs.WindowsLocalFileSystem");
        conf.set("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization,"
                + "org.apache.hadoop.io.serializer.WritableSerialization");

        Job job = new Job(conf, "Your Job name");

        // Set your Mapper and Reducer for the job

        // Set your input and output classes

        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        // Report failure to the caller instead of always returning 0
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

The input and output paths should be located in the project root directory. Now you are all set: you can run the MR job on your local Windows machine.

TDD – An Introduction to Test Driven Development

November 8, 2013 bigdatanerd

In my previous posts I covered some automated integration testing frameworks, such as Arquillian and Selenium. I explained how important solid automated testing is for continuous code refactoring and for evolving the technology architecture based on dynamic business needs. In this post I’m going to share some of my views and understanding of Test Driven Development (TDD).

What is TDD ?

Kent Beck introduced JUnit (rated as one of the top 5 tools ever made for Java technology) and, with it, rediscovered the whole Test Driven Development process. According to the Wikipedia definition, TDD is a development process that relies on the repetition of a very short development life cycle. The developer writes a test case showing how the system fails and then refactors the code to make it succeed.

TDD Process :

Many misunderstand TDD as being all about writing test cases. TDD is a process that differentiates software engineering from plain programming. It has the following steps.

1. Plan:

Read the requirement and the business use case. Plan which methods you are going to implement and how you are going to implement them.

2. Write a test case to fail:

This is very important. Once you have done the planning, don’t jump into the implementation. Write test cases for the function and show in what ways it can fail.

3. Implement the functionality:

Now you refactor the implementation code as per the requirement.

4. Write test cases to pass:

Now the method is fool-proof: we have covered all the scenarios in which it must not fail. Run the test cases and see them pass.

5. Repeat (1 – 4)

TDD basically enforces the very basics of software programming: “Code for failure”. If you are new to software programming, a typical method should look like:

function x(int x) {

<Pre condition> (what should we do if the value of x is undesirable?)

Your business logic

<Post condition> (did we get the desired result?)

}

Let’s take a simple example. We need to implement a simple divider function that takes two integers as input and produces their quotient as output. We have two simple business validations.

1. The denominator should not be zero.

2. The result should not be a negative number (which means neither of the variables should be negative).

Lets do a TDD.

1. Plan:

As we have two business validations, we need a custom exception class.

Write a simple method that takes two integer parameters and performs the division.

2. Write test cases to fail

Let’s write the basic classes now.

DataException.java (the exception class)


package org.ananth.learning.tdd;

/**
 * This is the custom data exception
 * @author Ananth
 *
 */

public class DataException extends RuntimeException{

 public DataException(String message) {
 super(message);
 }

}

SimpleDivider.java (the first implementation, without any validation)

package org.ananth.learning.tdd;

/**
 * Simple divider implementation
 * @author Ananth
 *
 */

public class SimpleDivider {

    /**
     * Takes integers a and b and returns the quotient a / b
     * @param a the numerator
     * @param b the denominator
     * @return the integer quotient of a and b
     */
    public Integer divide(Integer a, Integer b) {

        return a / b;

    }

}

Now the test cases to fail


package org.ananth.learning.tdd.test;

import static org.junit.Assert.*;

import org.ananth.learning.tdd.DataException;
import org.ananth.learning.tdd.SimpleDivider;
import org.junit.Test;

/**
 * Test methods for simple divider
 * @author Ananth
 *
 */
public class SimpleDividerTest {

 /**
 * Denominator Zero
 */
 @Test(expected = DataException.class)
 public void testZeroDivisor() {
 new SimpleDivider().divide(10, 0);
 }

 /**
 * Negative denominator and positive Numerator
 */
 @Test(expected = DataException.class)
 public void testNegativeDivisorA() {
 new SimpleDivider().divide(10, -2);
 }

 /**
 * Negative Numerator and positive denominator
 */

 @Test(expected = DataException.class)
 public void testNegativeDivisorB() {
 new SimpleDivider().divide(-10, 2);
 }

 /**
 * Negative Numerator and denominator
 */

 @Test(expected = DataException.class)
 public void testNegativeDivisorAB() {
 new SimpleDivider().divide(-10, -2);
 }


 /**
 * Actual Test to pass
 */
 @Test
 public void testDivisor() {
 assertEquals(Integer.valueOf(5), new SimpleDivider().divide(10, 2));

 }

}

If you run the test cases now, you can see that every test case except the last one fails, because we have not yet built failure handling into our implementation method.
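To see why those four tests fail, it helps to look at what plain integer division does with the same inputs. The sketch below uses only the JDK; the class `UnguardedDivision` and its helper `outcome` are hypothetical names for illustration, while `divide` has the same body as the first version of the method above.

```java
/** Shows what unvalidated integer division does with the inputs the tests use. */
final class UnguardedDivision {

    /** Same body as the first version of divide(): no validation at all. */
    static Integer divide(Integer a, Integer b) {
        return a / b;
    }

    /** Describes the outcome of divide(a, b) as text. */
    static String outcome(Integer a, Integer b) {
        try {
            return "returned " + divide(a, b);
        } catch (RuntimeException e) {
            return "threw " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(10, 0));   // threw ArithmeticException
        System.out.println(outcome(10, -2));  // returned -5
        System.out.println(outcome(-10, 2));  // returned -5
        System.out.println(outcome(-10, -2)); // returned 5
        System.out.println(outcome(10, 2));   // returned 5
    }
}
```

None of the first four calls throws a DataException: the zero divisor raises the JDK's own ArithmeticException, and the negative inputs simply return a value. A test declared with expected = DataException.class fails in both situations.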

Step 3: Refactor the code.

Now I’ve refactored the implementation method to include preconditions that handle the failure cases.


public Integer divide(Integer a, Integer b) {

 if(b == 0) {
 throw new DataException("Can't allow zero as divisor");
 }

 if(a < 0 || b < 0) {
 throw new DataException("Values can't be in negative");
 }

 return a / b;

 }

Now you can see that all the preconditions are properly implemented and the exceptions are thrown.

Step 4: See the tests pass

Now you can rerun the test cases and see everything pass.
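If you want to check the same behavior without a test framework, a plain Java harness is enough. This is only a sketch: `DividerCheck` and `rejects` are hypothetical names, and the nested `DataException` is a stand-in for the class shown earlier so that the snippet compiles on its own.

```java
/** Plain-Java check of the guarded divide(); no test framework required. */
final class DividerCheck {

    /** Stand-in for the DataException class shown earlier. */
    static final class DataException extends RuntimeException {
        DataException(String message) {
            super(message);
        }
    }

    /** Same preconditions as the refactored divide() above. */
    static Integer divide(Integer a, Integer b) {
        if (b == 0) {
            throw new DataException("Can't allow zero as divisor");
        }
        if (a < 0 || b < 0) {
            throw new DataException("Values can't be in negative");
        }
        return a / b;
    }

    /** Returns true when divide(a, b) is rejected with a DataException. */
    static boolean rejects(Integer a, Integer b) {
        try {
            divide(a, b);
            return false;
        } catch (DataException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects(10, 0));   // true
        System.out.println(rejects(10, -2));  // true
        System.out.println(rejects(-10, 2));  // true
        System.out.println(rejects(-10, -2)); // true
        System.out.println(divide(10, 2));    // 5
    }
}
```

Note that the zero check comes first, so divide(-10, 0) reports the zero divisor instead of the negative value.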

Step 5:

Take another modular method and repeat step 1-4.

Happy TDD!!!!

Continuous code re-factoring – part-2 An Introduction to Selenium web automation testing

October 28, 2013 bigdatanerd

Introduction:

Browser automation testing plays a major role in integration testing. To measure the quality of our web application we need certain quality metrics. We should be able to mimic the behavior of end users: the way they navigate and the way they consume information.

Why do we need browser automation?

Simply, the evolution of front-end technologies. There are some pretty good JavaScript-based frameworks, such as Backbone.js, Knockout.js and Google’s very own AngularJS, that offer excellent support for the MVC and MVVM patterns. These frameworks redefine the user experience of web applications, so it is important to have browser-based automation testing to measure product quality.

Selenium:

Selenium was originally developed by Jason Huggins in 2004. It is open source software and has the following components.

Selenium IDE:

An Integrated Development Environment, developed as a Firefox extension. It can record, edit and debug tests. It can also export a recorded test as executable client code in Java, Ruby and other popular languages.

Selenium Client API:

A scripting SDK (available in most of the major languages) that gives testers more control. We will see a Java example later in this article.

Selenium Web Driver :

The successor of Selenium RC. WebDriver accepts commands from the Client API and sends them to the browser. It is the main interface between your test cases and the browser. Selenium has web drivers for almost all the major web browsers, such as Firefox and Chrome.

Hello World Selenium :

The following program opens the Google home page, searches for “java complete reference” and clicks a result. We are using the Firefox driver to test it out.

Maven Dependency:


<dependencies>

<dependency>
 <groupId>org.seleniumhq.selenium</groupId>
 <artifactId>selenium-firefox-driver</artifactId>
 <version>2.32.0</version>
 </dependency>

<dependency>
 <groupId>org.seleniumhq.selenium</groupId>
 <artifactId>selenium-server</artifactId>
 <version>2.32.0</version>
 </dependency>

<dependency>
 <groupId>org.apache.httpcomponents</groupId>
 <artifactId>httpcore</artifactId>
 <version>4.2.3</version>
 </dependency>

<dependency>

<groupId>org.seleniumhq.selenium</groupId>
 <artifactId>selenium-java</artifactId>
 <version>2.35.0</version>
 </dependency>

<dependency>
 <groupId>commons-lang</groupId>
 <artifactId>commons-lang</artifactId>
 <version>2.6</version>
 </dependency>

<dependency>
 <groupId>xml-apis</groupId>
 <artifactId>xml-apis</artifactId>
 <version>1.4.01</version>
 </dependency>

</dependencies>

Sample Java code:


package com.example.tests;

import java.util.regex.Pattern;
import java.util.concurrent.TimeUnit;
import org.junit.*;
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;
import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.Select;

public class Google {
 private WebDriver driver;
 private String baseUrl;
 private boolean acceptNextAlert = true;
 private StringBuffer verificationErrors = new StringBuffer();

@Before
 public void setUp() throws Exception {
 driver = new FirefoxDriver();
 baseUrl = "https://www.google.co.in/";
 driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
 }

@Test
 public void testGoogle() throws Exception {
 driver.get(baseUrl);
 driver.findElement(By.id("lst-ib")).clear();
 driver.findElement(By.id("lst-ib")).sendKeys("java complete reference");
 driver.findElement(By.xpath("//ol[@id='rso']/li/div/h3/a/em[2]")).click();
 }

@After
 public void tearDown() throws Exception {
 driver.quit();
 String verificationErrorString = verificationErrors.toString();
 if (!"".equals(verificationErrorString)) {
 fail(verificationErrorString);
 }
 }

private boolean isElementPresent(By by) {
 try {
 driver.findElement(by);
 return true;
 } catch (NoSuchElementException e) {
 return false;
 }
 }

private boolean isAlertPresent() {
 try {
 driver.switchTo().alert();
 return true;
 } catch (NoAlertPresentException e) {
 return false;
 }
 }

private String closeAlertAndGetItsText() {
 try {
 Alert alert = driver.switchTo().alert();
 String alertText = alert.getText();
 if (acceptNextAlert) {
 alert.accept();
 } else {
 alert.dismiss();
 }
 return alertText;
 } finally {
 acceptNextAlert = true;
 }
 }
}

Tech Co-founder – Watch before you leap.

August 22, 2013 bigdatanerd

Have you often heard, “I’ve got this killer idea; I just need a technical co-founder. Will you jump on board the ship for a hard voyage?” Then read on.

Keep in mind that the idea is just one part of a successful product. You could say it contributes maybe 5% of a product. When someone says “this is a unique idea, don’t tell anyone”, he is probably the most foolish person you will ever meet. Just carry on with your business.

There is no such thing as a unique idea. There were lots of search engines before and after Google. There were lots of social networks before and after Facebook. (Remember, LinkedIn started a year before Facebook.) The idea is not your product; the ability to sustain it and keep innovating is what yields your product. In short, it is all about implementation. Remember, the tech co-founder is that implementation guy.

Before you jump in to be part of any voyage, here is a checklist for you.

Ask the following questions.

What does the “idea” guy bring to the table?

The non-tech co-founder should have the following skills.

Strong domain understanding and the ability to innovate in that domain.
A strong network of connections within the domain.
Strong sales and marketing experience. (This is a must.)
A strong vision for the product and where he wants to take it in the next three years.
The ability to understand and respect technical difficulties.
A strong sense of user experience and the ability to create wireframes.

What is that “idea”?

Do you feel the idea is sustainable? As a user or a potential user, would you use it?
Does the idea solve any real-world problem?
How realistic is the idea?
Will your life become a lot better with that product?

What is the benefit for me?

“You are my tech co-founder, my CTO. You will hold a 2-3% stake in the company.” Just get lost. There is no point in being part of it for such a marginal share while spending your valuable time. Instead you could solve challenging problems on kaggle.com or help people on stackoverflow.com.
Are you getting good pay for your consulting? Is that money worth your time? If not, move on.

If your checklist gives negative marks all the way through, walk away. There are enough technical challenges in the world to be solved to better human life. Just don’t waste your time.

Continuous code re-factoring – part-2 Unit Testing on the container, the Arquillian way

August 9, 2013 bigdatanerd

The Problem:

JUnit is probably one of the top 5 open source tools developed by the Java community. Until IoC and dependency injection came into the picture, all was well. Once containers took care of injecting objects for you, product development became a lot easier: programmers just need to describe how an object should be created, and the container takes care of injecting it. On the downside, it makes unit testing a lot harder. To unit test a method, you need to mock all the objects that the method depends on. One of the most common of these objects is your EntityManager. Mocking all of these objects is a time-consuming effort that eats into your development time. Developers then tend to move away from writing unit test cases, which hurts code coverage. The lower the code coverage, the more likely the code is to break. So your continuous code refactoring is at huge risk.
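To give a feel for the mocking effort described above, here is a minimal sketch of a hand-rolled test double built with the JDK's dynamic proxies. Everything in it is hypothetical: `CustomerStore` stands in for a container-injected collaborator such as an EntityManager, and `Greeter` is an invented class under test.

```java
import java.lang.reflect.Proxy;

/** Sketch of hand-rolling a test double for a collaborator the container would inject. */
final class HandRolledMock {

    /** Hypothetical collaborator; stands in for something like an EntityManager. */
    interface CustomerStore {
        String findName(int id);

        int count();
    }

    /** Hypothetical class under test. */
    static final class Greeter {
        private final CustomerStore store;

        Greeter(CustomerStore store) {
            this.store = store;
        }

        String greet(int id) {
            return "Hello, " + store.findName(id) + "!";
        }
    }

    /** Builds a stub that only answers findName(); every other call fails loudly. */
    static CustomerStore stubStore(String cannedName) {
        return (CustomerStore) Proxy.newProxyInstance(
                CustomerStore.class.getClassLoader(),
                new Class<?>[] {CustomerStore.class},
                (proxy, method, args) -> {
                    if (method.getName().equals("findName")) {
                        return cannedName;
                    }
                    throw new UnsupportedOperationException(method.getName());
                });
    }

    public static void main(String[] args) {
        Greeter greeter = new Greeter(stubStore("Ada"));
        System.out.println(greeter.greet(1)); // prints Hello, Ada!
    }
}
```

With a two-method interface this is manageable. An interface as large as EntityManager needs canned behavior for every method the code under test touches, which is the effort the paragraph above complains about.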

The saviour – Arquillian:

Arquillian is an open source test framework from JBoss. It integrates with JUnit and lets your existing test cases run inside a container. The way Arquillian works is that, instead of mocking the dependent objects, you describe which objects your class depends on. Arquillian bundles the dependent classes into a WAR, deploys it to the JBoss server, and runs the unit tests on your actual JBoss server.

The Advantage :

1. Developers don’t need to mock the objects.

2. We are actually testing container-injected objects, just as your code will run after you deploy the WAR file.

3. Since only the required classes are deployed, you are not deploying the entire WAR file, so the testing is faster.

4. It takes the pain away from developers and reduces development time.

The Hello World:

Let’s see how we can run a simple Hello World test case that injects the EntityManager. I’m taking the EntityManager as an example because it is a common use case and I found very few resources about it on the web.

Step 1:

Configure your JBoss home as an environment variable.

JBOSS_HOME = “<jboss home directory>”

Step 2:

Maven dependency configuration


<properties>
 <version.shrinkwrap.resolvers>2.0.0-beta-5</version.shrinkwrap.resolvers>
</properties>


<dependency>

<groupId>org.jboss.as</groupId>
 <artifactId>jboss-as-arquillian-container-managed</artifactId>
 <version>7.1.1.Final</version>
 <scope>test</scope>
 <exclusions>
 <exclusion>
 <artifactId>org.apache.felix.resolver</artifactId>
 <groupId>org.apache.felix</groupId>
 </exclusion>
 </exclusions>
 </dependency>

<dependency>
 <groupId>org.apache.felix</groupId>
 <artifactId>org.apache.felix.resolver</artifactId>
 <version>1.0.0</version>
 </dependency>
 <dependency>
 <groupId>org.jboss.arquillian.protocol</groupId>
 <artifactId>arquillian-protocol-servlet</artifactId>
 <version>1.0.4.Final</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.arquillian.junit</groupId>
 <artifactId>arquillian-junit-container</artifactId>
 <version>1.0.4.Final</version>
 <scope>test</scope>
 </dependency>

<dependency>
 <groupId>org.jboss.spec</groupId>
 <artifactId>jboss-javaee-web-6.0</artifactId>
 <version>3.0.2.Final</version>
 <type>pom</type>
 <scope>provided</scope>
 <exclusions>
 <exclusion>
 <groupId>xalan</groupId>
 <artifactId>xalan</artifactId>
 </exclusion>
 </exclusions>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-api</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-spi</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-api-maven</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-spi-maven</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-impl-maven</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>
 <dependency>
 <groupId>org.jboss.shrinkwrap.resolver</groupId>
 <artifactId>shrinkwrap-resolver-impl-maven-archive</artifactId>
 <version>${version.shrinkwrap.resolvers}</version>
 <scope>test</scope>
 </dependency>

<dependency>
 <groupId>xml-apis</groupId>
 <artifactId>xml-apis</artifactId>
 <version>1.4.01</version>
 </dependency>

<dependency>
 <groupId>mysql</groupId>
 <artifactId>mysql-connector-java</artifactId>
 <version>5.1.21</version>
 </dependency>

Step 3:

In your src/test/resources, create an XML file named arquillian.xml

<?xml version="1.0" encoding="UTF-8"?>
<arquillian xmlns="http://jboss.org/schema/arquillian"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://jboss.org/schema/arquillian
 http://jboss.org/schema/arquillian/arquillian_1_0.xsd">

<!-- Uncomment to have test archives exported to the file system for inspection -->
<!-- <engine> -->
<!-- <property name="deploymentExportPath">target/</property> -->
<!-- </engine> -->

<!-- Force the use of the Servlet 3.0 protocol with all containers, as it is the most mature -->
 <defaultProtocol type="Servlet 3.0" />

 <container qualifier="jboss">
 <protocol type="jmx-as7">
 <property name="executionType">REMOTE</property>
 </protocol>
 </container>
</arquillian>

Step 4:
In your src/test/resources/META-INF, create a file named test-persistence.xml like this:

<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0"
 xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="
 http://java.sun.com/xml/ns/persistence
 http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">

<persistence-unit name="stats-demo" transaction-type="JTA">
 <provider>org.hibernate.ejb.HibernatePersistence</provider>
 <jta-data-source>[Your JTA data source configured in standalone.xml | domain.xml]</jta-data-source>
 <properties>
 <property name="hibernate.dialect" value="org.hibernate.dialect.MySQLDialect" />
 <property name="hibernate.show_sql" value="true" />
 <property name="hibernate.format_sql" value="true"/>
 <property name="hibernate.use_sql_comments" value="true"/>
 <property name="hibernate.connection.provider_class"
 value="org.hibernate.connection.DatasourceConnectionProvider" />
 <property name="hibernate.transaction.factory_class"
 value="org.hibernate.transaction.JTATransactionFactory" />
 <property name="hibernate.cache.provider_class" value="org.hibernate.cache.HashtableCacheProvider" />
 </properties>
 </persistence-unit>

</persistence>

Step 5:
We have completed all the XML configuration needed to start with Arquillian. Now the JUnit class to test is:

package com.ananth.learning.unittest.arquillian;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.jboss.arquillian.container.test.api.Deployment;
import org.jboss.arquillian.junit.Arquillian;
import org.jboss.shrinkwrap.api.Archive;
import org.jboss.shrinkwrap.api.ShrinkWrap;
import org.jboss.shrinkwrap.api.asset.EmptyAsset;
import org.jboss.shrinkwrap.api.spec.WebArchive;
import org.jboss.shrinkwrap.resolver.api.maven.Maven;
import org.junit.Test;
import org.junit.runner.RunWith;

/**
 * @RunWith makes sure the test case runs as an Arquillian test case
 */
@RunWith(Arquillian.class)
public class HelloWorld {

    @Deployment
    public static Archive<?> createDeployment() {
        // Add this if you want to bundle any library that is declared as a
        // dependency in your Maven configuration (pom.xml)
        File[] libs = Maven.resolver().loadPomFromFile("pom.xml")
                .resolve(getDependencyLibs()).withTransitivity().asFile();

        // System.out.println(libs);

        /*
         * Add the dependent classes and libs to the web archive.
         * Check the ShrinkWrap API: addClass() adds a single class;
         * addClasses(), addPackage() and addPackages() also work for you.
         * Note: you don't need a beans.xml file on disk; an empty one is
         * added to the archive below.
         */
        WebArchive war = ShrinkWrap
                .create(WebArchive.class, "test.war")
                .addClass(HelloWorld.class)
                .addAsLibraries(libs)
                .addAsResource("META-INF/test-persistence.xml",
                        "META-INF/persistence.xml")
                .addAsManifestResource(EmptyAsset.INSTANCE, "beans.xml");

        // System.out.println(war.toString(true));

        return war;
    }

    // Let the container inject the EntityManager for you.
    // @PersistenceContext is used here because a plain @Inject only works
    // when a CDI producer for EntityManager is part of the deployment.
    @PersistenceContext
    private EntityManager em;

    /**
     * Your actual test case
     */
    @Test
    public void testHello() {
        System.out.println(em.toString());
    }

    /**
     * List of dependency modules from pom.xml that you may need
     * (e.g. your common module or persistence module)
     */
    private static List<String> getDependencyLibs() {
        List<String> list = new ArrayList<>();
        list.add("mysql:mysql-connector-java");
        return list;
    }

}

Now all you need to do is run the test case just as you would run a normal JUnit test. You will see the WAR being generated and the unit test running on your JBoss server!