Hadoop versus VPN - No Fluff Just Stuff

Posted by: Michael Nygard on July 31, 2009

I've been doing some work with Hadoop lately, and I just ran into an interesting problem with networking. This isn't a bug, per se, but a conflict in my configuration.

I'm running on a laptop, using a pseudo-distributed cluster. That means all the different processes are running, but they're all running on one box. That makes it possible to test jobs with full network communication, but without deploying to a production cluster.

I'm also working remotely, connecting to the corporate network by VPN. As is commonly done, our VPN is configured to completely separate the client machine from its local network. (If it didn't, you could use the VPN machine to bridge the secure corporate network to your home ISP, coffee shop, airport, etc.)

Here's the problem: when on the VPN, my machine can't talk to its own IP address. Right now, ifconfig reports the laptop's IP address as 192.168.1.105. That's the address associated with the physical NIC on the machine.
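A quick way to confirm this symptom outside of Hadoop is to open a listener on the machine and try to connect to it through each address. The following is a minimal sketch, not part of Hadoop; the function name can_reach_self is hypothetical, and the second address is simply the one ifconfig reported above.

```python
import socket


def can_reach_self(address, timeout=2.0):
    """Return True if a TCP client on this machine can connect to a
    listener on this same machine through `address`."""
    # Listen on all interfaces, on an ephemeral port picked by the OS.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("0.0.0.0", 0))
        server.listen(1)
        port = server.getsockname()[1]
        try:
            # The connection completes in the listen backlog, so no
            # accept() call is needed for this reachability check.
            with socket.create_connection((address, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, unreachable networks, and timeouts.
            return False


if __name__ == "__main__":
    # 127.0.0.1 is the loopback address; 192.168.1.105 is the NIC address
    # from this post (substitute whatever ifconfig reports for you).
    for address in ("127.0.0.1", "192.168.1.105"):
        print(address, can_reach_self(address))
```

If the loopback address succeeds while the NIC address fails only when the VPN is up, the VPN client is blocking traffic to the machine's own address.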

The odd part is that Hadoop mostly works this way. I've configured the name node, job tracker, task tracker, datanodes, etc. to all use "localhost". I can use HDFS, I can submit jobs, and all the map tasks work fine. The only problem is that when the map tasks finish, the task tracker cannot send data from the map tasks to the reduce tasks. The job appears to hang.
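One possible explanation, which I have not confirmed, is that the copy phase reaches the task tracker through the machine's hostname rather than through "localhost". Under that assumption, it is worth checking what the hostname resolves to. This sketch uses only the Python standard library; the helper names are hypothetical.

```python
import ipaddress
import socket


def resolve_own_hostname():
    """Return (hostname, address), where address is what the local
    resolver gives for this machine's hostname, or None on failure."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except socket.gaierror:
        address = None
    return hostname, address


def is_loopback(address):
    """True if `address` is an IP address in the loopback range."""
    if address is None:
        return False
    return ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    hostname, address = resolve_own_hostname()
    print(hostname, address, "loopback" if is_loopback(address) else "not loopback")
```

If the hostname resolves to the NIC address instead of a loopback address, any component that advertises itself by hostname would be unreachable while the VPN is connected.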

In the task tracker's log file, I see reports every 20 seconds or so that say

2009-07-31 11:01:33,992 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200907310946_003_r_000000_0 0.0% reduce > copy >

The instant I disconnected from the VPN, the copy proceeded and the reduce job ran.
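The stall is easy to miss because the job only appears to hang. As a rough sketch, the repeated progress lines can be spotted mechanically. The regular expression below is modeled on the single log line quoted above, so the format is an assumption and may not match other Hadoop versions; the function names and the min_reports threshold are hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the one TaskTracker progress line quoted in this post.
PROGRESS = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>attempt_\S+) (?P<percent>\d+(?:\.\d+)?)% (?P<phase>.*)$"
)


def parse_progress(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    match = PROGRESS.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "attempt": match.group("attempt"),
        "percent": float(match.group("percent")),
        "phase": match.group("phase").strip(),
    }


def stalled_in_copy(lines, min_reports=3):
    """Return attempt IDs that reported 0.0% in the reduce copy phase
    at least `min_reports` times."""
    counts = Counter()
    for line in lines:
        record = parse_progress(line)
        if record is None:
            continue
        if record["percent"] == 0.0 and record["phase"].startswith("reduce > copy"):
            counts[record["attempt"]] += 1
    return sorted(attempt for attempt, n in counts.items() if n >= min_reports)
```

Feeding it the lines of the task tracker's log file returns the reduce attempts that never got past the copy phase.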

I'm sure there's a configuration property somewhere within Hadoop that I can change. When (if) I find it, I'll update this post.

Michael Nygard

About Michael Nygard

Michael strives to raise the bar and ease the pain for developers across the country. He shares his passion and energy for improvement with everyone he meets, sometimes even with their permission. Michael has spent the better part of 20 years learning what it means to be a professional programmer who cares about art, quality, and craft. He's always ready to spend time with other developers who are fully engaged and devoted to their work: the “wide awake” developers. On the flip side, he cannot abide apathy or wasted potential.

Michael has been a professional programmer and architect for nearly 20 years. During that time, he has delivered running systems to the U.S. Government, the military, banking, finance, agriculture, and retail industries. More often than not, Michael has lived with the systems he built. This experience with the real world of operations changed his views about software architecture and development forever.

He worked through the birth and infancy of a Tier 1 retail site and has often served as “roving troubleshooter” for other online businesses. These experiences give him a unique perspective on building software for high performance and high reliability in the face of an actively hostile environment.

Most recently, Michael wrote “Release It! Design and Deploy Production-Ready Software”, a book that realizes many of his thoughts about building software that does more than just pass QA; it survives the real world. Michael previously wrote numerous articles and editorials, spoke at Comdex, and co-authored one of the early Java books.
