Computing View Point

Simplifying Hadoop

July 2013

The gap between the potential power of Hadoop and the technical difficulties in its implementation is narrowing, and about time too

Contents

The promise and challenge of Hadoop
Typical Hadoop implementations can add cost and complexity
The importance of infrastructure
Hadoop needs to be simple to deploy
Delivering on the promise of Hadoop
About the sponsor, Intel

© This document is property of Incisive Media. Reproduction and distribution of this publication in any form without prior written permission is forbidden.

The promise and challenge of Hadoop

Hadoop provides a scalable, cost-effective and flexible framework to integrate and enable the analysis of the vast amounts of unstructured and semi-structured data available within many organisations.

With up to 80 percent of corporate data existing in unstructured formats in some organisations, Hadoop offers a proven route to vastly expanding the power of an organisation’s data analysis.

Hadoop is an open source data processing technology that takes advantage of large clusters of industry-standard servers to create a single, reliable and highly extensible environment capable of storing and managing petabytes of information.
But conventional implementations of Hadoop come with a cost: Hadoop places a number of demands on its compute infrastructure, notably on CPU performance and on I/O and storage throughput. Using existing legacy infrastructure for a Hadoop deployment may significantly limit potential performance.

For this reason, as well as the inherent complexity of installing, tuning and operating Hadoop, this paper recommends that, wherever possible, modern server, storage and networking infrastructure be provisioned for the purpose, and that a distribution be selected that is designed both to simplify the management and to optimise the performance of this investment.

Typical Hadoop implementations can add cost and complexity

Hadoop enables distributed processing of very large data sets across clusters of industry-standard servers. It can scale from a single server to thousands of machines. The end result is a single environment, built on top of multitudes of individual processors and able to access hundreds or thousands of data storage devices, that can analyse disparate and heterogeneous data drawn from any number of sources and data types.

However, to say that Hadoop can be deployed on industry-standard servers is not the same thing as saying that it will run efficiently on just any hardware that might be to hand. Utilising existing infrastructure for a Hadoop deployment could well limit its performance and efficiency, and ultimately constrain the range and timeliness of results derived from a Hadoop implementation.

This is because the distributed compute model can pose a number of challenges for the hardware platform.

Hadoop can demand massive levels of parallel processing, with each processor having to deal with very high numbers of multi-threaded operations. With older processor systems this can affect performance, ultimately causing long lead times for analysis and reporting.
Hadoop also makes use of highly distributed data storage. As a result, storage I/O can present a significant challenge in Hadoop installations built on legacy or previous generation technology platforms. Storage I/O latencies can also create bottlenecks in processing results and queries.

As a highly distributed environment, potentially spanning thousands of processors and storage devices, Hadoop is crucially dependent on network speeds between its distributed processors. Slow networking speeds can reduce the performance of the overall Hadoop environment significantly.

Compared with the modern high-performance technologies incorporated in the latest generation of servers and networking systems, choosing older platform technologies to build Hadoop installations can significantly drive up the cost of running Hadoop, and reduce both the financial and the operational returns on the investment.

While the core software may be free to download, a production Hadoop installation requires significant input in terms of time and skills. It may be tempting to bring aging hardware back into service for the purpose, but for a deployment of any scale this is likely to be a false economy.

Modern technology platforms are much more able to cope with the demands placed on them by Hadoop, resulting in significant performance improvements.

By incorporating new technologies to deliver superior multi-threaded and single-threaded processing, and dynamic boosting of processing across multiple cores, the latest generation of processors can deliver up to 50 percent higher performance than the previous generation.

Improvements in storage I/O performance, increasing availability of cost-competitive solid state drive (SSD) devices and use of cache acceleration software can all offer further performance improvements in the Hadoop environment. Using SSD technologies together with cache acceleration software can on its own deliver improvements of 80 percent for some Hadoop workloads.
Advances in networking technology are also driving down costs and delivering further improvements to Hadoop processing. With 10 Gigabit Ethernet (10GbE) now available at a lower cost per gigabit of bandwidth than 1GbE networking, it not only makes financial sense but can also improve overall Hadoop data export and processing times by up to 80 percent.

The volume, complexity and heterogeneity of the data that can be analysed is also greatly increased, and the timeliness of reporting significantly improved. The bottom line is that business units can achieve more accurate results from a greater range of data, within time-scales that make Hadoop processing practical for many different types of business decision-making.

The importance of infrastructure

To exploit to the full the potential of Hadoop to manage and analyse the largest Big Data sets and to provide both deep and timely results, organisations need to implement the latest generation of industry-standard server systems, optimised for the requirements of Hadoop Big Data processing. The choice of Hadoop distribution is important too.

“Exploiting Big Data fully requires the latest hardware and software, working in tandem”

Although the core stack of Hadoop software is readily available from a number of distributions, it should be noted that not all have been optimised for the underlying infrastructure. As we have seen, the choice of processors, networking and storage can make a big difference in the way that Hadoop performs. For large scale or business-dependent applications in particular, selecting a distribution that has been fine-tuned to get the best out of modern hardware will enable an organisation to get the most out of its investment.

Building on optimisations to the open source software stack, so as to exploit fully the advances in the latest processors, 10GbE networking and SSD storage used in latest generation servers, provides a significant step forward in Hadoop performance, manageability and cost, and lowers the barriers to Hadoop adoption.
An optimised installation of Hadoop will not only deliver higher performance, but will also save significantly on hardware, space and energy costs.

Hadoop has dozens of performance-relevant configuration parameters that can affect the speed of queries and reporting. Making sure that each of these parameters is set for optimum performance for each application sitting on top of Hadoop can be extremely challenging, especially at scale. However, distributions that have been designed in tandem with the underlying hardware remove many of these steps by incorporating software that automates the task of tuning. Automated tuning can improve performance significantly and reduce latency.

Hadoop needs to be simple to deploy

While Hadoop itself and many Hadoop distributions are free, the full costs of deploying it are liable to be underestimated.

This is because Hadoop clusters (the collections of commodity servers across which data is distributed) can be complex to deploy and difficult to manage. The software stack requires Linux, Java, and sometimes a hypervisor on each server node. Hadoop itself is the kernel of a platform made up of several software components, each of which is installed and configured separately.

All this can drive up the cost of Hadoop, lengthen time-to-value and reduce ROI on Hadoop investments.

Organisations need technologies, configurations and methodologies that simplify the implementation and management of Hadoop. Fortunately, tools are now becoming available that ease the burden on the IT team.
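To give a sense of the manual configuration work involved, the following is a minimal sketch, in Python, of generating a Hadoop-style configuration fragment for a handful of performance-related properties. It is an illustration under stated assumptions, not a tuning recommendation: the property names are those used by Hadoop 2.x and later, the values are placeholders, and the helper names build_configuration and render are hypothetical. In a real deployment the dfs.* properties would go in hdfs-site.xml and the mapreduce.* properties in mapred-site.xml, and the fragment would have to be distributed to every node.

```python
import xml.etree.ElementTree as ET

# Placeholder values for illustration only; not tuning advice.
SETTINGS = {
    "dfs.replication": "3",               # copies kept of each data block
    "dfs.blocksize": "134217728",         # block size in bytes (128 MB)
    "mapreduce.task.io.sort.mb": "256",   # sort buffer size for map output, in MB
}


def build_configuration(settings):
    """Return a <configuration> element holding one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


def render(settings):
    """Serialise the settings to an XML string."""
    return ET.tostring(build_configuration(settings), encoding="unicode")


if __name__ == "__main__":
    print(render(SETTINGS))
```

Three properties are manageable by hand; dozens of them, varied per application and per node type, are what make automated tooling attractive.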
Automated Hadoop management software can reduce the time and complexity of Hadoop deployments by providing a console-based front end that gives visibility over the whole system, making operational and administrative tasks such as authentication, cluster management and tracking the utilisation of resources much simpler. Tasks that previously had to be performed manually, such as performance tuning, are now automated. Such management consoles not only reduce the time and cost of Hadoop deployment, but typically achieve significant performance optimisation.

In the complex environment of a typical production-scale set-up, keeping Hadoop fully tuned can also be a challenge. Once launched into a production environment, Hadoop implementations typically evolve and change very rapidly. Automated Hadoop management software can carry out comprehensive system-wide monitoring and logging, as well as proactive health checks across a Hadoop cluster, ensuring on-going performance optimisation and systems tuning.

Delivering on the promise of Hadoop

An industry-standard, enterprise-class Hadoop stack with high-performance compute and I/O would open up new applications, especially in areas such as finance, healthcare and security.

Such is the rapidly evolving nature of Big Data that non-optimal, last-generation server system and management technologies will limit the ability to scale with increasing data demands, resulting in non-optimal solutions and reducing the value that can be obtained from the data in a timely fashion.

Organisations should recognise Big Data as part of their enterprise computing landscape, look for fully optimised and supported solutions from mainstream business application and technology providers, and look to optimise and integrate Hadoop fully with their existing business applications and computing infrastructure.
About the sponsor, Intel

The Intel Distribution for Apache Hadoop software is the only distribution built from silicon up to enable the widest range of data analysis on Apache Hadoop. It is the first with hardware-enhanced performance and security capabilities. It is the only open source platform for big data with support from a Fortune 100 company. Intel is committed to developing a platform on which the entire ecosystem can build next-generation analytics solutions.

Contact Intel

Visit:
http://hadoop.intel.com
http://intel.co.uk/bigdata
