Data Science with Spark

This practical, hands-on tutorial covers the Spark fundamentals needed to get to grips with Data Science, working through a single data set. It is the next step for those who are comfortable with Spark programming and want to apply Spark in the field of Data Science.

  • Comprehensive training through 40 video sessions.
  • Understand the Spark programming model and its ecosystem of packages for Data Science
  • Understand Spark's machine learning algorithms and use them to build a simple pipeline
  • Apply data mining techniques on the available data sets



    Course Features

    About Data Science with Spark

    The real power and value proposition of Apache Spark lie in its speed and in the platform it provides for executing Data Science tasks. Spark combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization, allowing Data Scientists to tackle the complexities that come with raw, unstructured data sets. Spark also aims to ease the transition from working on a single machine to working on a cluster, which makes data science tasks a lot more agile.

    This course is a hands-on technical resource that will help you become comfortable and confident working with Spark for Data Science. We won't just explore Spark’s Data Science libraries; we’ll dive deeper and expand on the topics.

    This course starts by taking you through Spark and the steps needed to build machine learning applications. You will learn to collect, clean, and visualize data coming from Twitter with Spark Streaming. Then, you will get acquainted with Spark’s machine learning algorithms and different machine learning techniques. You will also learn to apply statistical analysis and mining operations to the tweet dataset. Finally, the course ends with ideas for further analysis, including graph processing. By the end of the course, you will be able to present your data science work visually, in a form that is comprehensive and appealing to business and other stakeholders.

    Course Objectives
    • Understand the Spark programming model and its ecosystem of packages for Data Science
    • Obtain and clean data before processing it
    • Understand Spark's machine learning algorithms and use them to build a simple pipeline
    • Work with interactive visualization packages in Spark
    • Apply data mining techniques on the available data sets
    • Build a recommendation engine
    Module 1:

    Your Spark And Visualization Toolkit

    • The Course Overview
    • Spark: Origins and Ecosystem for Big Data Scientists, the Scala, Python, and R flavors
    • Install Spark on Your Laptop with Docker, or Scale Fast in the Cloud
    • Apache Zeppelin, a Web-Based Notebook for Spark with matplotlib and ggplot2
    Module 2:

    Your Next Data Challenges

    • Manipulating Data with the Core RDD API
    • Using Dataframe, Dataset, and SQL – Natural and Easy!
    • Manipulating Rows and Columns
    Module 3:

    First Steps With Spark Visualization

    • Discovering and spark.mllib - and Other Libraries
    • Wrapping Up Basic Statistics and Linear Algebra
    • Cleansing Data and Engineering the Features
    • Reducing the Dimensionality
    Module 4:

    Collecting And Cleansing The Dirty Tweets

    • Streaming Tweets to Disk
    • Streaming Tweets on a Map
    • Cleansing and Building Your Reference Dataset
    Module 5:

    Statistical Analysis On Tweets

    • Indicators, Correlations, and Sampling
    • Validating Statistical Relevance
    • Running SVD and PCA
    Module 6:

    Extracting Features From The Tweets

    • Analyzing Free Text from the Tweets
    • Dealing with Stemming, Syntax, Idioms and Hashtags
    • Detecting Tweet Sentiment
    Module 7:

    Mine Data And Share Results

    • Word Cloudify Your Dataset
    • Locating Users and Displaying Heatmaps with GeoHash
    • Collaborating on the Same Note with Peers
    Module 8:

    Classifying The Tweets

    • Building the Training and Test Datasets
    • Training a Logistic Regression Model
    • Evaluating Your Classifier
    Module 9:

    Clustering Users

    • Clustering Users by Followers and Friends
    • Clustering Users by Location
    • Running KMeans on a Stream
    Module 10:

    Your Next Data Challenges

    • Recommending Similar Users
    • Analyzing Mentions with GraphX
    • Where to Go from Here

    Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer, a social network for Data Scientists. He is passionate about using software and mathematics to help companies get insights from data.

    His typical day includes building efficient processing with advanced machine learning algorithms, easy SQL, streaming and graph analytics. He also focuses a lot on visualization and result sharing.

    He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events. He can be contacted on Twitter at @echarles.


    A test will be conducted at the end of the course. On completing the test with a minimum score of 70%, you will be issued a certificate of successful completion from NIIT.

    Five re-attempts will be provided in case the candidate scores less than 70%.

    A Participation certificate will be issued if the candidate does not score 70% after five attempts.


    Preferably, Math or Statistics studied at school or college level.

    Familiarity with databases, MIS, and data analysis will make learning easier and quicker.


    Who should go for this Course?

    Software professionals aspiring to a career in Data Science who want to use their analytical ability to interpret rich data stores.

    Data visualisation and statistics aspirants with a keen interest in programming can also join this course and build on what they learn for promising career growth.

    Where can I find my session schedule?

    The session schedule is available in the Learning Plan section of the Student portal. Log in to your account to view it.

    What is your refund policy?

    If, after registering for the course, you are unable or unwilling to continue, you can apply for a refund. You can initiate the refund any time before the start of the second session of the course by sending an email with your enrolment details and bank account details (where you want the amount to be transferred). Once you initiate a refund request, you will receive the amount within 21 days after confirmation and verification by our team. Refunds are provided only if you have not downloaded any courseware after registration.

    Why is it called a Self Paced course?

    Self Paced courses comprise several learning videos organised into a course structure of Learning Modules and Sessions. The learner goes through the videos topic by topic, in the sequence of the course structure, to learn the concepts. Because the course is Self Paced, no external faculty or additional mentor is involved in the learning.

    Being a self paced course, how will my attendance be tracked and marked?

    When you log in to your account to watch the videos, your attendance will be marked automatically.

    How will the assessment be conducted for my certification?

    After each module, an online multiple-choice assessment will be conducted. Five attempts are allowed to complete the assessment. The minimum pass percentage for each assessment is 70%. On successfully clearing the assessment, a verified certificate from NIIT will be awarded; otherwise, a certificate of participation will be issued.

    What are the minimum system requirements to attend the course?

      Minimum system requirements for accessing the courses are:

    • Personal computer or laptop with a web camera
    • Headphones with a Noise Clarity microphone
    • Broadband connection with a minimum bandwidth of 4 Mbps
    • It's recommended to use System Health Check to examine the OS details, add-ins, plugins, camera, mic, and other external devices.

    Is there an official support desk for technical guidance during the training program?

    Yes. For immediate technical support during the live online classroom sessions, you can call 91-9717992809 or 0124-4917203 between 9:00 AM and 8:00 PM IST. For all other queries, you can write to us and our team will be happy to help you.

