The real power and value proposition of Apache Spark lies in its speed and in the platform it provides for executing Data Science tasks. Spark is distinctive in combining ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization, which lets Data Scientists tackle the complexities that come with raw, unstructured data sets. Spark also aims to ease the transition from working on a single machine to working on a cluster, which makes data science tasks a lot more agile.
In this course, you’ll get a hands-on technical resource that will enable you to become comfortable and confident working with Spark for Data Science. We won't just explore Spark’s Data Science libraries; we’ll dive deeper and expand on the topics.
This course starts by taking you through Spark and the steps needed to build machine learning applications. You will learn to collect, clean, and visualize data coming from Twitter with Spark Streaming. Then, you will get acquainted with Spark's machine learning algorithms and different machine learning techniques. You will also learn to apply statistical analysis and mining operations to our Tweet dataset. Finally, the course ends by giving you some ideas on how to perform more advanced analysis, including graph processing. By the end of the course, you will be able to do your job as a Data Scientist in a very visual way that is comprehensive and appealing to business and other stakeholders.
- Understand the Spark programming model and its ecosystem of packages for Data Science
- Obtain and clean data before processing it
- Understand Spark's machine learning algorithms and use them to build a simple pipeline
- Work with interactive visualization packages in Spark
- Apply data mining techniques on the available data sets
- Build a recommendation engine
Your Spark And Visualization Toolkit
- The Course Overview
- Spark: Origins and Ecosystem for Big Data Scientists, the Scala, Python, and R flavors
- Install Spark on Your Laptop with Docker, or Scale Fast in the Cloud
- Apache Zeppelin, a Web-Based Notebook for Spark with matplotlib and ggplot2
Your Next Data Challenges
- Manipulating Data with the Core RDD API
- Using DataFrame, Dataset, and SQL – Natural and Easy!
- Manipulating Rows and Columns
First Steps With Spark Visualization
- Discovering spark.ml and spark.mllib - and Other Libraries
- Wrapping Up Basic Statistics and Linear Algebra
- Cleansing Data and Engineering the Features
- Reducing the Dimensionality
Collecting And Cleansing The Dirty Tweets
- Streaming Tweets to Disk
- Streaming Tweets on a Map
- Cleansing and Building Your Reference Dataset
Statistical Analysis On Tweets
- Indicators, Correlations, and Sampling
- Validating Statistical Relevance
- Running SVD and PCA
Extracting Features From The Tweets
- Analyzing Free Text from the Tweets
- Dealing with Stemming, Syntax, Idioms and Hashtags
- Detecting Tweet Sentiment
Mine Data And Share Results
- Word Cloudify Your Dataset
- Locating Users and Displaying Heatmaps with GeoHash
- Collaborating on the Same Note with Peers
Classifying The Tweets
- Building the Training and Test Datasets
- Training a Logistic Regression Model
- Evaluating Your Classifier
Clustering Users
- Clustering Users by Followers and Friends
- Clustering Users by Location
- Running KMeans on a Stream
Your Next Data Challenges
- Recommending Similar Users
- Analyzing Mentions with GraphX
- Where to Go from Here
Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer (http://datalayer.io/docker), a social network for Data Scientists. He is passionate about using software and mathematics to help companies get insights from data.
His typical day includes building efficient processing with advanced machine learning algorithms, easy SQL, streaming and graph analytics. He also focuses a lot on visualization and result sharing.
He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events. He can be contacted on Twitter at @echarles.
A test will be conducted at the end of the course. On completing the test with a minimum score of 70%, training.com will issue a certificate of successful completion from NIIT.
Five re-attempts will be provided if the candidate scores less than 70%.
A participation certificate will be issued if the candidate does not score 70% after five attempts.
Preferably, Math or Statistics studied at school or college level.
Familiarity with databases, MIS, and data analysis will make learning easier and quicker.
Who should go for this Course?
Software professionals aspiring to a career in Data Science who want to use their analytical ability to interpret rich data stores.
Data visualisation and statistics aspirants with a keen interest in programming can also join this course and build on what they learn for promising career growth.
Where can I find my session schedule?
The session schedule will be available in the Learning Plan section of the training.com Student portal. Log in to your training.com account to view it.
What is your refund policy?
If, after registering for the course, you are unable or unwilling to continue, you can apply for a refund. You can initiate the refund any time before the start of the second session of the course by sending an email to support@training.com with your enrolment details and bank account details (where you want the amount to be transferred). Once you initiate a refund request, you will receive the amount within 21 days after confirmation and verification by our team. A refund is available only if you have not downloaded any courseware after registration.
Why is it called a Self Paced course?
Self Paced courses consist of several learning videos arranged in a course structure that is broken down into Learning Modules and Sessions. The learner is required to go through the videos topic by topic, in the sequence of the course structure, to learn the concepts. Because the course is Self Paced, no external faculty or additional mentor is involved in the learning.
Being a self paced course, how will my attendance be tracked and marked?
When you log in to your training.com account to watch the videos, your attendance will be marked automatically.
How will the assessment be conducted for my certification?
After each module, an online multiple-choice assessment will be conducted. Five attempts will be allowed to complete the assessment. The minimum pass percentage for each assessment is 70%. On successfully clearing the assessment, a verified certificate from NIIT will be awarded; otherwise, a certificate of participation will be issued.
What are the minimum system requirements to attend the course?
Minimum system requirements for accessing the courses are:
- Personal computer or Laptop with web camera
- Headphone with Noise Clarity Microphone
- Broadband connection with minimum bandwidth of 4 Mbps.
- It's recommended to use System Health Check to examine the OS details, add-ins, plugins, camera, mic and other external devices.
Is there an official support desk for technical guidance during the training program?
Yes. For immediate technical support during the live online classroom sessions, you can call 91-9717992809 or 0124-4917203 between 9:00 AM and 8:00 PM IST. For all other queries, you can write to support@training.com and our team will be happy to help you.