Big Data Analytics

SCHOOL OF ARCHITECTURE, COMPUTING & ENGINEERING

Submission instructions
• Cover sheet to be attached to the front of the assignment when submitted
• Question paper to be attached to the assignment when submitted
• All pages to be numbered sequentially
• All work has to be presented in a ready to submit state upon arrival at the ACE Helpdesk. Assignment cover sheets or stationery will NOT be provided by Helpdesk staff

Module code: CN7031
Module title: Big Data Analytics
Module leader: Amin Karami
Assignment tutor: A Karami, F Jafari, MA Ghazanfar, N Qazi
Assignment title: Big Data Analytics: Coursework
Assignment number: 1
Weighting: 100%
Handout date: Week 5 (30th October 2020)
Submission date:
• Presentation: Week 12 (14th-18th December 2020)
• Turnitin Submission: 25th December 2020 (midnight)
Learning outcomes assessed by this assignment: 1-8
Turnitin submission requirement: Yes
Turnitin GradeMark feedback used? No
UEL Plus Grade Book submission used? No
UEL Plus Grade Book feedback used? No
Other electronic system used? Yes
Are submissions / feedback totally electronic? Yes
Additional information:

Form of assessment:
• Individual work
• Group work

For group work assessment which requires members to submit both individual and group work aspects for the assignment, the work should be submitted as:
• Consolidated single document
• Separately by each member

Number of assignment copies required:
• 1
• 2
• Other

Assignment to be presented in the following format:
• On-line submission
• Stapled once in the top left-hand corner
• Glue bound
• Spiral bound
• Placed in an A4 ring bound folder (not lever arch)

Note: To students submitting work on A3/A2 boards, work has to be contained in a suitable protective case to ensure any damage to work is avoided.

Soft copy:
• CD (to be attached to the work in an envelope or purpose made wallet adhered to the rear)
• USB (to be attached to the work in an envelope or purpose made wallet adhered to the rear)
• Soft copy not required

CN7031 – Big Data Analytics
Group assignment 2020-21 Academic Year

This coursework (CRWK) must be attempted in groups of 4 or 5 students. The coursework is divided into two sections: (1) Big Data analytics on a real case study and (2) a group presentation. All the group members must attend the presentation, which will be held online through Microsoft Teams. If you do not join the video call on the presentation date, you will fail the module.

The overall mark for the CRWK comes from two main activities as follows:
1- Big Data Analytics report (around 3,000 words, with a tolerance of ± 10%) in HTML format (60%)
2- Presentation (40%)

Marking Scheme
The number in brackets is the breakdown of marks for each sub-task.

Big Data Analytics using Spark SQL (total mark: 30)
• (6) Providing 2 queries using Spark SQL.
• (14) Developing advanced SQL statements. Refer to: https://spark.apache.org/docs/3.0.0/sql-ref.html
• (10) Visualizing the outcomes of the queries in graphical and textual format, and being able to interpret them.

Big Data Analytics using PySpark (total mark: 60)
• (45) Analyzing the dataset through 3 statistical analytics methods, including advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.
• (15) Designing one classifier, then evaluating and visualizing its accuracy/performance. Applying a multi-class classifier is required for full marks.

Documentation (total mark: 10)
• (10) Writing a well-organized report for a programming and analytics project.

Total: 100

IMPORTANT: you must use the CRWK template in HTML format, otherwise it will be counted as plagiarism and your group mark will be zero. Please refer to the “THE FORMAT OF FINAL SUBMISSION” section.

Good Luck!

Big Data Analytics using Spark
CN7031 – Big Data Analytics

(1) Understanding Dataset: CSE-CIC-IDS2018¹

This dataset was originally created by the University of New Brunswick for analyzing DDoS data. You can find the full dataset and its description here. The dataset is based on logs of the university’s servers, which recorded various DoS attacks throughout the publicly available period. It has 80 attributes in total and is 6.40 GB in size.
We will use about 2.6 GB of the data so that it can be processed on PCs restricted to 4 GB of RAM. Download it from here. When writing machine learning or statistical analysis for this data, note that the Label column is arguably the most important portion of the data, as it determines whether the packets sent are malicious or not.

a) The features are described in the “IDS2018_Features.xlsx” file on the Moodle page.
b) The values of the Label column are as follows:
• “Benign”: normal traffic
• any other value: traffic that is part of a DoS attack
c) In this coursework, we use more than 8.2 million records with a size of 2.6 GB. As a big data specialist, you should first read and understand the features, then apply modeling techniques. If you want to see a few records of this dataset, you can use either [1] Hadoop HDFS and Hive, [2] Spark SQL or [3] RDD to print a few records for your understanding.

¹ Source: https://registry.opendata.aws/cse-cic-ids2018/ & https://www.unb.ca/cic/datasets/ids-2018.html

(2) Big Data Query & Analysis using Spark SQL [30 marks]

This task uses Spark SQL to convert large raw data into useful information. Each member of a group should implement 2 complex SQL queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically. Interpret your findings briefly.

You can use https://spark.apache.org/docs/3.0.0/sql-ref.html for more information.

• What do you need to put in the HTML report per student?
1. At least two Spark SQL queries.
2. A short explanation of the queries.
3. The working solution, i.e., plot or table.
• Tip: The mark for this section depends on the complexity of your queries; for instance, a simple select query will not earn full marks.

(3) Advanced Analytics using PySpark [60 marks]

In this section, you will conduct advanced analytics using PySpark.

3.1. Analyze and Interpret Big Data using PySpark (45 marks)

Every member of a group should analyze the data through 3 analytical methods (e.g., advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically. Apply tooltip text, legend, title, X-Y labels, etc. accordingly.

Note: a working solution without system or logical errors is needed for a good/full mark.

3.2. Design and Build a Machine Learning (ML) technique (15 marks)

Every member of a group should go over https://spark.apache.org/docs/3.0.0/ml-guide.html and apply one ML technique. You can apply one of the following approaches: Classification, Regression, Clustering, Dimensionality Reduction, Feature Extraction, Frequent Pattern mining or Optimization. Explain and evaluate your model and its results using numerical and/or graphical representations.

Note: If there are 4 students in a group, you should develop 4 different models. If two members submit a similar model, the mark will be zero.

(4) Documentation [10 marks]

Your final report must follow the “THE FORMAT OF FINAL SUBMISSION” section. Your work must demonstrate an appropriate understanding of how to build a user-friendly, efficient and comprehensive analytics report for a big data project, one that helps users (readers) move around and find the relevant contents.

THE FORMAT OF FINAL SUBMISSION

1- You can use either Google Colab (https://colab.research.google.com/) or Ubuntu VMWare for this CRWK.
2- You have to convert the source code (*.ipynb) to HTML. Watch the video in Moodle about “how to submit the report in HTML format”.
3- Upload ONLY one single HTML file per group into Turnitin in Moodle. One member of each group must submit the work, NOT all members. The name of the file must be in the format of “Your-Group-ID_CN7031”, such as Group200_CN7031.html if you belong to group 200.
4- The submission link will be available from week 10, and you are free to amend your submitted file several times before the submission deadline. Your last submission will be saved in the Moodle database for marking.

PLAGIARISM

If any PySpark code is copied from somewhere or someone else, all the group members will get zero and must attend the “breach of regulation” committee for further explanation and possible additional penalties.

FEEDBACK TO STUDENTS

Feedback is central to learning and is provided to students to develop their knowledge, understanding and skills, and to help promote learning and facilitate improvement.
• Feedback will be provided as soon as possible after the student has completed the assessment task.
• Feedback will be in relation to the learning outcomes and assessment criteria.
• It will be offered via Turnitin GradeMark or Moodle post.

As the feedback (including marks) is provided before the Award & Field Board, marks are:
• Provisional
• available for External Examiner scrutiny
• subject to change and approval by the Assessment Board

ASSESSMENT FORM FOR PRESENTATION
CN7031 – Big Data Analytics (40%)

Students have to fill in this section correctly. Assessors will not be liable for any mistakes.

Group No: ……………….
1st Student (full name and ID):
2nd Student (full name and ID):
3rd Student (full name and ID):
4th Student (full name and ID):
5th Student (full name and ID):

Assessment Criteria:

Criteria                                  1st  2nd  3rd  4th  5th  Mark
Demonstrate/interpret Spark SQL queries                            10
Understand Spark and its mechanism                                 5
Demonstrate/interpret PySpark codes                                15
Ability to answer questions                                        10
Overall mark                                                       40

Date & Time: ………………………….
Assessors’ signature and comments:
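
Two of the submission rules in this brief are easy to get wrong by hand: the report length (around 3,000 words with a tolerance of ± 10%) and the file name format (“Your-Group-ID_CN7031”, e.g. Group200_CN7031.html). The sketch below shows one way a group might check both before uploading. The function names are hypothetical, and the file name pattern (the word "Group" followed by digits) is an assumption inferred from the single example given in the brief, not an official rule.

```python
import re

# Values taken from the brief; all helper names below are hypothetical.
TARGET_WORDS = 3000
TOLERANCE_PERCENT = 10
MODULE_CODE = "CN7031"


def word_count_bounds(target=TARGET_WORDS, tolerance_percent=TOLERANCE_PERCENT):
    """Return the (minimum, maximum) word count allowed by the tolerance."""
    low = target * (100 - tolerance_percent) // 100
    high = target * (100 + tolerance_percent) // 100
    return low, high


def submission_filename(group_id):
    """Build a file name in the "Your-Group-ID_CN7031" format."""
    return f"Group{group_id}_{MODULE_CODE}.html"


def is_valid_filename(name):
    """Check a name against the assumed pattern Group<digits>_CN7031.html."""
    return re.fullmatch(rf"Group\d+_{MODULE_CODE}\.html", name) is not None


if __name__ == "__main__":
    print(word_count_bounds())
    print(submission_filename(200))
    print(is_valid_filename("Group200_CN7031.html"))
```

With the figures from the brief, the accepted report length works out to between 2,700 and 3,300 words.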
