Ms. D. Viji



Assistant Professor

Department of CSE

SRM University, Kattankulathur. [email protected]



Manasa Kandala

UG Student

Department of CSE

SRM University, Kattankulathur. [email protected]


Jashan Uppal

UG Student

Department of CSE

SRM University, Kattankulathur.

[email protected]





In this era of computerization, education has reinvented itself and is no longer limited to traditional methods. The search is on for new and advanced ways to make the educational system more efficient and to develop students’ intellect. A great deal of data is now collected in educational databases, but much of it remains unused. Putting such a large amount of data to good use requires powerful tools and algorithms. Studying and analyzing educational data is important for helping students improve. Educational Data Mining (EDM) is an emerging field that explores data in educational contexts by applying Data Mining (DM) techniques and tools. It provides insight into the teaching and learning process for effective educational planning. This paper presents a survey of educational data mining and its future scope.




Over the last 10-20 years, the number of educational institutions has grown rapidly, and they produce a large number of graduates every year. Institutes may follow the best teaching methods, yet they still face the problems of student dropout, low achievement, and graduate unemployment.


Educational Data Mining (EDM) is an emerging field that explores data in educational contexts by applying Data Mining (DM) techniques. EDM draws on areas such as Learning Analytics, Artificial Intelligence, Information Technology, Machine Learning, Statistics, Database Management Systems, Computing, and Data Mining. It can be considered an interdisciplinary research field that provides insight into the teaching and learning process for effective education.


Educational Data Mining is a new trend in the data mining and Knowledge Discovery in Databases (KDD) field. It focuses on mining useful patterns and discovering useful knowledge from educational information systems, such as admissions systems, registration systems, course management systems, and any other systems dealing with students, from schools to colleges and universities. Researchers in this field aim to discover knowledge that either helps educational institutes manage their students better or helps students improve their education and performance.


Understanding and analyzing the factors behind poor performance is a complex and continuous process based on past and present information gathered from students’ academic performance and behavior. Powerful techniques and algorithms are required to analyze and predict student performance systematically.


Although institutions collect a large volume of student data, this data remains unused and does little to improve the performance of students.


If institutions could identify the factors behind low performance early and predict students’ behavior, this knowledge would help them take proactive action to improve the performance of such students. It would be a win-win situation for everyone involved, i.e. management, teachers, students, and parents. Students would be able to identify their weaknesses beforehand and improve themselves. Teachers would be able to plan their lectures according to the needs of students and provide better guidance. Parents would be reassured of their ward’s performance in such institutes. Eventually, this would contribute to the growth of the nation.





Baradwaj and Pal conducted a study on a group of 50 students enrolled in a specific course program over a period of 4 years, using multiple performance indicators, including:

•  Previous Semester Marks

•  Class Test Grades

•  Seminar Performance

•  Assignments

•  General Proficiency

•  Attendance

•  Lab Work

•  End Semester Marks

They used the ID3 decision tree algorithm to construct a decision tree and derive if-then rules from it. The application is intended to help both instructors and students better understand and predict students’ performance at the end of the semester. They stated the objective of the study as: “This study will also work to identify those students which needed special attention to reducing fail ration and taking appropriate action for the next semester examination”.
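ID3 chooses, at each node, the attribute whose split yields the highest information gain. The following is a minimal sketch of that selection step only, not the authors’ implementation; the records, attribute names, and values are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy records (not data from the study).
students = [
    {"attendance": "good", "assignment": "yes", "result": "pass"},
    {"attendance": "good", "assignment": "no", "result": "pass"},
    {"attendance": "poor", "assignment": "yes", "result": "fail"},
    {"attendance": "poor", "assignment": "no", "result": "fail"},
]

# ID3 would place the attribute with the highest gain at the root of the tree.
best = max(["attendance", "assignment"],
           key=lambda a: information_gain(students, a, "result"))
print(best)
```

In this toy data, attendance separates the two classes perfectly, so it is selected as the root; a full ID3 implementation would then recurse on each branch.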


Abeer and Elaraby conducted a study that focuses on generating classification rules and predicting students’ performance in a selected course program based on previously recorded student behavior and activities. They processed and analyzed data on students previously enrolled in a specific course program over 6 years, with multiple attributes collected from the university. As a result, they were able to predict the students’ final grades in the selected course program. They stated the objective of the study as: “Help the students to improve the student’s performance, to identify those students which needed special attention to reducing failing ration and taking appropriate action at right time”.


Bhardwaj and Pal conducted a data mining study using the Naïve Bayes classification method on a group of BCA students. A questionnaire was administered to every student before the final examination. It contained multiple personal and social questions, which were used to identify the relationship between these factors and the students’ performance and grades. They stated the main objectives of the study as:

•  Generation of a data source of predictive variables

•  Identification of different factors which affect a student’s learning behavior and performance during their academic career

•  Construction of a prediction model using classification data mining techniques on the basis of the identified predictive variables

•  Validation of the developed model for higher education students studying in Indian universities or institutions

They found that the most influential factor in a student’s performance is the grade obtained in senior secondary school: students who performed well in secondary school tend to perform well in their bachelor’s studies. They also found that living location, medium of teaching, mother’s qualification, the student’s other habits, family annual income, and family status contribute strongly to students’ educational performance.


Baker and Yacef describe the following four goals of EDM:

•  Predicting students’ future learning behavior

•  Discovering or improving domain models

•  Studying the effects of educational support

•  Advancing scientific knowledge about learning and learners

Predicting students’ future learning behavior: This goal can be achieved by creating student models that incorporate the learner’s characteristics, including detailed information such as their knowledge, behaviors, and motivation to learn.

Discovering or improving domain models: Through the various methods and applications of EDM, new models can be discovered and existing models improved.

Studying the effects of educational support: This can be achieved through learning systems.

Advancing scientific knowledge about learning and learners: By building and incorporating student models, EDM research and the supporting technology can be improved considerably.



Data mining refers to extracting or “mining” knowledge from large amounts of data. Data mining techniques operate on large amounts of data to find new and hidden patterns and relationships that can be helpful in decision making.

The various techniques used in Data Mining are:


•  Association analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for transaction data analysis.
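Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). The following minimal sketch computes both measures for one rule; the course-enrolment transactions are hypothetical and only illustrate the calculation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions: the set of courses each student enrolled in.
enrolments = [
    {"algebra", "statistics"},
    {"algebra", "statistics", "programming"},
    {"algebra", "programming"},
    {"statistics"},
]

# Rule under test: {algebra} -> {statistics}
rule_support = support(enrolments, {"algebra", "statistics"})          # 2 of 4 transactions
rule_confidence = confidence(enrolments, {"algebra"}, {"statistics"})  # 2 of the 3 with algebra
print(rule_support, rule_confidence)
```

A rule miner such as Apriori would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds.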


•  Prediction

In prediction, the goal is to develop a model that can infer a single aspect of the data from some combination of other aspects of the data. There are three types of prediction: classification, regression, and density estimation. In any category of prediction, the input variables are either categorical or continuous.

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.


•  Clustering Analysis

Whereas classification and prediction analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

That is, clusters of objects are formed so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
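As an illustration of grouping unlabeled objects by similarity, the sketch below uses k-means, one common clustering algorithm that the text above does not name specifically. The (attendance %, marks %) points are hypothetical, and the naive initialisation (first k points as centroids) is chosen only to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers, using squared Euclidean distance."""
    centroids = list(points[:k])  # naive deterministic initialisation
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical (attendance %, marks %) pairs with no class labels attached.
records = [(90, 85), (40, 35), (88, 80), (45, 30), (92, 90), (38, 42)]
centroids, clusters = kmeans(records, k=2)
print(clusters)
```

Each resulting cluster can then be treated as a derived class label, for example "high performers" and "students needing attention".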



Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
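Under the independence assumption, the score of a class is its prior multiplied by the per-feature conditional probabilities. The following is a minimal sketch for categorical features with Laplace smoothing (a choice made here, not stated in the text); the training rows and labels are hypothetical.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes and per-class feature values for categorical Naive Bayes."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class with the highest prior * product of feature likelihoods."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of P(feature value | class)
            score *= (feature_counts[(i, label)][v] + 1) / (n + len(values[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: (secondary school grade, attendance) -> result
rows = [("high", "good"), ("high", "poor"), ("low", "good"), ("low", "poor"), ("high", "good")]
labels = ["pass", "pass", "fail", "fail", "pass"]
model = train(rows, labels)
print(predict(model, ("high", "good")), predict(model, ("low", "poor")))
```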

The naïve (default) model predicts the mode of the class attribute, that is, the most frequent class, for every example in a dataset. For example, consider a dataset of 100 records and 2 classes (Yes and No) in which “Yes” occurs 70 times and “No” occurs 30 times. The default model for this dataset classifies all objects as “Yes”, so its accuracy is 70%. Although it is useless as a predictor, it is important as a baseline against which the accuracies of other classification models can be evaluated. The concept can be generalized to all classes/labels in the data to produce an expected class recall as well.
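The 70% figure in the example above can be reproduced with a few lines; this is a minimal sketch, and the function name is introduced here for illustration only.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# The dataset from the example: 70 "Yes" records and 30 "No" records.
labels = ["Yes"] * 70 + ["No"] * 30
predicted, accuracy = majority_baseline(labels)
print(predicted, accuracy)  # Yes 0.7
```

Any classifier worth using on this dataset should score above this baseline accuracy.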


Data preprocessing:

The input data records may or may not be the same length. For example, if you’re working with sentences for sentiment analysis they’ll be of various lengths.

Since the sentences are of different lengths, we pad the shorter sentence with a special padding token so that both sentences have equal length; documents longer than the maximum length are trimmed.

After padding, the sentences become:

Sentence 1: the camera quality is very good

Sentence 2: the battery life is good, followed by one padding token

Both sentences are now of the same length. We proceed to build the vocabulary index, a mapping from each unique token in the corpus to an integer. In our case, the size of the vocabulary index is 9, since there are 9 unique tokens (8 words plus the padding token). The vocabulary is as follows: padding token = 0, the = 1, camera = 2, quality = 3, is = 4, very = 5, good = 6, battery = 7, life = 8.

Corresponding code:

from tensorflow.contrib import learn  # TensorFlow 1.x only

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

In TensorFlow 1.x, tensorflow.contrib.learn.preprocessing.VocabularyProcessor is used for building the vocabulary.

The VocabularyProcessor maps your text documents into vectors, and these vectors need to be of a consistent length. Each row in the raw_documents variable is mapped to a vector of length max_document_length. You provide this parameter to the VocabularyProcessor so that it can adjust the length of the output vectors.

Next, each sentence is converted into a vector of integers.

Sentence 1: 1, 2, 3, 4, 5, 6

Sentence 2: 1, 7, 8, 4, 6, 0
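The padding and indexing steps can be reproduced without TensorFlow. The sketch below is a plain-Python approximation of the idea, not the VocabularyProcessor implementation; the padding token name `<PAD>` and both function names are assumptions made for illustration.

```python
PAD = "<PAD>"  # hypothetical name for the padding token, mapped to index 0


def build_vocabulary(sentences):
    """Assign each unique token an integer in order of first appearance."""
    vocab = {PAD: 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_document_length):
    """Map a sentence to integer ids, trimming or padding to a fixed length."""
    ids = [vocab[word] for word in sentence.split()][:max_document_length]
    return ids + [vocab[PAD]] * (max_document_length - len(ids))


sentences = ["the camera quality is very good", "the battery life is good"]
vocab = build_vocabulary(sentences)
vectors = [encode(s, vocab, max_document_length=6) for s in sentences]
print(len(vocab))  # 9
print(vectors)     # [[1, 2, 3, 4, 5, 6], [1, 7, 8, 4, 6, 0]]
```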







Data mining is a vast area that employs many different techniques and algorithms for finding patterns. The algorithms discussed in this paper are those used in educational data mining. These algorithms have brought notable improvements in areas such as course outline design, teacher-student understanding, and student output and turnout ratios. The ICDM conference encourages the use and development of algorithms helpful in data mining, and considerable research on such algorithms is still under way.

Prediction with data mining has yielded benefits such as identifying weak students, determining student satisfaction with a particular course, faculty evaluation, comprehensive student evaluation, selection of the classroom teaching language, predicting student dropout, course registration planning, predicting the enrollment headcount, and evaluation of collaborative activities.

One of the most recent and biggest challenges that higher education faces today is making students employable. Many universities and institutes are not in a position to guide their students because they lack information and assistance from their teaching-learning systems. To better administer and serve the student population, universities and institutions need better assessment, analysis, and prediction tools.



