How can we use a machine learning algorithm on this type of data?

https://stackoverflow.com/questions/21496219

05-10-2022

Question

Here is the scenario:

We have a website where students can create an e-portfolio, which is like a profile page combined with the projects they add to it.

For each student portfolio, an educator will review it and give it a set of scores based on its content. So a set of scores, summed to a total score, will be associated with each student's portfolio.

So we have score data associated with portfolio data, and we want to use it as supervised training data for a machine learning algorithm. The computer could then examine thousands of these cases, look for patterns, provide insight, and predict scores for other portfolios.

Here is the data we are collecting for each person:

**Portfolio data:**

About: 'Text paragraph data written by the student about themselves'
Skills: 'Text Bullet list of skills'
Career Interests: 'Text Bullet list of career interests'
Work Experience: 'Text paragraph'
Education History: 'Student fills out Universities, majors, gpa, and dates attended'
Courses: 'Text bullet list of courses'
Interests: 'Text paragraph data written by student about interests'
Works: 'Each student adds works to their portfolio and enters the following data'
   Work Title: 'Text title'
   Attachments: 'Files and documents attached to the portfolio (jpg, doc, pdf, youtube, dropbox, etc.)'
   Work description: 'Text Description of work'
   category of works: 'Selected from list of categories'
   tags: 'list of text tags student adds to work'
   My contribution: 'Text description of student's contribution to project'


**Score data we are collecting for each portfolio, each key area rated from 1-100:**

Content completeness:
Selection of Works:
Reflection:
Academic Concepts:
Presentation and Appearance:
Layout and Readability:
Use of Multimedia:
Audience:
Organization of content:
Written Communication:
TOTAL SCORE:

We plan to collect thousands of student portfolios and scores over time. What kind of algorithm could we use to analyze this data and find correlations between portfolios that received similar scores, and then use it to predict how successful a portfolio will be once a student has filled it out? Please let me know if any of this is confusing or if you need more information. Thanks so much!

Answer

There are a lot of issues you are trying to tackle here.

The first thing that comes to mind is to extract features and then apply regression to predict the scores. Since you are using more than just the text of the portfolios, you will need more than text features. I don't know which features will help you correlate the "presentation and appearance" of a portfolio with its score. One approach would be to extract color, font, and font-size information and represent it as features. To get insights from the text, you could represent it with the vector space model.
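To make the text part of this concrete, here is a minimal sketch using only the Python standard library. It represents each portfolio's text as a TF-IDF vector (one common form of the vector space model) and predicts a total score with k-nearest-neighbour regression, which is just one simple stand-in for "regression" here, not a recommendation. The function names, the sample texts, and the scores are all made up for illustration, and the sketch ignores the non-text features (layout, color, attachments) entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Represent text as a sparse, unit-length TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def predict_score(text, train_vectors, train_scores, idf, k=2):
    """k-NN regression: similarity-weighted mean score of the k closest portfolios."""
    query = tfidf_vector(text, idf)
    nearest = sorted(
        ((cosine(query, v), s) for v, s in zip(train_vectors, train_scores)),
        reverse=True,
    )[:k]
    total_sim = sum(sim for sim, _ in nearest)
    if total_sim == 0:
        # No shared vocabulary at all: fall back to the mean training score.
        return sum(train_scores) / len(train_scores)
    return sum(sim * s for sim, s in nearest) / total_sim


# Made-up training data: portfolio text and educator TOTAL SCORE (10 areas x 100).
portfolios = [
    "Built a mobile app in a team; I wrote the backend and reflected on testing.",
    "Research poster on water quality with analysis, charts and reflection on methods.",
    "I like stuff.",
    "Some pictures.",
]
total_scores = [820, 790, 150, 120]

idf = fit_idf(portfolios)
train_vectors = [tfidf_vector(p, idf) for p in portfolios]

strong = predict_score(
    "Team project where I built the app backend and wrote a reflection",
    train_vectors, total_scores, idf,
)
weak = predict_score("Some pictures", train_vectors, total_scores, idf)
print(round(strong), round(weak))
```

With thousands of real portfolios you would more likely reach for a library such as scikit-learn (for example its `TfidfVectorizer` with a linear regressor) and train one model per score area, but the shape of the pipeline stays the same: text to vector, vector to score.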

I shall get back and write a detailed answer soon. I am sorry if all of this sounds too vague right now.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow