<html> <h1 id=“lab-4-cleaning-table-data”>Lab 4: Cleaning Table Data</h1> <h2 id=“learning-objective”>Learning Objective</h2> <p>The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.</p> <p>Up until now, the data that you worked with had been sanitized for you (i.e. you could expect Pyret to process the data without any extra work). The data in this lab, however, is being ported directly from Google Forms submissions, so there exists data that either isn’t valid or isn’t useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need. For this lab we'll analyze the data from the form you and the other students filled out earlier this week.</p> <h2 id=“getting-started”>Getting Started</h2> <p>Go to the <a href=“https://docs.google.com/spreadsheets/d/1aK2_X-qZ_u1bXHkYHzeuf-7BEuxrllse5OeIQDrh71M/edit?usp=sharing”>response spreadsheet</a>, make a copy of it (select “Make a Copy” under the “File” tab), and import it into your program.</p> <p><img src=“make_a_copy.png” width=“75%”></p> <p>This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.</p> <p>Once you've made a copy of the spreadsheet, you need to share the spreadsheet so Pyret can import it. Press the green "Share" button in the upper right hand corner of the spreadsheet, then click on "Get link" and select "Anyone with the link" from the dropdown.</p> <p><img src=“get_link.png” width=“75%”></p> <p>To import the spreadsheet, include the following code at the top of your program. You will put the
ssid
of your copy of the spreadsheet in your code below. </p> <p><strong>Note</strong>: The highlighted portion of the spreadsheet URL is what your ssid looks like:</p> <p><a href=“https://docs.google.com/spreadsheets/d/`1PCNs1fcNCiEepbDMc5u9Edar-EZjSqUwf-Z41i33fkM`/edit”>https://docs.google.com/spreadsheets/d/`1PCNs1fcNCiEepbDMc5u9Edar-EZjSqUwf-Z41i33fkM`/edit</a></p> <p>Your ssid will be different, so don't use the one above. </p> <pre>
<span class="hljs-built_in">include</span> shared-gdrive(<span class="hljs-string">"dcic-2021"</span>, <span class="hljs-string">"1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep"</span>) <span class="hljs-built_in">include</span> gdrive-sheets <span class="hljs-comment"># spreadsheet id from Google Sheets</span> ssid = <span class="hljs-string">""</span> <span class="hljs-comment"># Put the ssid of your spreadsheet here</span> data-sheet = <span class="hljs-built_in">load</span>-spreadsheet(ssid) student-data = <span class="hljs-built_in">load</span>-table: timestamp, house, stem-level, sleep-hours, schoolwork-hours, student-athlete, extracurricular-hours source: data-sheet.sheet-<span class="hljs-keyword">by</span>-name(<span class="hljs-string">"Form Responses 1"</span>, <span class="hljs-literal">true</span>) <span class="hljs-function"><span class="hljs-keyword">end</span> <span class="hljs-title">student-data</span></span>
</pre><p>See if you can view the
student-data
table to make sure the spreadsheet loaded correctly. If you are not able to see your table, call your instructor or a coach over to help you troubleshoot. </p> <h2 id=“part-1-cleaning-the-data”>Part 1: Cleaning the Data</h2> <p>Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.</p> <h3 id=“1-1-data-sanitizing”>1.1 Data Sanitizing</h3> <p>Once you import your
student-data
table, have a look at it. Notice that everything is represented as a string, even the numeric columns. Also notice the values in the last two columns are in the format
some(x)
or
none
. In order to turn these values into the data types we all know and love (
String
,
Num
, etc.) we need to use the
sanitize
functions upon import. See the <a href=“https://dcic-world.org/2021-08-21/processing-tables.html#%28part._missing-data%29”>texbook</a> for an example of how to do this and the <a href=“https://www.pyret.org/docs/latest/data-source.html#%28part._.Sanitizers%29”>manual</a> for a list of available sanitizers. <strong>Hint:</strong> you'll want to use at least the
num-sanitizer
and the
string-sanitizer
. Call your new table
student-data-sanitized
.</p> <p>When you run the sanitizer, your spreadsheet should give an error when it tries to sanitize the "extracurricular-hours" column. If it doesn't, look at which sanitizer you are using with the column. Since we are asking for hours, we should be using the
num-sanitizer
for this column. To fix these sanitizer errors, you'll have to edit the spreadsheet. If the data for the cell that is giving an error makes no sense for that cell, it's best to remove it. However, if there is a clear way clean up the data, you can edit the data for that cell.</p> <p>When you need to make edits to the data, it's much better to edit a copy of the data, leaving the original data in place. We'll do this by copying our data into a new sheet in the spreadsheet. Click on the small arrow in the "Form Responses 1" tab in the bottom left of the spreadsheet and then select "Duplicate".</p> <p><img src=“new_sheet.png” width=“50%”></p> <p>Once you've duplicated the sheet, rename it to "Cleaned Data". </p> <p><img src=“rename_sheet.png” width=“50%”></p> <p>Now when you load the spreadsheet, your <strong>source</strong> line will look like this:</p> <pre>
source: <span class="hljs-keyword">data</span>-sheet.sheet-<span class="hljs-keyword">by</span>-<span class="hljs-keyword">name</span>(<span class="hljs-string">"Cleaned Data"</span>, <span class="hljs-literal">true</span>)
</pre><p>Make edits to the "Cleaned Data" sheet of your spreadsheet until your data passes the sanitizer and answer the following question with a block comment in your code. Make the first line of your comment
Question 1.1
to make it easy for the coach to find your answer.</p> <p><strong>Question 1.1:</strong> What type of edits did you need to make to the data pass the sanitizer? How did you handle cells that were obviously not correct? How did you handle the other types of errors? Take a look at "Data" sheet, which contains our edits to the table. Do you agree with our edits? Did you do anything different?</p> <h3 id=“1-2-normalizing-the-house-column”>1.2 Normalizing the House Column</h3> <p>Now that we have our sanitized table, we can now work on normalizing the data in our columns. In the "house" column, students were asked to select which one of the nine houses they live in. For students who don't live in one of those houses, they selected "other" and listed where they live. What we would like to do is normalize this colum such that if the answer is not one of the nine Vassar houses, we should change the response to "OTHER". The nine houses listed on the form were:</p> <ul> <li>Main</li> <li>Strong</li> <li>Raymond</li> <li>Davison</li> <li>Lathrop</li> <li>Jewett</li> <li>Josselyn</li> <li>Cushing</li> <li>Noyes</li> </ul> <p><em>Hint:</em> you may find the
transform-column
function useful.</p> <p><strong>Check Yourself:</strong> How do you know you did the normalization correctly? You can cross check your data by using the
count
function on your new table. You should see all 9 houses represented in the count table, and the value for "OTHER" should be 5.</p> <h3 id=“1-3-normalizing-the-student-athlete-column”>1.3 Normalizing the Student-athlete Column</h3> <p>Use the table that created in part 1.2 as your starting point for this section.</p> <p>To normalize this column, we are going to transform the "student-athlete" column from a
String
to a
Boolean
. To do this, we'd like to take any response that has "yes" (in either upper or lower case) in any part of the string as
true
and
false
otherwise. Before you begin, take a moment to write down the tasks that you need to do for this computation. You'll need to make use of a couple of <a href=“https://www.pyret.org/docs/latest/strings.html#%28part._.String_.Functions%29”>string functions</a> to do this transformation. Also remember that the
transform-column
can return a different type from its input type.</p> <p>Give your new table the name
student-data-cleaned
. We'll use this table in Part2.</p> <h2 id=“part-2-analyze-the-data”>Part 2: Analyze the Data</h2> <p>Now that the data has been tidied up, we can start asking questions about it. </p> <h3 id=“2-1-using-count”>2.1 Using Count</h3> <p>Both of these questions can answered by using
count
. You don't need to come up with an expression that computes the answer directly. </p> <p>What percentage of the survey responders are student-athletes?</p> <p>Which house has the most survey students living there?</p> <h3 id=“2-2-data-visualization”>2.2 Data Visualization</h3> <p>Show a
scatter-plot
which has the "stem-level" on the x-axis and "schoolwork-hours" on the y-axis. Do you notice any correlation between the points?</p> <p>Show a
histogram
of the "sleep-level" column with a bin-width of 1.</p> <h3 id=“2-3-choose-your-own-adventure”>2.3 Choose Your own Adventure</h3> <p>Ask and answer at least one more question about the data. Discuss why you wanted to know the answer to that question.</p> <h2 id=“part-3-going-further-optional-”>Part 3: Going Further (optional)</h2> <p>Congratulations! You have reached the end of lab. Here is an optional exercise in case you are looking for a challenge. You are not required to do this exercise. </p> <h3 id=“3-1-percent-true”>3.1 Percent True</h3> <p>Write a helper function
percent-true
which takes a table and column name as input and returns a the percent of rows that are
true
for the column specified. What is neat about this helper function, is it will work on <strong>any</strong> table that has a column of type
Boolean
. Use this helper function to find the percentage of survey responders who are student-athletes. Check to see if it's the same answer you got for part 2.1.</p> <pre>
<span class="hljs-function"><span class="hljs-keyword">fun</span> percent-<span class="hljs-title">true</span><span class="hljs-params">(t :: <span class="hljs-type">Table</span>, col :: <span class="hljs-type">String</span>)</span></span> -> Number: doc: <span class="hljs-string">"returns the percentage of rows that in column 'col' that are true"</span> ... end
</pre><h2 id=“submitting-the-lab”>Submitting the Lab</h2> <p>When you've complete the exercises, show your code to your instructor or one of the coaches.</p> <p>Then, upload your
lab04.arr
file to the Lab 4 assignment on <a href=“https://www.gradescope.com”>Gradescope</a>.</p> <h2 id=“just-for-fun”>Just for Fun</h2> <p>If you have extra time, check out these <a href=“https://www.tylervigen.com/spurious-correlations”>spurious correlations</a>. It’s entertaining, we promise!</p> <h2 id=“acknowledgments”>Acknowledgments</h2> <p>This lab includes material adapted from Kathi Fisler and colleagues at Brown University.</p> </html>