<html> <h1 id=“lab-4-cleaning-table-data”>Lab 4: Cleaning Table Data</h1> <h2 id=“learning-objective”>Learning Objective</h2> <p>The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.</p> <p>Up until now, the data that you worked with had been sanitized for you (i.e. you could expect Pyret to process the data without any extra work). The data in this lab, however, is being ported directly from Google Forms submissions, so there exists data that either isn’t valid or isn’t useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need. For this lab we&#39;ll analyze the data from the form you and the other students filled out earlier this week.</p> <h2 id=“getting-started”>Getting Started</h2> <p>Go to the <a href=“https://docs.google.com/spreadsheets/d/1aK2_X-qZ_u1bXHkYHzeuf-7BEuxrllse5OeIQDrh71M/edit?usp=sharing”>response spreadsheet</a>, make a copy of it (select “Make a Copy” under the “File” tab), and import it into your program.</p> <p><img src=“make_a_copy.png” width=“75%”></p> <p>This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.</p> <p>Once you&#39;ve made a copy of the spreadsheet, you need to share the spreadsheet so Pyret can import it. Press the green &quot;Share&quot; button in the upper right hand corner of the spreadsheet, then click on &quot;Get link&quot; and select &quot;Anyone with the link&quot; from the dropdown.</p> <p><img src=“get_link.png” width=“75%”></p> <p>To import the spreadsheet, include the following code at the top of your program. You will put the

ssid

of your copy of the spreadsheet in your code below. </p> <p><strong>Note</strong>: The highlighted portion of the spreadsheet URL is what your ssid looks like:</p> <p><a href=“https://docs.google.com/spreadsheets/d/`1PCNs1fcNCiEepbDMc5u9Edar-EZjSqUwf-Z41i33fkM`/edit”>https://docs.google.com/spreadsheets/d/`1PCNs1fcNCiEepbDMc5u9Edar-EZjSqUwf-Z41i33fkM`/edit</a></p> <p>Your ssid will be different, so don&#39;t use the one above. </p> <pre>

<span class="hljs-built_in">include</span> shared-gdrive(<span class="hljs-string">"dcic-2021"</span>,
  <span class="hljs-string">"1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep"</span>)
<span class="hljs-built_in">include</span> gdrive-sheets

<span class="hljs-comment"># spreadsheet id from Google Sheets</span>
ssid = <span class="hljs-string">""</span> <span class="hljs-comment"># Put the ssid of your spreadsheet here</span>
data-sheet = <span class="hljs-built_in">load</span>-spreadsheet(ssid)
student-data =
  <span class="hljs-built_in">load</span>-table: timestamp, house, stem-level, sleep-hours, schoolwork-hours, 
  student-athlete, extracurricular-hours
  source: data-sheet.sheet-<span class="hljs-keyword">by</span>-name(<span class="hljs-string">"Form Responses 1"</span>, <span class="hljs-literal">true</span>)
  <span class="hljs-function"><span class="hljs-keyword">end</span>
<span class="hljs-title">student-data</span></span>

</pre><p>See if you can view the

student-data

table to make sure the spreadsheet loaded correctly. If you are not able to see your table, call your instructor or a coach over to help you troubleshoot. </p> <h2 id=“part-1-cleaning-the-data”>Part 1: Cleaning the Data</h2> <p>Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.</p> <h3 id=“1-1-data-sanitizing”>1.1 Data Sanitizing</h3> <p>Once you import your

student-data

table, have a look at it. Notice that everything is represented as a string, even the numeric columns. Also notice the values in the last two columns are in the format

some(x)

or

none

. In order to turn these values into the data types we all know and love (

String

,

Num

, etc.) we need to use the

sanitize

functions upon import. See the <a href=“https://dcic-world.org/2021-08-21/processing-tables.html#%28part._missing-data%29”>texbook</a> for an example of how to do this and the <a href=“https://www.pyret.org/docs/latest/data-source.html#%28part._.Sanitizers%29”>manual</a> for a list of available sanitizers. <strong>Hint:</strong> you&#39;ll want to use at least the

num-sanitizer

and the

string-sanitizer

. Call your new table

student-data-sanitized

.</p> <p>When you run the sanitizer, your spreadsheet should give an error when it tries to sanitize the &quot;extracurricular-hours&quot; column. If it doesn&#39;t, look at which sanitizer you are using with the column. Since we are asking for hours, we should be using the

num-sanitizer

for this column. To fix these sanitizer errors, you&#39;ll have to edit the spreadsheet. If the data for the cell that is giving an error makes no sense for that cell, it&#39;s best to remove it. However, if there is a clear way clean up the data, you can edit the data for that cell.</p> <p>When you need to make edits to the data, it&#39;s much better to edit a copy of the data, leaving the original data in place. We&#39;ll do this by copying our data into a new sheet in the spreadsheet. Click on the small arrow in the &quot;Form Responses 1&quot; tab in the bottom left of the spreadsheet and then select &quot;Duplicate&quot;.</p> <p><img src=“new_sheet.png” width=“50%”></p> <p>Once you&#39;ve duplicated the sheet, rename it to &quot;Cleaned Data&quot;. </p> <p><img src=“rename_sheet.png” width=“50%”></p> <p>Now when you load the spreadsheet, your <strong>source</strong> line will look like this:</p> <pre>

source: <span class="hljs-keyword">data</span>-sheet.sheet-<span class="hljs-keyword">by</span>-<span class="hljs-keyword">name</span>(<span class="hljs-string">"Cleaned Data"</span>, <span class="hljs-literal">true</span>)

</pre><p>Make edits to the &quot;Cleaned Data&quot; sheet of your spreadsheet until your data passes the sanitizer and answer the following question with a block comment in your code. Make the first line of your comment

Question 1.1

to make it easy for the coach to find your answer.</p> <p><strong>Question 1.1:</strong> What type of edits did you need to make to the data pass the sanitizer? How did you handle cells that were obviously not correct? How did you handle the other types of errors? Take a look at &quot;Data&quot; sheet, which contains our edits to the table. Do you agree with our edits? Did you do anything different?</p> <h3 id=“1-2-normalizing-the-house-column”>1.2 Normalizing the House Column</h3> <p>Now that we have our sanitized table, we can now work on normalizing the data in our columns. In the &quot;house&quot; column, students were asked to select which one of the nine houses they live in. For students who don&#39;t live in one of those houses, they selected &quot;other&quot; and listed where they live. What we would like to do is normalize this colum such that if the answer is not one of the nine Vassar houses, we should change the response to &quot;OTHER&quot;. The nine houses listed on the form were:</p> <ul> <li>Main</li> <li>Strong</li> <li>Raymond</li> <li>Davison</li> <li>Lathrop</li> <li>Jewett</li> <li>Josselyn</li> <li>Cushing</li> <li>Noyes</li> </ul> <p><em>Hint:</em> you may find the

transform-column

function useful.</p> <p><strong>Check Yourself:</strong> How do you know you did the normalization correctly? You can cross check your data by using the

count

function on your new table. You should see all 9 houses represented in the count table, and the value for &quot;OTHER&quot; should be 5.</p> <h3 id=“1-3-normalizing-the-student-athlete-column”>1.3 Normalizing the Student-athlete Column</h3> <p>Use the table that created in part 1.2 as your starting point for this section.</p> <p>To normalize this column, we are going to transform the &quot;student-athlete&quot; column from a

String

to a

Boolean

. To do this, we&#39;d like to take any response that has &quot;yes&quot; (in either upper or lower case) in any part of the string as

true

and

false

otherwise. Before you begin, take a moment to write down the tasks that you need to do for this computation. You&#39;ll need to make use of a couple of <a href=“https://www.pyret.org/docs/latest/strings.html#%28part._.String_.Functions%29”>string functions</a> to do this transformation. Also remember that the

transform-column

can return a different type from its input type.</p> <p>Give your new table the name

student-data-cleaned

. We&#39;ll use this table in Part2.</p> <h2 id=“part-2-analyze-the-data”>Part 2: Analyze the Data</h2> <p>Now that the data has been tidied up, we can start asking questions about it. </p> <h3 id=“2-1-using-count”>2.1 Using Count</h3> <p>Both of these questions can answered by using

count

. You don&#39;t need to come up with an expression that computes the answer directly. </p> <p>What percentage of the survey responders are student-athletes?</p> <p>Which house has the most survey students living there?</p> <h3 id=“2-2-data-visualization”>2.2 Data Visualization</h3> <p>Show a

scatter-plot

which has the &quot;stem-level&quot; on the x-axis and &quot;schoolwork-hours&quot; on the y-axis. Do you notice any correlation between the points?</p> <p>Show a

histogram

of the &quot;sleep-level&quot; column with a bin-width of 1.</p> <h3 id=“2-3-choose-your-own-adventure”>2.3 Choose Your own Adventure</h3> <p>Ask and answer at least one more question about the data. Discuss why you wanted to know the answer to that question.</p> <h2 id=“part-3-going-further-optional-”>Part 3: Going Further (optional)</h2> <p>Congratulations! You have reached the end of lab. Here is an optional exercise in case you are looking for a challenge. You are not required to do this exercise. </p> <h3 id=“3-1-percent-true”>3.1 Percent True</h3> <p>Write a helper function

percent-true

which takes a table and column name as input and returns a the percent of rows that are

true

for the column specified. What is neat about this helper function, is it will work on <strong>any</strong> table that has a column of type

Boolean

. Use this helper function to find the percentage of survey responders who are student-athletes. Check to see if it&#39;s the same answer you got for part 2.1.</p> <pre>

<span class="hljs-function"><span class="hljs-keyword">fun</span> percent-<span class="hljs-title">true</span><span class="hljs-params">(t :: <span class="hljs-type">Table</span>, col :: <span class="hljs-type">String</span>)</span></span> -&gt; Number:
  doc: <span class="hljs-string">"returns the percentage of rows that in column 'col' that are true"</span>
  ...
end

</pre><h2 id=“submitting-the-lab”>Submitting the Lab</h2> <p>When you&#39;ve complete the exercises, show your code to your instructor or one of the coaches.</p> <p>Then, upload your

lab04.arr

file to the Lab 4 assignment on <a href=“https://www.gradescope.com”>Gradescope</a>.</p> <h2 id=“just-for-fun”>Just for Fun</h2> <p>If you have extra time, check out these <a href=“https://www.tylervigen.com/spurious-correlations”>spurious correlations</a>. It’s entertaining, we promise!</p> <h2 id=“acknowledgments”>Acknowledgments</h2> <p>This lab includes material adapted from Kathi Fisler and colleagues at Brown University.</p> </html>