<html> <style type=“text/css”> .comment { font-style: italic; color: #c0342d; } .name, .variable, code var, pre var { color: #9e5cb1; } .name { font-weight: bold; } .variable, pre var, code var { font-style: italic; } .keyword { color: #abafb3; } .builtin, .string { color: #417fb8; } .constant { color: #e89f27; } table { margin: 1em auto; } table td { padding: 1pt 4pt; } div.dw-content a { text-decoration: underline; } div.dw-content p, div.dw-content li, div.dw-content li p { font-size: 13pt; margin-bottom: 1em; max-width: 700px; } p code, li code { font-size: 11pt; } blockquote { font-size: inherit; } pre kbd, code kbd { background: inherit; color: inherit; box-shadow: none; padding: 0; font: inherit; font-weight: bold; } a.secret { color: inherit; text-decoration: none !important; } a.secret:hover { text-decoration: underline !important; } </style>

<h1 id=“lab-4-cleaning-table-data”>Lab 4: Cleaning Table Data</h1> <p>23 September 2022</p> <h2 id=“todays-lab”>Today’s lab</h2>

<p>The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.</p>

<p>Up until now, the data that you worked with in lab and on assignments has been sanitized for you – that is, you could expect Pyret to process the data without any extra work.</p>

<p>For this lab, we’ll analyze the data from the survey that you and the other students taking 101 filled out earlier this week. Since the data is being gathered directly from Google Forms submissions, there exists data that either isn’t valid or isn’t useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need.</p>

<h2 id=“getting-started”>Getting started</h2> <p><strong>Task</strong>: Go to the <a href=“https://docs.google.com/spreadsheets/d/10dXMqd5nyg20_-2rdtHLkEfnWxXDUhvw4j5_We4ssek/edit?usp=sharing”>response spreadsheet</a> and make a copy of it (by selecting <em>File</em> → <em>Make a Copy</em>).</p>

<img src=“https://www.cs.vassar.edu/~cs101/images/make_a_copy.png” width=“75%”>

<p>This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.</p>

<p>Once you’ve made a copy of the spreadsheet, you need to share the spreadsheet so Pyret can import it.</p>

<p><strong>Task</strong>: Press the green <em>Share</em> button in the upper right hand corner of the spreadsheet, then click on <em>Get link</em> and select <em>Anyone with the link</em> from the dropdown.</p>

<img src=“https://www.cs.vassar.edu/~cs101/images/get_link.png” width=“75%”>

<p>Now we’re ready to import the spreadsheet into a Pyret program.</p>

<p><strong>Task</strong>: Add the following code at the top of your program:</p> <pre>

<span class="keyword">include</span> gdrive-sheets
<span class="keyword">include</span> data-source

<span class="keyword">include</span> shared-gdrive(<span class="string">&quot;dcic-2021&quot;</span>,
  <span class="string">&quot;1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep&quot;</span>)


<var>ssid</var> = <span class="string">&quot;&quot;</span>  <span class="comment"># Put the ID of your spreadsheet here</span>
<var>data-sheet</var> = load-spreadsheet(ssid)

<var>student-data</var> =
  <span class="keyword">load-table</span>:
    timestamp, house, stem-level, sleep-hours,
    schoolwork-hours, student-athlete, extracurricular-hours
    <span class="keyword">source</span>: data-sheet.sheet-by-name(<span class="string">&quot;Responses&quot;</span>, <span class="constant">true</span>)
    <span class="keyword">sanitize</span> student-athlete <span class="keyword">using</span> string-sanitizer
  <span class="keyword">end</span>

student-data

</pre>

<p><strong>Task</strong>: Put the ID of your copy of the spreadsheet in your code above.</p>

<p><strong>Note</strong>: Your

ssid

will look like the highlighted portion of the spreadsheet URL:</p>

<blockquote> <p>

https://docs.google.com/spreadsheets/d/<span style="background: yellow; font-weight: bold">10dXMqd5nyg20_-2rdtHLkEfnWxXDUhvw4j5_We4ssek</span>/edit

</p> </blockquote>

<p>But your

ssid

will be different; don’t use this one!</p>

<p>To make sure the spreadsheet loaded correctly, check that you see the

student-data

table when you run the program. If you are not able to see your table, call your instructor or a coach over to help you troubleshoot.</p>

<h2 id=“part-1-cleaning-the-data”>Part 1: Cleaning the data</h2>

<p>Now that we have the data collected, we need to make sure that the data is clean and processed before we analyze it.</p>

<h3 id=“exercise-11-data-sanitizing”>Exercise 1.1: Data sanitizing</h3>

<p>Have a look at your

student-data

table. Notice that everything is represented as a string, even the numeric columns. Also notice the values in the last two columns are in the format

some(x)

or

none

.</p>

<p>In order to turn these values into the data types we all know and love –

String

,

Number

, etc. – we need to use the

sanitize

functions when we load the table. See the <a href=“https://dcic-world.org/2022-08-28/processing-tables.html#%28part._missing-data%29”>textbook</a> for an example of how to do this and the <a href=“https://www.pyret.org/docs/latest/data-source.html#%28part._.Pre-defined_sanitizers%29”>Pyret documentation</a> for a list of available sanitizers.</p>

<p><strong>Hint</strong>: You’ll want to use at least

num-sanitizer

and

string-sanitizer

.</p>

<p><strong>Task</strong>: Make a new table called

student-data-sanitized

, which is the same as

student-data

but using the sanitizers.</p>

<p>When you try to sanitize the

extracurricular-hours

column, Pyret should report one or more errors due to values that can’t be converted to numbers. To fix these sanitizer errors, you’ll have to edit the spreadsheet.</p>

<p>When you need to make edits to the data, it’s much better to edit a <em>copy</em> of the data, leaving the original data unchanged. So, that’s what we’ll do:</p>

<p><strong>Task</strong>: Copy our data into a new sheet in the spreadsheet by clicking on the small arrow in the <em>Responses</em> tab in the bottom left of the spreadsheet window and then select <em>Duplicate</em>.</p>

<img src=“https://www.cs.vassar.edu/~cs101/images/new_sheet.png” width=“75%”>

<p><strong>Task</strong>: Once you’ve duplicated the sheet, rename it to “Cleaned Data”.</p>

<img src=“https://www.cs.vassar.edu/~cs101/images/rename_sheet.png” width=“75%”>

<p>Now when you load the spreadsheet to make

student-data-sanitized

, your

source

line will look like this:</p>

<pre>

<span class="keyword">source</span>: data-sheet.sheet-by-name(<span class="string">&quot;Cleaned Data&quot;</span>, <span class="constant">true</span>)

</pre>

<p><strong>Task</strong>: Make edits to the “Cleaned Data” sheet of your spreadsheet until your data passes the sanitizer. If the data for the cell that’s giving an error makes no sense for that cell, it’s best to remove it. However, if there is a clear way clean up the data, you should edit the data for that cell.</p>

<p><strong>Task</strong>: Answer the following question with a block comment (

#| ... |#

) in your code. Make the first line of your comment

Question 1.1

to make it easy for the coach to find your answer.</p>

<p><strong>Question 1.1</strong>: What type of edits did you need to make to the data pass the sanitizer? How did you handle cells that were obviously not correct? How did you handle the other types of errors? Take a look at the “Cleaned Data” sheet in <a href=“https://docs.google.com/spreadsheets/d/162iotKSv2FmRvcr-DGTEmLZZGic_BYeJfUXRUEMpMD8/edit?usp=sharing”>this copy of the spreadsheet</a>, which contains our edits to the table. Do you agree with our edits? Did you do anything different?</p>

<h3 id=“exercise-12-normalizing-the-house-column”>Exercise 1.2: Normalizing the

house

column</h3>

<p>Now that we have our sanitized table, we can now work on normalizing the data in our columns. In the

"house"

column, students were asked to select which one of the nine houses they live in. For students who don’t live in one of those houses, they selected “other” and listed where they live.</p> <p><strong>Task</strong>: Use

transform-column

to normalize this column such that if the answer is not one of the nine Vassar houses, we should change the response to

"OTHER"

.</p>

<p>For reference, the nine houses listed on the form were:</p>

<ul> <li>

"Main Building (1861)"

</li> <li>

"Strong House (1893)"

</li> <li>

"Raymond House (1897)"

</li> <li>

"Lathrop House (1901)"

</li> <li>

"Davison House (1902)"

</li> <li>

"Jewett House (1907)"

</li> <li>

"Josselyn House (1912)"

</li> <li>

"Cushing House (1927)"

</li> <li>

"Noyes House (1958)"

</li> </ul>

<p><strong>Check yourself</strong>: How do you know you did the normalization correctly? You can cross check your data by using the

count

function on your new table. You should only see the nine houses represented in the count table, along with

"OTHER"

. (If there were no responses for a particular house, it won’t show up – this is fine.)</p>

<h3 id=“exercise-13-normalizing-the-student-athlete-column”>Exercise 1.3: Normalizing the

student-athlete

column</h3>

<p>Use the table you created in Exercise 1.2 as your starting point for this exercise.</p>

<p>To normalize the

"student-athlete"

column, we’re going to change it from a

String

to a

Boolean

:</p>

<p><strong>Task</strong>: Make a new table

student-data-cleaned

, where the

"student-athlete"

column is

true

for any response that has

"yes"

(in upper, lower, or mixed case) anywhere in the string and

false

otherwise.</p>

<p>Before you begin, take a moment to write down the tasks that you need to do for this computation. You’ll need to make use of a couple of <a href=“https://www.pyret.org/docs/latest/strings.html#%28part._.String_.Functions%29”>string functions</a> to do this transformation. Also remember that the

transform-column

can return a different type from its input type.</p>

<h2 id=“part-2-analyze-the-data”>Part 2: Analyze the data</h2>

<p>Now that the data has been tidied up, we can start asking questions about it.</p>

<h3 id=“exercise-21-using-count”>Exercise 2.1: Using

count

</h3>

<p><strong>Task</strong>: Use the

count

function to answer the following questions:</p> <ul> <li>What percentage of the survey responders are student-athletes?</li> <li>Which house has the most survey students living there?</li> </ul>

<p>You don’t need to come up with an expression that computes the answer directly; just write your answer in a comment.</p>

<h3 id=“exercise-22-data-visualization”>Exercise 2.2: Data visualization</h3>

<p><strong>Task</strong>: Show a

scatter-plot

which has the

"stem-level"

on the x-axis and

"schoolwork-hours"

on the y-axis. (Refer to the table documentation for plot functions.) Do you notice any correlation between the points?</p>

<p><strong>Task</strong>: Show a histogram of the

"sleep-hours"

column with a

bin-width

of

1

.</p>

<h3 id=“exercise-23-choose-your-own-adventure”>Exercise 2.3: Choose your own adventure</h3>

<p><strong>Task</strong>: Ask and answer at least one more question about the data. Discuss why you wanted to know the answer to that question.</p>

<h2 id=“optional-part-3-going-further”>(Optional) Part 3: Going further</h2>

<p>Congratulations! You have reached the end of lab. Here is an optional exercise in case you are looking for a challenge:</p>

<p><strong>Task</strong>: Write a function

percent-true

that takes a table and column name as input and returns the percent of rows that are

true

for the column specified.</p>

<pre>

<span class="keyword">fun</span> <span class="name">percent-true</span>(t :: Table, col :: String) -&gt; Number:
  <span class="keyword">doc</span>: <span class="string">&quot;Return the percentage of rows that are true in column &#39;col&#39;&quot;</span>
  ...
<span class="keyword">end</span>

</pre>

<p>What’s neat about this function is it will work on <em>any</em> table that has a column of type

Boolean

!</p>

<p><strong>Task</strong>: Use this helper function to find the percentage of survey responders who are student-athletes. Check to see if it’s the same answer you got for Exercise 2.1.</p>

<h2 id=“submitting-the-lab”>Submitting the lab</h2> <ul> <li><p>When you’ve completed the exercises, show your code to your instructor or one of the coaches.</p></li> <li><p>Then upload your

lab04.arr

file to the Lab 4 assignment on <a href=“https://www.gradescope.com”>Gradescope</a>.</p></li> </ul>

<h2 id=“just-for-fun”>Just for fun</h2>

<p>When you’re done with the lab, check out these <a href=“https://www.tylervigen.com/spurious-correlations”>spurious correlations</a>. It’s entertaining, we promise!</p>

<h2 id=“acknowledgments”>Acknowledgments</h2>

<p>This lab includes material adapted from Kathi Fisler and colleagues at Brown University and Jason Waterman at Vassar College.</p> </html>