Home / Psets / Problem Set 5: Modeling Temperature Change

Problem Set 5: Modeling Temperature Change

The questions below are due on Monday May 11, 2020; 11:00:00 PM.
 
You are not logged in.

If you are a current student, please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.csail.mit.edu) to authenticate, and then you will be redirected back to this page.
**Checkoff: ** There is no checkoff for this Pset.
Download Files

Objectives:

  • Generate graphs based on experimental data
  • Use regression analysis to model experimental data
  • Analyze and evaluate trends based on regression models

Collaboration:

  • Students may work together, but students are not permitted to look at or copy each other’s code or code structure.
  • Each student should write up and hand in their assignment separately.
  • Include the names of your collaborators in a comment at the start of each file.
  • Please refer to the collaboration policy in the Course Information for more details.

Introduction

In this pset, we will evaluate the 'controversial' change in temperature over time in the U.S. Using data from the National Centers for Environmental Information (NCEI), you will use regression analysis to model the temperature of different cities in the United States.

Getting Started

Download ps5.zip from the problem set website. This folder contains the following files:

  • data.csv: comma separated value file containing the NCEI temperature data
  • ps5.py: skeleton code
  • ps5_tester.py: tester code
  • ps5_writeup.docx: template Word document for your write-up (Google Doc version)

Make sure all these files are saved in the same directory. Please do not rename these files, change any of the provided helper functions, or any function signatures.

Libraries

In this pset we will be using:

  • numpy library to store and manipulate data
  • matplotlib.pyplot to plot data

Write-up

In addition to writing code in ps5.py, you will also be submitting a write-up using the provided template. You will be asked to include screenshots of your generated plots. If you believe your plots are wrong, you can include an additional explanation of what you expect the plot to look like.

Since there is no checkoff, the write-up is your opportunity to demonstrate your understanding of your work.** Please write in full sentences**. Your answers to each question should not exceed 2-3 sentences.

1) Linear Regression

The Dataset Class

In ps5.py, we have provided you with the Dataset class. You will be using this class to access the data in data.csv for use throughout this pset. Read through the docstrings, and make sure you understand what each function does and how to use them (you do not need to understand the implementation details).

Open up data.csv and look at how the raw data is formatted. Each row specifies

  • the city,
  • the average daily temperature in Celsius, and
  • the date (within 1961-2016 period).

Minimizing the Squared Error

In this part, we will write code to fit a simple linear regression to our data set.

For each data point, we have the following information:

  • the x-coordinate (the independent variable) is an int representing the year of the sample (e.g. 1997), and
  • the y-coordinate (the dependent variable) is a float representing the average temperature observed that year.

Our goal is to find the line of the form y = mx + b that best fits the dataset. In other words, we want to find the line that best predicts the temperature values as a function of the year. The best-fit line minimizes the total squared error across all data points.

The squared error (SE) for a given point (x, y) is a function of the actual value y and the regression estimate y'.

SE(y, y') = (y – y')^2

Our goal is to minimize the total squared error, \sum_{i=1}^{n} SE(y_i, y_i'), for the n points in our data set. We can do this by taking the derivative and finding the critical points, which gives us the following equations:

m = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

b= \bar{y}-(m* \bar{x})

where \bar{x} and \bar{y} are average of all the x and y values, respectively.

Implement the **linear_reg** function according to the provided docstring. Please use the equations above to implement this function. Do **NOT** use **numpy.polyfit** for this problem. _Run the **ps5_tester.py** file to test your implementation; your code should pass **test_linear_reg**_.

Evaluating the Fit

Now that we have calculated the best-fit line, we will assess how to actually fits our data set by calculating the total squared error across all the data points. Remember the total squared error is calculated using the following formula:

\sum_{i=1}^{n} (y_i – y')^2

Implement the **total_squared_error** function according to the provided docstring. _Your code should pass **test_total_squared_error** in **ps5_tester.py**_.

2) Curve Fitting

In this section, you will write code to fit several polynomial regression models to our data set.

Implement the **fit_models** function according to the provided docstring. This function should return a list of best-fit polynomial models for a given data set. A model is defined as a one-dimensional numpy array containing the coefficients of the polynomial. Therefore, the value returned by this function is a **list of numpy arrays**. _Your code should now pass **test_fit_models**._

Example:

>>> print(fit_models(np.array([1961, 1962, 1963]),   
                          np.array([-4.4, -5.5, -6.6]),  [1, 2]))
[array([ -1.10000000e+00,   2.15270000e+03]),
array([  8.86320195e-14,  -1.10000000e+00,   2.15270000e+03])]

Note that due to numerical errors when working with floats, it is fine if you do not get this exact output, but it should be very close.

####Hints:

  • Check out the documentation for numpy.polyfit. It is useful for this problem.
  • The returned list of polynomial regression models should appear in the same order as their corresponding integers in the degrees parameter.
  • The inputs x and y are one-dimensional numpy arrays, NOT Python lists. Although they are similar, do not expect them to always behave like Python lists. Some functions you will be calling only work with numpy arrays, so be careful!
  • The documentation for numpy N-dimensional arrays can be found here.

3) Evaluating the Models

In this problem, we will write a function that uses numerical and graphical metrics to evaluate the regression models generated by fit_models.

More specifically, we will use the following:

  • The model’s R^2 value (i.e. its coefficient of determination). You are welcome to use the r2_score function that we have imported for you in ps5.py.
  • A plot of the data samples and the polynomial curves.
  • The standard error over slope (only for linear regression models).
**Standard Error Over Slope**

The ratio of the standard error of the fitted curve’s slope to the slope measures how likely it is that you would see the upward or downward trend in your data and fitting curve purely by chance. **The larger the absolute value of this ratio is, the more likely it is that the trend is due to chance.** If you are interested in learning more, check out: [Hypothesis Test for Regression Slope](http://stattrek.com/regression/slope-test.aspx?Tutorial=AP).

**In our case, if the absolute value of the ratio is less than 0.5, the trend is significant (i.e. not by chance).** We have written a helper function `standard_error_over_slope` that calculates this value for you.

**Implement the function `evaluate_models_training` according to its docstrings.**

**Hint:** You might find [numpy.poly1d](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.poly1d.html) or [numpy.polyval](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.polyval.html) helpful for calculating the predicted y values based on the model. Refer to Lecture 12 of 6.0001 for detailed explanations on plotting using [matplotlib.pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html).

You should make a separate plot for each model. Your figure should adhere to the following guidelines:
  • **Plot the data points** as individual blue dots.
  • **Plot the model** with a red solid line color.
  • **Include a title**. Your title should include the $R^2$ value of the model and the degree. If the model is a linear curve (i.e. its degree is one), the title should also include the ratio of the standard error of this fitted curve's slope to the slope.
  • **Label the axes**. You may assume this function will only be used in the case where the x-axis represents years and the y-axis represents temperature in degrees Celsius.
Please round your $R^2$ and SE/slope values to **4 decimal places**. The R^2 values will be tested by `ps5_tester.py` but the graphs will not. However, you will use the graphs from this function for discussion questions 4A, 4B, 5B, 5C, and 6B. _You should now pass **test_evaluate_models_training** in the tester._

4) Sampling the Data

Now that we have all the components we need, we can select data samples from the raw temperature records and investigate trends for those samples. In this problem, we will try out two different methods of sampling data.

Part 4A: Daily Temperature

For our first sampling method, we will pick an arbitrary day of the year and see whether we can find any trends in the temperature on this day across many years.

**Write your code for Problem 4A under** `if __name__ == '__main__'`.
Use the `Dataset` class to create a data set where
  • the **x-coordinates** are the years from 1961 to 2016 and
  • the **y-coordinates** are the temperature measurements in San Francisco on December 12th for each year

Next, fit the data to a degree-one polynomial with `fit_models` and plot the regression results using `evaluate_models_training`. Include your plot in the writeup under Plot 4A

Part 4B: Annual Temperature

Let’s try another way to sample data points. Instead of looking at the change in temperature for a single day, we will look at the change in the average annual temperature. Repeat the steps in 4A, but now let the y-coordinates be the average annual temperature in San Francisco. Include this plot in the writeup under Plot 4B

1. **Implement the function** `get_cities_averages` **according to its docstrings**.

**Example**:
```>>> print(get_cities_averages(dataset, ['TAMPA', 'DALLAS'], range(2008, 2015)))```
```[21.76502732 21.24561644 20.8040411 22.03910959 22.27206284 21.31136986 20.88123288]```

The first element in the returned numpy array corresponds to the average of the annual temperatures for Tampa and Dallas in the year 2008.

Your code should pass `test_get_cities_averages`.

2. Use `generate_city_averages` **to generate the y-coordinates for your new data set under** `if __name__ == '__main__'`.

Fit your dataset to a degree-one polynomial with fit_models and plot the regression results using `evaluate_models_training`.

In **ps5_writeup.pdf**, complete the following sections:
  • Plot from 4A
  • Plot from 4B
  • 4.1 What difference does choosing a specific day to plot the data versus calculating the yearly average have on the goodness of fit of the model? Interpret the results.
  • 4.2 Why do you think these graphs are so noisy?

5) Long-Term vs. Short-Term Trends

Looking at trends within smaller time intervals may lead us to make different conclusions from the data. In this problem, you will make use of linear regression models to construct conflicting narratives for the change in the average annual temperature in Tampa.

Part 5A: Most Extreme Slope

**Implement the function** `identify_extreme_trend` **according to its docstrings.**

This function takes as arguments the x and y numpy arrays of samples, an interval length, and a specified trend. It returns a tuple that represents the start and end **indices** of the x and y numpy arrays whose linear regression model produces the most positive or most negative slope (as specified by the `positive_slope` parameter).

**Note**: Due to floating point precision errors, use a tolerance of 1e-8 to compare slope values (i.e. a float x and a float y are considered equal if $abs(x - y) <= 1e-8$).

Your code should pass `test_identify_extreme_trend`.

Part 5B: Increasing

Suppose you are the mayor of Tampa, and you are taking a data-driven approach to policy. You want to call your citizens to action against rising temperatures by finding data supporting the claim that temperatures are increasing in Tampa.

**Write your code for Problem 5B under** `if __name__ == '__main__'`.

Use `identify_extreme_trend` to identify a window of **30 years** that demonstrates that the average annual temperature in Tampa is **rising**. Plot the corresponding model with `evaluate_models_training` and include it under section 5B in the writeup.

In **ps5_writeup.pdf**, complete the following sections:
  • Increasing Plot
  • 5.1 What were the start and end years for your window? What was the slope?

###Part 5C: Decreasing

The political group “Turn Down the AC” has donated 1 trillion dollars to your campaign, as long as you amend your previous statements about temperature change to a more agnostic opinion.

**Write your code for Problem 5C under** `if __name__ == '__main__'`.

Use `identify_extreme_trend` to identify a window of **15 years** that demonstrates that the average annual temperature in Tampa is **decreasing**. Plot the corresponding model with `evaluate_models_training`.

In **ps5_writeup.pdf**, complete the following sections:
  • Decreasing Plot
  • 5.2 What were the start and end years for your window? What was the slope?
  • 5.3 Considering both plots, what conclusions, if any, can you make with respect to how temperature is changing over time based on your models their goodness of fit?

6) Predicting the Future

Now, we are curious whether we can predict future temperatures based on models created from historical data.

We will divide our original 1961-2016 dataset into two sets of data: 1961-1999 will serve as our training data, and 2000-2016 will serve as our testing data. We will use the training data to generate our models, and then we will evaluate the accuracy of our models at predicting the temperatures during the testing interval.

Use the provided variables TRAINING_INTERVAL and TESTING_INTERVAL to represent these ranges of years in your code.

Part 6A: RMSE

Before we use our models to predict “future” data points (i.e. temperatures for years later than 1999), we should think about how to evaluate a model’s performance on this task. Recall that R^2 measures how closely a model matches the training data, which does not give us any indication of how well the model fits the testing data. One way to evaluate a model’s performance on test data is with the Root Mean Square Error (RMSE). This measures the deviation between a model’s predicted values and the actual values across all the data samples.

RMSE can be found as follows:

RMSE = \sqrt{\frac{\sum^n(y_{obs,i} - y_{pred,i})^2} {n} }

  • y_{pred,i} is the predicted y-value for the ith data point based on the regression model
  • y_{obs,i} is the actual y-value for the ith data point based on the raw data
  • n is the number of data points
1. Implement the function `rmse` according to its docstrings.

Your code should pass `test_rmse`.

2. Implement the function `evaluate_models_testing` according to its docstrings.

This function is very similar to `evaluate_models_training`, except that you should report the RMSE in the title rather than the R2 value. You also do not need to include SE/slope.

Part 6B: Predicting

Now that we have a method for evaluating models’ performance on test data, we are going to use our models to “predict” the future.

(i) Generate models based on training data

First, we want to generate regression models based on the training data.

**Write your code for Problem 6B under** `if __name__ == '__main__'`.
  1. Use the `Dataset` class to generate a **training set** of the **national annual average temperature** for the years in `TRAINING_INTERVAL`.
  2. Fit the training set to polynomials of degree 2 and 10 with `fit_models` and plot the results with `evaluate_models_training`.

In **ps5_writeup.pdf**, complete the following sections:
  • Training Data, Degree 2 Plot
  • Training Data, Degree 10 Plot
  • 6.1 How do these models compare to each other in terms of R^2 and fitting the data?

(ii) Predict the results

Now, let’s make some predictions, and compare our predictions to the real average temperatures from 2000-2016.

**Continue writing your code for Problem 6B under** `if __name__ == '__main__'`.
  1. Use the `Dataset` class to generate a **test set** of the **national annual average temperature** for the years in `TESTING_INTERVAL`.
  2. Evaluate the predictions of each model obtained in the previous section and plot the results with `evaluate_models_testing`. Read the docstring carefully to make sure you use the function correctly.

In **ps5_writeup.pdf**, complete the following sections:
  • Test Data, Degree 2 Plot
  • Test Data, Degree 10 Plot
  • 6.4 Which model performed the best and how do you know? If this is different from the training performance in the previous section, explain why.
  • 6.5 If we had generated the models using the data from Problem 4B (i.e. the average annual temperature of San Francisco) instead of the national annual average over the 22 cities, how would the prediction results on the national data have changed?

Hand-In Procedure

1. Save

Save your solution as ps5.py and ps5_writeup.pdf. DO NOT submit a .doc, .odt, .docx, etc.

2. Time and Collaboration Info

At the start of ps1.py, in a comment, write down the number of hours (roughly) you spent on the problems in that part, and the names of the people you collaborated with. For example:

  # Problem Set 5
  # Name: Jane Lee
  # Collaborators: John Doe
  # Time: 8 hours

3. Sanity checks

After you are done with the problem set, run your files and make sure they run without errors.

Note: Passing all of the tests in the local tester does not necessarily mean you will receive a perfect score. The staff will run additional tests on your code to check for correctness.

7) Submission

The tester file contains a subset of the tests that will be run to determine the problem set grade.

You may upload new versions of each file until the 5PM deadline, but anything uploaded after that time will be counted towards your late days, if you have any remaining. If you have no remaining late days, you will receive no credit for a late submission.

When you upload a new file with the same name, your old one will be overwritten.

Upload your code to be AUTOGRADER HERE

 No file selected

Upload your PDF file BELOW

 No file selected