Reading and pre-processing the datasets
The aim is to analyse the relationship between worldwide earthquakes, tsunamis and tectonic plate boundaries. This aim will be met by completing the following objectives:
- Mapping all the affected areas,
- Counting earthquake occurrences across different magnitude ranges,
- Assessing the severity of earthquakes,
- Mapping the most affected areas based on magnitude,
- Finding the month with the highest number of earthquake occurrences,
- Finding the year with the highest number of earthquake occurrences,
- Visualizing earthquakes and tsunamis.
The seismic analysis is divided into two parts: (1) reading and pre-processing the datasets, and (2) visualization techniques and damage grade prediction. This post covers the first part.
1. Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.basemap import Basemap
import folium
from folium import plugins
import datetime
import plotly.express as px
import pandas_profiling
2. Importing datasets
Datasets:
- Earthquake data (1965–2016) downloaded from https://www.kaggle.com/usgs/earthquake-database
- Tectonic plates data downloaded from https://www.kaggle.com/cwthompson/tectonic-plate-boundaries
- Tsunami data (2000–2017) downloaded from https://www.kaggle.com/noaa/seismic-waves
Earthquakes Data
Reading the comma-separated values (CSV) file into a DataFrame,
earthquakes = pd.read_csv('database.csv')
and visualizing the data with the plotly.express library:
fig = px.density_mapbox(earthquakes, lat='Latitude', lon='Longitude', z='Magnitude', radius=5,
                        center=dict(lat=0, lon=180), zoom=0,
                        mapbox_style="stamen-terrain", title='Earthquakes around the world')
fig.show()
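The pandas_profiling library imported above is not used anywhere else in this part; a minimal sketch of how it could provide a quick overview of the earthquake table (the report title and output file name below are illustrative) is:
# Quick exploratory report of the earthquakes DataFrame using pandas-profiling.
# The exact options available depend on the installed pandas-profiling version.
profile = pandas_profiling.ProfileReport(earthquakes, title="Earthquakes dataset overview")
profile.to_file("earthquakes_profile.html")  # writes a standalone HTML report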
Tectonic plates data
Reading data from CSV,
tec_plates = pd.read_csv('Tectonicplates.csv')
creating a visualization of the tectonic plates using the Basemap library:
fig = plt.figure(figsize=(14, 10), edgecolor='w')
m = Basemap(projection='cyl', resolution='c',
            llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180)
m.scatter(tec_plates['lon'], tec_plates['lat'], s=4, color='green')
m.drawcountries(color='gray', linewidth=1)
m.shadedrelief()
plt.title("View of tectonic plates")
plt.show()
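Since folium is also imported but not used in this part, an interactive version of the same plate-boundary map could be sketched as follows; this is an optional alternative rather than part of the original workflow, and plotting every boundary point as a separate marker may be slow for large files:
# Optional interactive alternative with folium: one small circle per boundary point.
plate_map = folium.Map(location=[0, 0], zoom_start=2)
for lat, lon in zip(tec_plates['lat'], tec_plates['lon']):
    folium.CircleMarker(location=[lat, lon], radius=1, color='green').add_to(plate_map)
plate_map  # in a notebook, displaying the object renders the interactive map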
Tsunami data
Reading tsunami data,
tsunami = pd.read_csv('sources.csv')
importing the second table (waves) from the same source and selecting two columns for the analysis,
waves = pd.read_csv('waves.csv')
waves = waves[['SOURCE_ID', 'DISTANCE_FROM_SOURCE']]
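The two wave columns are presumably kept so that they can later be joined back to the source table; a minimal sketch of such a join on SOURCE_ID (an assumption about how the columns will be used, not a step from the original analysis) is:
# Hypothetical left join attaching wave distances to their tsunami sources.
tsunami_waves = tsunami.merge(waves, on='SOURCE_ID', how='left')
tsunami_waves.head()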
3. Preprocessing data for the analysis
Cleaning earthquake data
Selecting columns from the data frame and
earthquakes = earthquakes[['Date', 'Time', 'Latitude', 'Longitude', 'Depth', 'Magnitude', 'Type']]
checking lengths of dates to see if there are any differences.
lengths = earthquakes["Date"].str.len()
lengths.value_counts()
As we can see, the data frame contains three rows with malformed dates: their Date strings are 24 characters long (full timestamps) instead of the expected 10-character MM/DD/YYYY format.
wrongdates_index = np.where(lengths == 24)[0]
print(wrongdates_index)
Row indices with wrong dates: [ 3378 7512 20650].
earthquakes.loc[wrongdates_index]
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes.loc[3378, "Time"] = "02:58:41"
earthquakes.loc[7512, "Time"] = "02:53:41"
earthquakes.loc[20650, "Time"] = "02:23:34"
All the wrong dates have been corrected.
lengths = earthquakes["Date"].str.len()
lengths.value_counts()
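The same corrections could also be made without hard-coding the replacement values; a minimal sketch, assuming the 24-character entries are full ISO timestamps (e.g. 1975-02-23T02:58:41.000Z), is:
# Parse the long ISO entries and re-split them into the Date/Time format used elsewhere.
mask = earthquakes["Date"].str.len() == 24
parsed = pd.to_datetime(earthquakes.loc[mask, "Date"])
earthquakes.loc[mask, "Date"] = parsed.dt.strftime("%m/%d/%Y")
earthquakes.loc[mask, "Time"] = parsed.dt.strftime("%H:%M:%S")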
Creating a DateTime column from the date and time columns,
earthquakes['Datetime'] = earthquakes['Date'] + ' ' + earthquakes['Time']
earthquakes['Datetime'] = pd.to_datetime(earthquakes['Datetime'])
earthquakes.head()
extracting year and month names from DateTime columns,
earthquakes['Year'] = earthquakes['Datetime'].dt.year
earthquakes['Month'] = earthquakes['Datetime'].dt.month_name()
earthquakes.head()
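These new columns feed directly into the objectives about the busiest month and year; a quick preview of how the counts could be obtained (the full analysis is left for the second part) is:
# Count earthquakes per year and per month using the newly created columns.
per_year = earthquakes['Year'].value_counts().sort_index()
per_month = earthquakes['Month'].value_counts()
print(per_year.idxmax(), per_month.idxmax())  # year and month with the most recorded events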
Next, selecting columns from the data frame and checking whether there are any NaN values.
earthquakes = earthquakes[['Datetime', 'Latitude', 'Longitude', 'Depth', 'Magnitude', 'Type', 'Year','Month', 'Date']]
earthquakes.head()
check_nan_in_df = earthquakes.isnull().any()
print(check_nan_in_df)
Cleaning tsunami data
The first step is to reset the index and select the columns needed from the data frame.
tsunami.reset_index(drop=True, inplace=True)
tsunami = tsunami[['SOURCE_ID','YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'FOCAL_DEPTH', 'PRIMARY_MAGNITUDE','LATITUDE', 'LONGITUDE', 'COUNTRY', 'CAUSE']]
After that, dropping rows with NaN values in the date and time columns (the trailing dropna() also removes any remaining rows with NaNs in other columns),
tsunami = tsunami.dropna(subset=['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']).dropna()
converting columns to integers,
tsunami.MONTH = tsunami.MONTH.astype(int)
tsunami.DAY = tsunami.DAY.astype(int)
tsunami.HOUR = tsunami.HOUR.astype(int)
tsunami.MINUTE = tsunami.MINUTE.astype(int)
tsunami.head()
creating a new column called CAUSE_NAME with the categorized cause values,
causes = {0:'Unknown',
1:'Earthquake',
2:'Questionable Earthquake',
3:'Earthquake and Landslide',
4:'Volcano and Earthquake',
5:'Volcano, Earthquake, and Landslide',
6:'Volcano',
7:'Volcano and Landslide',
8:'Landslide',
9:'Meteorological',
10:'Explosion',
11:'Astronomical Tide'}
tsunami['CAUSE_NAME'] = tsunami['CAUSE'].map(causes)
tsunami.head()
Source of categories: Historical Tsunami Database (National Centers for Environmental Information).
picking data only for earthquake causes,
tsunami_type = tsunami[tsunami['CAUSE_NAME'] == 'Earthquake'].copy()  # .copy() avoids SettingWithCopyWarning when adding the DATE column below
checking if there are any NaNs,
check_nan_in_df2 = tsunami_type.isnull().any()
print(check_nan_in_df2)
and finally creating a date column by combining the month, day, and year columns.
cols=["MONTH","DAY","YEAR"]
tsunami_type['DATE'] = tsunami_type[cols].apply(lambda x: '/'.join(x.values.astype(str)), axis="columns")
tsunami_type = tsunami_type[['SOURCE_ID', 'DATE', 'PRIMARY_MAGNITUDE', 'COUNTRY', 'LATITUDE', 'LONGITUDE' ]]
tsunami_type.head()
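As a side note, a true datetime column could also be built for the tsunami events instead of the slash-joined string, since pandas can assemble timestamps from year/month/day/hour/minute columns; a minimal sketch using the earlier tsunami frame (which still holds those columns) is:
# Assemble a datetime from the component columns; pandas expects lower-case names.
ts_components = tsunami[['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']].rename(columns=str.lower)
tsunami['DATETIME'] = pd.to_datetime(ts_components)
tsunami[['SOURCE_ID', 'DATETIME']].head()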
The second part, i.e. visualization techniques and damage grade prediction, will be published shortly. :)