By leveraging the power of Trino, modern languages like Typescript and Starburst’s managed services, this project aims to enable Software Developers to facilitate data handling and use these integrated technologies as a dependable backend system for data driven applications. The project integrates open-source rideshare data into Starburst Galaxy, creating a web application dashboard designed to provide insights into rideshare and traffic data across major metropolitan areas. This dashboard inspired application is designed to highlight how variability—such as borough location and weather affects rideshare and traffic data points. By leveraging Trino as a powerful querying engine and integrating with Starburst Galaxy, the project showcases how modern JavaScript/TypeScript frameworks and libraries can be used to derive actionable insights from complex data sets.
- JavaScript and TypeScript: The project is primarily built using JavaScript and TypeScript, showcasing their effectiveness in developing modern, scalable web applications.
- Node.js: Utilized for server-side logic, api handling, and query management, enabling an efficient backend infrastructure.
- React: Used for building interactive and dynamic user interfaces of the dashboard, making use of its component-based architecture for a seamless user experience
- trino-js-client: For connecting to the Starburst Galaxy server and executing queries within Node.js server-side logic. Provides us the power to federate data across multiple sources to power our application
- Starburst Galaxy: Accessing data configured within Galaxy Catalogs across multiple locations
The project uses the following Kaggle datasets:
Follow these steps to replicate the project on your local machine:
-
Download Rideshare Data:
- Visit the NYC Rideshare Analysis Data page on Kaggle.
- Download the dataset files to your local machine.
-
Download Weather Data:
- Visit the NYC Weather Data page on Kaggle.
- Download the dataset files to your local machine.
-
Upload to AWS S3:
- Log in to your AWS account.
- Navigate to the S3 service.
- Create a new bucket or use an existing bucket.
- Upload the downloaded datasets to the S3 bucket.
-
Upload to Snowflake (optional):
- Log in to your Snowflake account.
- Navigate to the "Data" section.
- Create a new database or use an existing one.
- Use the Snowflake Web Interface or SnowSQL to load the datasets into Snowflake tables.
-
Discovering Schemas:
- Log in to your Starburst Galaxy account.
- Navigate to the Data section.
- Use the schema discovery feature to detect and create schemas for your datasets in the AWS S3 or Snowflake storage.
-
Creating Bronze Layer Tables:
- Create bronze layer tables using the discovered schemas. These tables will hold the raw data as it was uploaded.
-
Creating Silver Layer Tables:
- Create silver layer tables to refine and organize the data for analysis. Use the following query to create the
nyc_rideshare_fare_analysis
table:
CREATE TABLE nyc_rideshare_fare_analysis AS SELECT date, hour_of_day, day_of_week(date) AS day_of_week, CEIL(passenger_fare) AS passenger_fare, CEIL(driver_total_pay) AS driver_total_pay, passenger_fare - (driver_total_pay + rideshare_profit) AS tip_amount, trip_length, total_ride_time, business, pickup_location, dropoff_location FROM bronze_schema.rideshare_data WHERE YEAR(date) = 2022;
- Create silver layer tables to refine and organize the data for analysis. Use the following query to create the
- Open your terminal or command prompt.
- Run the following command to clone the repository:
git clone https://github.com/emilysunaryo/trino-js-demo.git cd demo_galaxy_app
- Ensure you have Node.js and npm installed on your machine. You can download them from nodejs.org.
- Install the project dependencies by running:
npm install
- Create a
.env
file in the root directory of the project. - Add the following environment variables to the
.env
file, replacing the placeholder values with your actual credentials and settings:TRINO_SERVER_URL=your-galaxy-server-url TRINO_USER=your-username TRINO_PASSWORD=your-password S3_BUCKET_NAME=your-s3-bucket-name SNOWFLAKE_ACCOUNT=your-snowflake-account SNOWFLAKE_USER=your-snowflake-username SNOWFLAKE_PASSWORD=your-snowflake-password
- Start the application by cd'ing into the
node_server
directory to initialize the trino-js library:node trino_client_server
- Start the application by cd'ing into the
demo_react_frontend
directory running:npm start
- Open your web browser and navigate to the local server address provided in the terminal to view the application.
Here are some example queries used in the project to fetch and analyze data:
const trino = require('trino-js-client');
// Trino Client connected on Galaxy Server
const auth = new BasicAuth(process.env.GALAXY_USERNAME, process.env.GALAXY_PASSWORD);
const sslOptions = {
rejectUnauthorized: true
};
const trino = Trino.create({
server: INCLUDE YOUR OWN GALAXY SERVER HERE,
catalog: 'nyc_uber_rides',
schema: '',
auth: auth,
ssl: sslOptions
});
Provided Test Query used in Project
const testQuery = 'SELECT
d.borough as borough,
round(avg(driver_total_pay), 2) as avg_driver_pay,
count(*) as ride_requests
FROM
silver_schema.nyc_rideshare_fare_analysis r
JOIN
taxi_zone_lookup.taxi_zones.zone_lookup d ON r.dropoff_location = d.location_id
WHERE
d.borough != 'Unknown'
GROUP BY
d.borough
ORDER BY
ride_requests'
Query functions to access data through Trino-js Client
const executeQuery = async (query) => {
const iter = await trino.query(query);
const data = await iter
.map(results => results.data ?? [])
.fold([], (row, acc) => [...acc, ...row]);
console.log("testing return of query on server:", data )
return data;
}
executeQuery(testQuery);