MLOps Series: MLflow and Building a Tracking Server on AWS EC2 & S3

6 min read · Jul 13, 2022

Hello! Today we will introduce a tool for the model experimentation side of MLOps, that is, tracking models. The tool in question is MLflow.

Workflow for using MLflow during model development, before model deployment

What is MLflow?

MLflow is a tool for the machine learning lifecycle that makes the model development stage easier to manage and more organized. Most importantly, it makes collaborating with other data scientists on the team more convenient. MLflow is also free and open source, and it integrates with many libraries such as Scikit-Learn, XGBoost, TensorFlow/Keras, PyTorch, FastAI, and more.

Overview of which areas of MLOps MLflow can cover

As the figure above shows, MLflow covers almost every part of MLOps, but this article focuses on MLflow Tracking for model experimentation.

MLflow Tracking

MLflow Tracking provides a tracking server and UI that collect and display the training results of each experiment, so you can see which parameters give the best results without recording everything by hand in Excel.

The UI showing runs, parameters, metrics, and more

1. Concept

Each time an experiment is run, MLflow logs a number of values, some of which the user can set explicitly:

Code Version : Git commit hash, i.e. the project version if you use Git (requires MLflow Projects)
Start & End Date : start and end time of the run
Source : name of the file that was run
*Parameter* : Key-Value (string), mainly used for model parameters such as max_depth and n_estimators, and for run settings such as batch size, epochs, and image size
*Metrics* : Key-Value (numeric), recorded as a time series over the run; typically loss, accuracy, and precision, which can be plotted to see how they change
*Artifacts* : files of any type, such as models (.h5, .pth, .pickle), images (.jpeg, .png), and datasets (.csv, .parquet)
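To make the three user-settable kinds of values concrete, here is a minimal sketch of logging a parameter, a metric series, and an artifact with the MLflow Python API. The loss values come from a made-up helper (fake_training_losses is hypothetical and stands in for a real training loop), and the parameter values and file name are illustrative assumptions only.

```python
import tempfile
from pathlib import Path


def fake_training_losses(epochs):
    """Stand-in for a real training loop: one made-up loss value per epoch."""
    return [1.0 / (epoch + 1) for epoch in range(epochs)]


def log_example_run(epochs=3):
    """Log parameters, a metric time series, and one artifact to MLflow."""
    import mlflow  # imported lazily so the helper above runs without mlflow

    with mlflow.start_run():
        # Parameters: key-value pairs describing the run's configuration.
        mlflow.log_param("max_depth", 5)
        mlflow.log_param("epochs", epochs)

        # Metrics: numeric values; `step` turns them into a plottable series.
        for step, loss in enumerate(fake_training_losses(epochs)):
            mlflow.log_metric("loss", loss, step=step)

        # Artifacts: arbitrary files, here a small text file.
        with tempfile.TemporaryDirectory() as tmp_dir:
            notes = Path(tmp_dir) / "notes.txt"
            notes.write_text("example artifact\n")
            mlflow.log_artifact(str(notes))
```

Calling log_example_run() with no tracking URI configured should write the run to a local mlruns directory, which is the setup used in Scenario 1.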

2. Configuring MLflow

This section covers how to configure each part of MLflow. For the tracking server there are three main things to set up:

Backend Store : Local Filesystem, SQLAlchemy compatible DB (e.g. SQLite, PostgreSQL, MySQL)
Artifact Store : Local Filesystem, Remote (e.g. S3, Google Cloud Storage)
Tracking Server: Localhost, Remote
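One way to picture how these three settings fit together is as the arguments of a single mlflow server command. The sketch below is only an illustration: TrackingServerConfig is a hypothetical helper, not part of MLflow, and it simply assembles the command line from the backend store, artifact store, and host/port.

```python
from dataclasses import dataclass


@dataclass
class TrackingServerConfig:
    """Hypothetical container for the three things a tracking server needs."""

    backend_store_uri: str   # where runs, params, and metrics are recorded
    artifact_root: str       # where files such as models are stored
    host: str = "127.0.0.1"  # where the server listens
    port: int = 5000

    def to_command(self):
        """Return the equivalent `mlflow server` command as an argument list."""
        return [
            "mlflow", "server",
            "--backend-store-uri", self.backend_store_uri,
            "--default-artifact-root", self.artifact_root,
            "--host", self.host,
            "--port", str(self.port),
        ]


local = TrackingServerConfig("sqlite:///mydb.db", "./artifact-store")
print(" ".join(local.to_command()))
```

Swapping the two URIs for a PostgreSQL URI and an s3:// path, and the host for 0.0.0.0, gives the shape of the remote setup described in Scenario 3.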

3. Scenario

Let's look at the different ways an MLflow tracking server can be set up, so that you can pick the one that fits your needs.

3.1 Scenario 1: MLflow on localhost

Backend Store : Local Filesystem (Left Image), SQL DB (Right Image)
Artifact Store : Local Filesystem
Tracking Server : Localhost
Use Case: A single data scientist participating in an ML competition.

MLflow on the localhost (right) using SQL DB to store MLflow entities

This first scenario is typical when working alone on Kaggle or other competitions, where we want a tracking server up quickly. A single command is enough, with almost no configuration.

Using local filesystem (mlruns directory) as backend storage

# install the mlflow package via pip first
pip install mlflow
mlflow ui --backend-store-uri ./mlruns/

Using SQLite as backend storage

mlflow ui --backend-store-uri sqlite:///mydb.db

You can open the tracking UI directly from the link printed in the output.

Output of the mlflow ui command, showing the URL

3.2 Scenario 2: MLflow in Local Tracking Server

Backend Store : Local Filesystem / SQL DB
Artifact Store : Local Filesystem
Tracking Server : Local Tracking Server
Use Case: A cross-functional team with one data scientist working on an ML model.

MLflow on a local tracking server, using the local filesystem rather than a SQL DB to store entities

In this second scenario we run the tracking server on our own machine. Unlike the first scenario, this one runs as a production server, and results can be viewed in real time.

mlflow server --backend-store-uri sqlite:///mydb.db --default-artifact-root ./artifact-store
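Before pointing training scripts at the server, it can help to confirm that it is actually answering HTTP requests. The following is a small sketch using only the standard library; tracking_server_is_up is a hypothetical helper, and the address http://127.0.0.1:5000 assumes the default host and port.

```python
import urllib.request


def tracking_server_is_up(base_url, timeout=3.0):
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers strings that are not valid URLs.
        return False
```

With the server from the command above running, tracking_server_is_up("http://127.0.0.1:5000") should return True, and a script can then call mlflow.set_tracking_uri with the same address.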

3.3 Scenario 3: MLflow in Remote Server

Backend Store : SQL DB
Artifact Store : S3 Bucket
Use Case: Multiple data scientists working on multiple ML models.

MLflow on remote tracking server with SQL backend and S3 for storing artifacts.

This last scenario is the main topic of this article. It is common in companies whose teams have several data scientists working on multiple projects and models. The trade-off is that setup involves more steps and is somewhat fiddly, but you can follow the steps below.

Prerequisites

First, create an AWS account at http://aws.amazon.com

Next, note down your Access Key and Secret Key. In Security Credentials -> Users, create a new user with admin permissions and copy both keys somewhere safe. Then open your terminal and run:

# Install the AWS command line interface
pip install awscli
# Configure AWS credentials
aws configure

1. Create EC2 VM for hosting Tracking Server

Create a cloud VM by going to the EC2 dashboard and launching a new instance.

(Left) Search for EC2 dashboard (Right) EC2 dashboard

Choose the OS and instance specs you want for the VM.

(Left) Select Amazon Linux as VM OS (Right) Select t2.micro as instance type

Create a key pair and keep it on your local machine so that you can SSH into the instance.

Creating key pair to connect the instance securely

Click Edit on the right side of Network Settings to configure the network security group. Give the security group a name (it will be used again in step 3), then add a new security group rule for port 5000 so the tracking server's web page can be reached.

Adding new security group rule TCP port 5000 for mlflow ui

Finally, launch the instance, and step 1 is done.

2. Create S3 Bucket service

Go to the S3 dashboard and click Create Bucket.

(Left) Search for S3 dashboard (Right) S3 bucket dashboard

The only thing you need to set is the bucket name, which must not clash with any other existing bucket name. Then click Create Bucket.
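Uniqueness can only be checked by AWS itself, but a candidate name can be screened locally against the basic S3 naming rules first. This sketch covers only a subset of those rules (length of 3 to 63, lowercase letters, digits, dots and hyphens, alphanumeric first and last character, no adjacent dots); looks_like_valid_bucket_name is a hypothetical helper and the example names are made up.

```python
import re

# 3-63 characters, lowercase letters/digits/dots/hyphens,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Check a subset of the S3 bucket naming rules (not global uniqueness)."""
    return bool(_BUCKET_NAME.fullmatch(name)) and ".." not in name


for candidate in ["my-mlflow-artifacts-2022", "My_Bucket", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by S3 if someone else already owns it, so the console remains the final authority.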

3. Create RDS for MLflow backend store

Create a database by going to the RDS (Relational Database Service) dashboard.

(Left) Search for RDS dashboard (Right) RDS dashboard

Choose PostgreSQL as the database engine and select the Free tier template.

(Left) SQL DB Engine (Right) Available Template

Set a Master username and tick Auto generate a password to get a user for accessing the DB. Scroll down, set an Initial database name, and remember it for later.

Once the database has been created, you can view the generated password from the credentials prompt at the top of the page.

(Left) Credential Settings (Right) Initial Database Name

Once the database has finished creating, we need to edit its security settings so that the EC2 instance we created can reach it. Go to the dashboard of the DB you just created, click the entry under VPC security groups, then Edit inbound rules.

Then click Add rule, set Type to PostgreSQL, and select as Source the security group name created during the EC2 step.

(Left) Created DB dashboard showing Endpoint, Port, Security (Right) Adding created EC2 security group for accessibility
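A quick way to confirm that the inbound rule works is to try opening a TCP connection to the database port from the EC2 instance. The sketch below uses only the standard library; can_connect is a hypothetical helper, and DB_ENDPOINT is a placeholder for your own RDS endpoint (5432 is the default PostgreSQL port).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Running can_connect("DB_ENDPOINT", 5432) on the instance should return True once the security group rule is in place. A False result that only appears after the timeout usually points at a missing or mismatched inbound rule rather than at the database itself.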

4. Configure the MLflow Server on EC2

Now that everything is set up (the EC2 VM, the S3 bucket, and RDS), we are finally ready to create the MLflow tracking server. You should have the following noted down:

Access Key ID, Secret Access Key (from the user's credentials)
S3 Bucket Name (step 2)
RDS DB Master Username (step 3)
RDS DB Master Password (step 3)
RDS Initial Database Name (step 3)
RDS Endpoint (step 3, available once the database has been created)

Once you have confirmed that everything is noted down, go to the EC2 instance you created to set up the tracking server.

(Left) Created EC2 instance dashboard (Right) Connect into instance

After clicking Connect, you get a terminal on the instance. Run the following commands:

sudo yum update
pip3 install mlflow boto3 psycopg2-binary
# you'll need to input your AWS credentials here
aws configure

Then run aws s3 ls to check that the credentials you entered are correct.

The command should list all available S3 buckets

The command to start the tracking server is:

mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME

The placeholders to replace are:

DB_USER : RDS DB Master Username
DB_PASSWORD : RDS DB Master Password
DB_ENDPOINT : RDS Endpoint
DB_NAME : RDS Initial Database Name
S3_BUCKET_NAME : S3 Bucket Name
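An auto-generated password may contain characters that have a special meaning inside a URL, and pasting it unescaped into the backend store URI can then break the connection string. One possible way to guard against that is to percent-encode the user name and password before assembling the URI, as in this sketch. build_backend_store_uri is a hypothetical helper, and the sample values are made up.

```python
from urllib.parse import quote


def build_backend_store_uri(user, password, endpoint, db_name, port=5432):
    """Assemble a PostgreSQL URI with the credentials percent-encoded."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{endpoint}:{port}/{db_name}"
    )


uri = build_backend_store_uri(
    user="mlflow_user",            # DB_USER (made-up value)
    password="p@ss/w:rd",          # DB_PASSWORD (made-up value)
    endpoint="example.rds.amazonaws.com",  # DB_ENDPOINT (made-up value)
    db_name="mlflow_db",           # DB_NAME (made-up value)
)
print(uri)
```

The resulting string can be passed to --backend-store-uri in place of the hand-written one. Quote it in the shell as well, since some characters are also special there.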

If all values are correct, you can open the web UI at:

http://<EC2_PUBLIC_DNS>:5000 (you can find the instance's public DNS by checking the details of your instance in the EC2 Console).

EC2 Public DNS for accessing the MLflow Tracking Server

If everything is correct, the web page opens and looks like this:

Try running this script with your own EC2 public DNS filled in (requires the tensorflow and mlflow packages, installed via pip).

Experiment tracking using MLflow keras autolog
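The original script is not reproduced in this text, so the following is only a sketch of what such a script might look like, not the author's code. It assumes mlflow.tensorflow.autolog() for the Keras autologging, trains a tiny model on random data, and uses a made-up experiment name. The machine running it is assumed to have AWS credentials configured (as in the Prerequisites) and boto3 installed, since in this setup the client typically uploads artifacts to S3 itself.

```python
def tracking_uri_for(public_dns, port=5000):
    """Build the tracking server URL from the EC2 public DNS name."""
    return f"http://{public_dns}:{port}"


def train_and_log(public_dns):
    """Train a small Keras model and let MLflow autolog the run."""
    import mlflow
    import numpy as np
    import tensorflow as tf

    mlflow.set_tracking_uri(tracking_uri_for(public_dns))
    mlflow.set_experiment("keras-autolog-demo")  # made-up experiment name
    mlflow.tensorflow.autolog()  # logs params, metrics, and the model

    # Random toy data, only so that the example has something to fit.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(256, 4)).astype("float32")
    labels = (features.sum(axis=1) > 0).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    with mlflow.start_run():
        model.fit(features, labels, epochs=3, batch_size=32, verbose=0)
```

To try it, call train_and_log with your own instance's public DNS name as the argument.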

The web UI will show a new experiment, and the S3 bucket you created will contain the saved model files.

(Left) MLflow web after running an experiment (Right) S3 Bucket storing the model

That wraps up the first article in this MLOps series. As you can see, using MLflow for model experimentation makes our work much more convenient, and it is worth making it a standard part of the workflow. I have only recently started working on MLOps in earnest myself, and more articles on the topic will follow. Thank you for reading : )