Summit Talk Part 3: Fit-For-Purpose Platforms

I got the chance to present at AWS Summit in NYC on 7/12! I’ve had several people ask me what the speech was about, so I thought I’d throw together a few blog posts that walk through the talk. I’m going to break it up into three posts.

In the first post I covered the common fears that I hear from CIOs when it comes to adopting more cloud. In the second post I dug into three conceptual things you can do with your cloud transformation to address the fears that come up around security, cost, and effective transformation. In this last post, I want to talk about the high-level architecture that we’ve been putting in place with clients.

Our architecture focuses on a set of fit-for-purpose platforms.

In the previous post I talked about the importance of not seeing the cloud as a single place. That’s the problem this architecture is designed to solve. Most organizations use the cloud for a variety of applications that can’t all be served off of the same platform… but too many still think of the cloud as a single platform, often one where they just need “a landing zone”. While every company is different, this slide covers five types of platforms we have commonly seen deployed at our clients:

  • Cloud Native Accounts – These are for applications that are being rewritten entirely and will be built and deployed by “DevOps” teams that know how to manage their own infrastructure. We use a cloud account vending machine and a set of CloudFormation templates (CFTs) to provision these accounts (typically separate ones for dev, test, and prod; see the sketch after this list). Typically, no humans have access to the test and prod accounts: all deployments must come from the pipeline, and all infrastructure should be part of those deployments. This gives sophisticated teams the highest level of flexibility so that they can innovate. Before leveraging this model it is important to have quality, security, and compliance scanning as part of the pipeline, and potentially chaos engineering implemented in test or prod.
  • SAP Accounts – I used SAP in this example slide, but this really could be anything. The critical part here is that whatever is in this account is managed by an AMS vendor. For example, Kyndryl offers a Managed SAP Service and a Managed Oracle ERP Service that are completely automated, can deploy entire environments quickly, and can manage them extremely cost-effectively. These managed solutions are likely NOT built with the same tools that you use in the rest of your environment and may not even use the same kind of infrastructure and middleware. For this reason, we encourage customers to think of them as black boxes, but to put them in individual accounts where they are micro-segmented and the network traffic can be controlled. This is why they sit on top of the same account vending machine and CFT automations as the Cloud Native Accounts.
  • The remaining three platforms are traditional platforms that will not become multiple accounts (there are some exceptions here for subsidiaries or customer accounts), but are instead platforms that workloads can be hosted on. You will notice a lot more pink in these areas; that’s because centralized IT takes on much more of the responsibility and avoids the necessity of creating true “DevOps” teams. I know some of the cloud faithful are rolling their eyes at me right now… but in the enterprise there are always going to be cases where the value of transforming is not sufficient to cover the cost of transforming (for example, if you’re planning to retire an application) or where transformation is impossible (for example, a COTS application that must be hosted on specific types of servers). The platforms we see most often are:
    • Centralized Container Platform – There can be a lot of value in moving an application from running on App Server VMs to running on containers in a Kubernetes cluster (cost reductions, enforced consistency, rolling updates, increased availability). This is usually not a complete rewrite of the application and the team still has databases, load balancers, file servers, etc… that are not “cloud native”. This centralized platform gives application teams that are only partially transforming to containers a place to land.
    • Migration Platform – This is the least transformed environment. It is for application teams that want to continue to order servers out of a service catalog and get advice on them from the infrastructure team. You can almost think of it as your “datacenter in the cloud”. There will be significant efficiencies that can be gained here with cloud automation… but the user experience will remain similar to on-prem (and consequently the team can remain similar).
    • Mainframe Platform – We have many customers that still have on-premise mainframes they are looking to retire (we have lots of opinions on how/whether to do this… but that’s for another blog post). One option that we have seen customers use is to port these applications to Java. These new Java apps still require services like a console service and a shared file server to function, so we recommend standing up these support services as part of a platform to support them.
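To make the account-vending idea concrete, here is a minimal sketch of what the provisioning step might look like in Python with boto3. Everything here is illustrative: the account names, emails, and baseline template are placeholders, and a real vending machine (AWS Control Tower’s Account Factory, or a custom one like ours) adds guardrails, cross-account roles, and error handling.

```python
import time
import boto3

org = boto3.client("organizations")
cfn = boto3.client("cloudformation")  # in practice, built from a role assumed in
                                      # the NEW account (e.g., the
                                      # OrganizationAccountAccessRole)

def vend_account(name: str, email: str) -> str:
    """Create a member account and wait for provisioning to finish."""
    status_id = org.create_account(AccountName=name, Email=email)[
        "CreateAccountStatus"]["Id"]
    while True:
        status = org.describe_create_account_status(
            CreateAccountRequestId=status_id)["CreateAccountStatus"]
        if status["State"] == "SUCCEEDED":
            return status["AccountId"]
        if status["State"] == "FAILED":
            raise RuntimeError(status["FailureReason"])
        time.sleep(10)

def baseline_account(account_id: str) -> None:
    """Apply the baseline CFT (networking, roles, guardrails) to the new account."""
    with open("baseline.yaml") as f:  # placeholder baseline template
        cfn.create_stack(
            StackName=f"baseline-{account_id}",
            TemplateBody=f.read(),
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )

# One vended account per environment; no human access in test and prod.
for env in ("dev", "test", "prod"):
    account_id = vend_account(f"payments-{env}", f"aws-{env}@example.com")
    baseline_account(account_id)
```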

This is what we mean when we say the cloud isn’t “one place”. It needs to be a set of fit-for-purpose platforms that are aligned to your workloads. There’s a lot of art and a little science to selecting your platforms. It’s easy for some architects to end up with too many and rob app teams of the freedom they need, and for others to build too few and fail to give those same app teams the support they need from centralized IT. We work with organizations to set up an Agile Product Management group within the infrastructure team that can define that market segmentation and the platforms to support it… but that’s another blog post altogether.

Summit Talk Part 2: A Cloud Transformation Program That Gives Confidence

I got the chance to present at AWS Summit in NYC on 7/12! I’ve had several people ask me what the speech was about, so I thought I’d throw together a few blog posts that walk through the talk. I’m going to break it up into three posts.

In the first post I covered the common fears that I hear from CIOs when it comes to adopting more cloud. In this post I’m going to dig into three conceptual things you can do with your cloud transformation to address the fears that come up around security, cost, and effective transformation.

The cloud is not ONE place

The first point that I made is that the cloud is not ONE place. The analogy that we used was to imagine being asked, “Do you want to go on a trip?” A good trip for me and a good trip for you are likely very different. Workloads are the same way: they are very different, and you can’t put them all in the same place. This is particularly true in enterprises that have been running long enough to have brittle legacy machines and code that’s not worth refactoring. It’s popular to say that everything should be cloud native and there should never be tech debt, but in the enterprise we know we’re going to need environments that run workloads that can’t scale horizontally… maybe even some that know how to run CICS or COBOL. To make cloud transformations more successful, you must establish this up front and build fit-for-purpose platforms for each. These will address the varied architectures necessary to optimize security and cost in the cloud.

It’s not just the architecture, but the services that must be different.

The second point that I made is that this is not just about place; it’s also about the services. The analogy here is that we all need help writing English well, but we all need different kinds of help. If you’re a fluent English speaker, maybe all you need is spell check. If you’ve only been speaking English for 6 months, you may want a tutor. Similarly, some app teams will want to provision infrastructure as code from their pipelines, while others will prefer to order manually via a service catalog (maybe even with some architecture consulting). Too many companies try to make a single service management “pane of glass” and end up stifling innovation in some places while failing to prevent vulnerabilities and overspend in others.

Cloud Transformation is Never Over

Finally, I counseled the architects in the room to think of their cloud transformation as never over. The analogy here: imagine what would have happened if the automotive industry had stopped developing cars as soon as it had something that met the minimum definition. We’d still be driving cars with 30 HP that get 12 mpg.

For cloud transformation, picture a graph with the value of the capabilities you provide to app teams on the Y axis and time on the X axis. You never expect those lines to plateau, so why would you expect your cloud transformation to be “over”?

In the next post I will break down the architecture implied by “The cloud is not one place.”

Summit Talk Part 1: My CIO Doesn’t Do Enough Cloud

I got the chance to present at AWS Summit in NYC on 7/12! I’ve had several people ask me what the speech was about, so I thought I’d throw together a few blog posts that walk through the talk. I’m going to break it up into three posts:

  • Part 1: My CIO Doesn’t Do Enough Cloud
  • Part 2: A Cloud Transformation Program That Gives Confidence
  • Part 3: Fit-For-Purpose Platforms
Wondering what all of the comments are about?
That’s my team and AWS’ team trying to agree on pictures, stats, and context that made my point about reasonable cloud fears without offending AWS… who would apparently prefer we make cloud sound so easy that workloads practically fall into it.

The scariest part of this whole ordeal was that the presentation was targeted at architects! I barely get to touch a keyboard anymore, and the place was going to be swarming with hands-on folks who actually know how to make AWS do all kinds of amazing things. I decided I would help them all out by telling them WHY the CIO in their lives is always so afraid to let them use AWS for more and more interesting things.

Quick Aside: Pretty much everything I have to say is just as true of any cloud… don’t tell AWS, but I regularly have the same conversation about Azure.

The slide above is actually my second slide. In the first one, I explained that there are millions of dollars in opportunity in the cloud: places where companies could be spending less or growing more by leveraging cloud. This slide is aimed at explaining what CIOs are reading in their industry rags that makes them scared to move to the cloud more aggressively. Mostly these fears come in three categories, mapped to the stats above:

  1. There’s no way to go to the cloud without transformation… and transformations often fail. The culture and people changes are hard. If your IT team is doing well today (or your CIO is just a couple years from retirement), it may not be worth the risk of trying to undergo this transformation.
  2. Most CIOs that I talk to have a favorite story about how expensive the cloud can be. There’s a state government that, after the first of twelve planned weekend application migrations, took one look at the bill, migrated everything back, and cancelled the program. I mean, AWS is only 13% of Amazon’s revenue, but 56% of its profit! If you don’t plan for it and don’t do cloud well, it will get costly.
  3. Security is also an issue. I hate it when people say the cloud is inherently more secure (“the CIA is on the cloud!”). It is different in how it’s secured, though. You either need transformed workloads that leverage zero trust or you need to reproduce all of your on-premise perimeter security in the cloud. Either way, there’s a fair bit of work in front of you, and any mistake could make something more vulnerable than it was on premise.

Those are ALL good fears. Many cloud programs will fall victim to them. The important thing is that we structure our cloud programs to avoid them.

The Next Phase of the Cloud Revolution

Data moving to the cloud is accelerating again.
Source: Faction via Zippia

5-7 years ago it seemed obvious that most companies were going to perform huge transformations that included moving most of their applications and data to the cloud. The percentage of workloads and data in the cloud was increasing by 5-10 percent per year. However, in 2020 we started to see that slow down, before it exploded out of the gate again last year. I believe this is an indicator that we’re just entering into a robust second phase of cloud adoption driven by a fundamentally different approach to cloud.

2010-2017: Cloud Mania

I think the reason for the original slowdown in cloud adoption was workloads. The major clouds were initially constructed for greenfield development, and that’s what they attracted. Companies rewrote their e-commerce, web, and mobile applications to take advantage of what the cloud was good at (big dev teams running their own ops). They also bought SaaS platforms that made sense to replace some of the on-premise systems that had become antiquated (for example email, CRM, and HR systems). The companies (Microsoft, Salesforce, Workday) that dominated in this new world of SaaS were able to run systems that looked similar across large numbers of clients.

2018-2021: Rethinking What Workloads are Cloud Workloads

So why did the percentage of workloads in the cloud start grinding to a halt? Simply, we ran out of low-hanging fruit. Moving the company’s mainframe, ERP, or even just old workloads didn’t make sense. Lift-and-shift models provided little value and often actually increased the price of infrastructure. SaaS companies were unable to master things that SAP and others had spent decades building. Companies that were more than a few years old began to see the wisdom of a hybrid cloud model where they could use the cloud for what it was good for and keep the rest on-premise.

2022-???: New Transformation Techniques

What we’re seeing now is innovation from both the cloud providers and service providers that’s making more and more workloads good candidates for cloud. While a full list of these would be difficult, I’ll zoom in on a few that my team at Kyndryl is focusing on helping clients take advantage of:

  1. Data Platforms – Did you realize that (according to Gartner) Microsoft, Google, and AWS are now all in the “visionaries” category in “Data Science and Machine Learning”? That’s not “cloud” specific… that’s just all of data science and machine learning. They’re quickly catching up to the capabilities of on-premise focused companies like IBM, MathWorks, and Tibco. The big difference is that when you use a cloud provider for these workloads you are paying with incremental OpEx instead of a big capital investment in software and hardware. That’s making it very attractive for companies that only have a few workloads where they really want to use data science and machine learning.
  2. Mainframe Workloads – Mainframes have traditionally been viewed as one big black box that was too dangerous to move to cloud. Kyndryl has been working with companies to set up roadmaps that actually, practically get customers off of on-premise mainframes. Some workloads might get rewritten to be cloud native apps, some might get replatformed so they can run on AWS, and others might get moved to Kyndryl’s zCloud.
  3. ERP Systems in the Cloud – The ERP providers were (for obvious reasons) not the first movers into the cloud. Their customers had extremely mission-critical workloads that had been customized to the point of being very brittle. Kyndryl has partnered with SAP and is able to help customers move those workloads now. The patterns have become more hardened, and both AWS and Microsoft have programs to help clients. See more in our whitepaper here.

The Toughest CTO Decision in 2022

For an organization with more than a handful of development teams, the hardest decision in technology right now is where to draw the line between your platform teams and your software teams. As with most things in the world today, there are loud people making loud claims on both sides of the debate, but the real answer is somewhere in the middle.

In one camp there are those screaming “developers just want to code”. They recognize that every member of the product team is expensive, and they don’t want them spending hours selecting and troubleshooting infrastructure that they aren’t experts in. They also recognize the efficiencies possible if the infrastructure team creates standards.

In the other camp there is a really solid argument for self-sufficient product teams. We have all seen software product teams that know more about what infrastructure they need than the infra team trying to work with them. This is how shadow IT starts. Creating true DevSecOps teams that are responsible for everything their app needs also allows the organization to more easily invest in (or divest of) individual product teams.

How much will I sound like a consultant trying to make a few bucks if I say, “the answer is somewhere in the middle and really needs to be determined on a case by case basis”? Let me try to reward you for reading this far by breaking down a few of the things worth considering as you make this decision:

  • Are the products that your software teams are creating infrastructure dependent (e.g., low latency, GPUs, edge)? If so, lean toward creating product teams that build their own infrastructure. Avoid the temptation to create single-workload platforms.
  • Public cloud (and a willingness to commit to one cloud instead of attempting to maintain the ability to deploy any workload to any cloud) is a shortcut in this debate. It lets you build platforms that product teams can leverage simply, through IaC and GUIs, that would be cost prohibitive to create on premise.
  • Consider the use of “paved roads”: make it really easy for app teams to “just code” without taking away the ability to customize the infrastructure if required (see the sketch after this list).
  • If you’re going to try to change your organization’s focus from one side of the continuum to the other, do NOT underestimate the cultural inertia.
  • The absolute worst thing you can do is to not make a decision on this. You’ll end up with platforms that are overspecialized and dev teams that don’t know how to use them. You must pick where you want to land on this continuum and make sure both your dev and infra teams are funded and motivated accordingly.
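To make the paved-road idea concrete, here is a small Python sketch of a provisioning helper with secure defaults that teams can still override. The helper name and the specific defaults are my own invention for illustration, not a particular product:

```python
import boto3

s3 = boto3.client("s3")

def provision_bucket(name: str, *, versioning: bool = True,
                     sse_algorithm: str = "aws:kms", **overrides) -> None:
    """Paved road: a bucket with the org's defaults, overridable when needed."""
    # Teams that "just want to code" call provision_bucket("my-bucket") and get
    # encryption and versioning for free; teams with special needs pass
    # overrides instead of going around IT.
    s3.create_bucket(Bucket=name, **overrides)
    s3.put_bucket_encryption(
        Bucket=name,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault":
                       {"SSEAlgorithm": sse_algorithm}}]
        },
    )
    if versioning:
        s3.put_bucket_versioning(
            Bucket=name,
            VersioningConfiguration={"Status": "Enabled"},
        )

provision_bucket("team-a-artifacts")                            # the defaults
provision_bucket("team-b-public-site", sse_algorithm="AES256")  # customized
```

The point is the shape, not the specifics: the easy path is also the compliant path, and customization is an argument to the helper rather than an escape hatch around IT.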

Dev Teams are Blocking Infrastructure

I recently took the job of helping to establish and expand Kyndryl’s Application and Data Consulting Practice. Kyndryl is known as an infrastructure company (there’s no way around that), and I got lots of questions about why anyone would want to run the software portion of a company so focused on infrastructure. The truth is, for the first time in decades, the developers need the help!

In considering whether to take the job, I recalled a moment from a few years ago, while I was still working for a big bank with thousands of developers. I had made the switch from the software development side of the house to running the implementation of the bank’s Kubernetes clusters. I distinctly remember a morning when a young engineer and his manager came into my office with a question about a support ticket they had received. One of the application teams had entered a ticket for our monitoring team to install a performance monitoring agent on a particular container. They gave the container’s full name and included approval from their management team for us to log in to the container for the install.

The application team had the ability to modify their own Dockerfile to install the agent. Further, the fact that the application team wanted the software installed by hand meant they had missed that containers are immutable by design, and missed the value of having their container image and Dockerfile stored and versioned. I realized then that the tables had turned. For the first time since punch cards gave way to COBOL and assembler, the infrastructure teams were not holding back the development teams.

Of course it’s not universally true. The most advanced dev teams at most companies are still constantly challenging the infrastructure and security teams (even the cloud providers themselves) to provide more tools and technologies faster. However, there are a lot of software development teams that are not ready to make use of the advanced infrastructure that’s available to them. There are several reasons for this:

  • The teams aren’t trained on and don’t understand how to use infrastructure as code, horizontal scaling, asynchronous communication, and so many other things that are required for them to unlock the power of the infrastructure they’ve been given.
  • They are working with workloads that are stuck on legacy infrastructure like the mainframe.
  • The code they are working with is too ingrained in legacy development models.
  • The data is too disorganized and not secure enough to move to the cloud.
  • ERP or COTS workloads won’t allow them to leverage more advanced infrastructure.

My new team in Kyndryl is focused on helping your development team overcome these challenges and unlock the value of the infrastructure now available to you.

Machine Learning Holiday Project Part III: Using the Model to Predict Games

In this post I’ll cover creating the actual machine learning model to predict bets and how it worked for the first slate of games. If you haven’t read the background on this project, I’d point you back to the first post in this series, where I described the point of the holiday project.

Creating the Model

The first step is made miraculously easy by AWS SageMaker. I needed to run the data I described gathering and cleaning in the previous post through SageMaker Autopilot. I took a beginner course in ML at the beginning of the holidays before embarking on this project, and I learned enough to know that it would take me a year to do the data transformation, model building/testing, and model tuning that SageMaker can do in a couple of hours. I simply pointed it at the problem and let AWS try 100 different models for each of the four questions (should I make an over bet? An under bet? A bet on the home team? A bet on the away team?) with data from all of the games this season.
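For reference, kicking off an Autopilot job from code looks roughly like this with boto3; the bucket, column names, and role below are placeholders rather than my exact setup:

```python
import boto3

sm = boto3.client("sagemaker")

# One Autopilot job per question; the target column flags whether that bet
# would have won (1) or lost (0) for each historical game.
for question in ("over", "under", "home", "away"):
    sm.create_auto_ml_job(
        AutoMLJobName=f"book-e-{question}",
        InputDataConfig=[{
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://book-e-training/{question}.csv",  # placeholder
            }},
            "TargetAttributeName": f"{question}_bet_won",  # hypothetical column
        }],
        OutputDataConfig={"S3OutputPath": "s3://book-e-training/output/"},
        ProblemType="BinaryClassification",
        AutoMLJobObjective={"MetricName": "F1"},
        RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    )
```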

The winner in all four cases was an XGBoost algorithm. I’ve included both the model details and the metrics I got back above. As you can see, the F1 score for the classification got to 0.994. In a model designed to predict something so luck-intensive, this is an obscenely high score. I think it can be explained by the fact that I had to duplicate some of the data since I didn’t have enough rows to meet SageMaker’s minimums. The model almost certainly overfit to criteria that aren’t actually as predictive as you’d think. If it manages to pick 99.4% of the games, I’ll be retired soon.
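If you want to see why duplicated rows inflate scores like this, here is a small self-contained demonstration; it’s my own illustration with synthetic data, not output from the SageMaker run. Once every row exists three times, copies of most “held out” games leak into the training split, and the model can simply memorize them:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 200 "games" with random features and a random (pure luck) outcome.
games = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=[f"stat_{i}" for i in range(5)])
games["won_bet"] = rng.integers(0, 2, size=200)

def holdout_accuracy(df: pd.DataFrame) -> float:
    X, y = df.drop(columns="won_bet"), df["won_bet"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
    return RandomForestClassifier(random_state=1).fit(X_tr, y_tr).score(X_te, y_te)

print(holdout_accuracy(games))                       # roughly 0.5: pure luck
tripled = pd.concat([games] * 3, ignore_index=True)  # the minimum-row trick
print(holdout_accuracy(tripled))                     # far above 0.5: leakage
```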

Deploying and Running the Model

Based on the lack of online literature on how to actually deploy/use models in SageMaker, you’d think it would be the easiest part. I did NOT find it to be easy. It’s the kind of thing that I’m sure becomes simple once you’ve done it a few times. However, for me, on my first time creating an AI, it was anything but.

The main problem I ran into was deploying a model I could actually use later. I knew from the beginning that I was only going to use the model periodically, so I wanted to deploy it in a way where it could run cheaply. When I discovered that “Serverless Endpoints” were available I was excited! Imagine deploying my model in such a way that I’d only be charged for the ~15 invocations per week I actually need, without spinning up and shutting down instances.

I looked at the picture above labeled “Details on the Model” and noticed that it had three different containers to be provisioned. I picked the middle one, since its input/output was CSV, and created a serverless endpoint. For under bets and home games this gave me gibberish results: instead of picking 1 or 0 (bet or don’t), the model returned decimals. The other two models didn’t work at all. I tried recreating the models, redeploying the models, and looking for information on how to interpret the results, all assuming I was messing something up along the way. What I finally realized is that the three containers that make up the model aren’t “options”; they all need to work in concert (roughly: one transforms the features, one runs the XGBoost model, and one converts the model’s output back into a label). I gave up and decided to just rack up a high AWS bill and deploy the models from the “Deploy Model” button in the SageMaker Autopilot results. That finally worked. If you’re curious, I kept my code for deploying a serverless model… I still think it’s an awesome feature.
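Since I mentioned keeping my serverless deployment code, here is the general shape of it (paraphrased, with placeholder names). The key difference from a regular endpoint is a ServerlessConfig in place of instance settings. Note that for an Autopilot model you’d first register all three containers as a single SageMaker model (an inference pipeline) rather than deploying one container as I mistakenly did:

```python
import boto3

sm = boto3.client("sagemaker")

model_name = "book-e-over"  # placeholder: a model already registered in SageMaker
config_name = f"{model_name}-serverless"

# No instance type or count: the endpoint scales to zero between my ~15 weekly calls.
sm.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": 2048,
            "MaxConcurrency": 5,
        },
    }],
)
sm.create_endpoint(EndpointName=model_name, EndpointConfigName=config_name)

# Invocation is the same as any other endpoint.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=model_name,
    ContentType="text/csv",
    Body="0.5,1.2,3.4",  # placeholder feature row
)
print(response["Body"].read())
```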

I spent another few hours wrestling with formatting the input data correctly (all the same data I collected for the training set had to be gathered for the games I wanted to predict). You can find my code for formatting this data in my git repo. While the code is written in a Jupyter notebook, you’ll notice that I’m using the AWS Parameter Store to retrieve my login for my score provider (see the sketch below), that the notebook only predicts games that start in the next 30 minutes, and that it adds the bets directly to my database. I did all of this because I am going to turn it into a Lambda function later in the week so that BOOK-E can play in the league without any human intervention. More on this in another blog post.
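The Parameter Store call is small but worth showing, since the same call will work unchanged from the Lambda later (the parameter name is a placeholder):

```python
import boto3

ssm = boto3.client("ssm")

# A SecureString parameter; WithDecryption returns the plaintext value, so the
# credential never has to live in the notebook itself.
password = ssm.get_parameter(
    Name="/lthoi/mysportsfeeds/password",  # placeholder parameter name
    WithDecryption=True,
)["Parameter"]["Value"]
```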

I did get a few games with conflicting results, for example games where the models said to bet on both the home team and the away team. Whenever this happened, I just chose not to make a bet (you can see this in the Python code). I only got the model running just before the 1pm games, so I could only make one prediction (on the Titans). By the 4pm games I had the algorithm running, and BOOK-E’s picks looked like this:

[Image: BOOK-E’s picks for the 4pm games]

How Did BOOK-E Do?

Actually, pretty well. Overall he went 6-2. Though there’s an almost 15% chance that a coin flip would have done that well in only 8 games. You’ll just have to let me keep you posted.
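For the skeptics, that ~15% figure is just the binomial tail probability of 6 or more wins in 8 fair coin flips:

```python
from math import comb

# P(at least 6 heads in 8 flips) = (C(8,6) + C(8,7) + C(8,8)) / 2^8
p = sum(comb(8, k) for k in range(6, 9)) / 2**8
print(p)  # 37/256 ≈ 0.145
```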

Machine Learning Holiday Project Part II: Loading the Data

If you’ve done any reading on AI/ML, you’ve probably heard someone say that the real challenge is collecting and organizing the data. That discussion is usually about finding good data, but I can tell you that it’s also a bit tricky to get data you already have access to organized enough for ML algorithms to run. This is especially true when you’re learning Python and pandas for the first time. Since this is just a learning experience for me, I cut myself off at about 10 hours of data gathering and sorting.

The big decision I had to make before creating the data was what to make the “target” value. I could either take a direct path and ask the model to predict whether we should make a particular bet, or take an indirect path and ask the model to predict what the score of the game would be and then derive whether the bet would be smart. I chose the direct path. I will explain this further below, but I have some data that relates to the actual bets and not the game; for example, I have data on how many people in my league have made a particular bet.

Another problem with my data was that, in order to create this “MVP”, I used only 2021 data (data for previous seasons is harder for me to obtain since I delete most of it out of LTHOI at the end of the season). That means that through week 16 I only had 208 data points, and SageMaker Autopilot requires 500. To get around this, I logged each game three times (208 × 3 = 624 rows, clearing the minimum). While this trick lets me process the data, it will make features like which two teams are playing a little too predictive.
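The duplication itself is a pandas one-liner, shown here with a hypothetical dataframe; I include it mostly because the caveat matters:

```python
import pandas as pd

games = pd.read_csv("games_2021.csv")  # placeholder: 208 rows, one per game

# 208 x 3 = 624 rows clears Autopilot's 500-row minimum... but every game now
# appears three times, which is what lets the model "memorize" specific games.
games = pd.concat([games] * 3, ignore_index=True)
```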

As I write this, the AI models are still running, so I have no idea whether any of these features have proven useful. Here are the data points I’ve given the model, how I gathered them, and what I’m hoping to get from them. When the model is done running, I should be able to add information about how predictive each actually is. I have also posted the Jupyter notebook that I used to gather the data to my git repo; in the notes below I tell you where in the code each data point is gathered. At the bottom you can see the graph that AWS provides of how each field impacted the inferences, both for over bets and for bets on the home team.

  • The teams that are playing in the game; these came with my base data. When you watch shows about gambling you’ll always hear statistics like, “The Steelers have never failed to cover when more than 5-point underdogs.” I am highly skeptical that individual teams help predict outcomes of games independent of their statistics. However, with the duplicated data I expect this to end up being a key indicator.
    • Source: This comes with the base data about the games from mysportsfeeds.com
    • Section of Jupyter Notebook: 3, 4, 5
    • Actual Impact:
  • Which team is the home team. Since I have the data in front of me, I can tell you that on average home teams have won by 1.2 points this season. I assume that will play into the model in some way. Combined with the previous data point, I could also see that certain home teams have a bigger advantage than others.
    • Source: This comes with the base data about the games from mysportsfeeds.com. There is also a field indicating whether the home team is actually playing at its home field; for example, when the NFL played a game in London, there was technically a “home” team, but the venue had no allegiance to it.
    • Section of Jupyter Notebook: 3, 4, 5
    • Actual Impact:
  • The line and over/under line that I used in LTHOI.com. These are produced by the oddsmakers and are designed to make the game 50/50. The line is in terms of the home team: for example, if the home team is favored by 8 points I will have an 8, and if they are 8-point underdogs I will have a negative 8. Lines continue to shift over time, but in order to make LTHOI.com less confusing, I freeze them at midnight the night before the game. I doubt this will have much impact on the outcome, but I could imagine that bookmakers sometimes have tendencies that could be exploited.
    • Source: This is retrieved from the database of my LTHOI game. I used the boto3 SDK to access that database and pull the information.
    • Section of the Jupyter Notebook: 5
    • Actual Impact:
  • The average points scored and points against for each team. I calculate these by cycling through each team’s previous games and adding them up (see the pandas sketch after this list). There might have been some fancy data science way to get these by merging dataframes, but I’m still more of a developer than a data scientist!
    • Source: This data was pulled from the mysportsfeeds.com statistics API.
    • Section of the Jupyter Notebook: 6
    • Actual Impact:
  • The number of people in my league who made each type of bet (over, under, home team, away team). I am thinking there may be something interesting here in the wisdom of crowds. Also, if there is news or an injury that the model doesn’t otherwise capture, this will capture part of it.
    • Source: This data is available from the LTHOI table on bets. Unfortunately, I use DynamoDB with a very flat data model, so there’s a lot of expensive querying in here. If I keep using this AI model, I may have to add an index that will allow me to query this more cheaply.
    • Section of the Jupyter Notebook: 7
    • Actual Impact:
  • The final line for the game at kickoff. Since LTHOI.com freezes the line at midnight before the game starts, there are sometimes factors that cause the line to move significantly (a player is injured or sentiment shifts). Some of the people in my league like to focus on this and others like to ignore it. We’ll let the artificial intelligence decide whether it is important.
    • Source: This data is available from the ODDS feed of mysportsfeeds.
    • Section of the Jupyter Notebook: 8
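As promised above, here is roughly how the points-for/points-against averages could be computed in pandas; the column names are hypothetical stand-ins for the mysportsfeeds fields (my actual notebook just loops over each team’s games):

```python
import pandas as pd

games = pd.read_csv("games_2021.csv")  # placeholder: one row per completed game

# Reshape to one row per (team, game) so home and away stats stack together.
home = games.rename(columns={"home_team": "team", "home_score": "points_for",
                             "away_score": "points_against"})
away = games.rename(columns={"away_team": "team", "away_score": "points_for",
                             "home_score": "points_against"})
cols = ["team", "points_for", "points_against"]
team_games = pd.concat([home[cols], away[cols]], ignore_index=True)

# Season-to-date averages per team.
averages = team_games.groupby("team")[["points_for", "points_against"]].mean()
print(averages.head())
```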

After creating this data, I used a separate Jupyter notebook to create the actual training data. It’s not as exciting as choosing which data to use, but you can find it on my GitHub here. I decided to have the AI use four separate models, each making a binary choice on one type of bet. My intention is then to interpret the results and only place a bet when the models agree.
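To make that concrete, here is a minimal sketch of the four targets and the agreement rule; the column names are hypothetical stand-ins for my actual fields, using the line convention described above:

```python
import pandas as pd

games = pd.read_csv("training_base.csv")  # placeholder: features plus final scores

margin = games["home_score"] - games["away_score"]
total = games["home_score"] + games["away_score"]

# One binary target per model: would this bet have won?
# (Line is in home-team terms as described above; pushes ignored for simplicity.)
games["home_bet_won"] = (margin - games["line"] > 0).astype(int)
games["away_bet_won"] = (margin - games["line"] < 0).astype(int)
games["over_bet_won"] = (total > games["over_under"]).astype(int)
games["under_bet_won"] = (total < games["over_under"]).astype(int)

def decide(home_pred: int, away_pred: int) -> str:
    """Only act when exactly one of the two models says to bet."""
    if home_pred and not away_pred:
        return "bet home"
    if away_pred and not home_pred:
        return "bet away"
    return "no bet"  # covers the conflicting (1, 1) case as well as (0, 0)
```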

Machine Learning Holiday Project Part I: Overview

Why do a Big Data Project over the Holiday?

In 2021 the machine learning market was a little over $15B, and it is projected to increase 10x between now and 2028. It’s the fastest-growing area of technology (think mobile 10 years ago) and therefore it is top of mind for my clients. In addition, the sophisticated (read: expensive) hardware, software, and staff required to do original, on-premise machine learning are cost prohibitive for many companies. I believe that, increasingly, “access to the hardware and off-the-shelf software provided by the hyperscalers” will become one of the primary reasons clients begin or accelerate their cloud journey, right alongside “closing a datacenter”, “decreasing time-to-market”, and “increasing availability”.

I’m certainly not new to creating cloud environments to support machine learning. I have created several Kubernetes clusters and cloud environments across multiple clients with the explicit goal of supporting their AI/ML or big data efforts. In spite of that, I had little knowledge of what actually happened inside those environments. With that in mind, I decided to embark on building an AI-based “player” for the fantasy/gambling app that I already use to keep my hands-on skills sharp.

Introducing Book-E, the robot gambler.

As many of you know, I currently run an “app” that lets my friends and me keep score on our football predictions. It’s described reasonably well on the homepage (https://lthoi.com/). The TLDR version is that it allows players to choose wagers that should have even odds (they are coin flips) and then forces each of the other players in the game to take a portion of the other side of the wager. So our AI/ML “player” will have to pick which over/under and spread bets it wants to make each week. In order to have some fun with this, we’ll call our AI/ML player “Book-E”.

Book-E (assuming I can finish the project) will do a few things:

  1. Keep an up-to-date data set of all of the relevant football games and the data about them.
  2. Use machine learning to create a “model” of what kinds of bets will win.
  3. Evaluate each game just before betting closes (to have the best data) and pick which bets (if any) to make.

What tools/training am I going to use?

I’m going to have a lot to learn to complete this project! I will need to gather the data, process it into dataset(s) that can be used for machine learning, create and then serve a machine learning model, and (finally) integrate that model with my current game so that we have a new “player”.

Given my focus in 2021/2022 on AWS, I’m planning to focus on AWS technologies. I plan to leverage the AI technology in SageMaker for capturing the data and creating/serving the machine learning model. Also, since my application is AWS based (a set of Lambdas, DynamoDB tables, SQS queues, and an API Gateway), I will be adding a few Lambdas and CloudWatch triggers to make the AI player actually place “bets” and update models without the need for human intervention (a rough sketch of that wiring follows below).
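For the trigger wiring, here is a minimal boto3 sketch of what I have in mind; the function name and schedule are placeholders for whatever Book-E ends up needing:

```python
import boto3

events = boto3.client("events")
lam = boto3.client("lambda")

# Fire every Sunday at 17:30 UTC, just before bets close. (placeholder schedule)
rule = events.put_rule(
    Name="book-e-place-bets",
    ScheduleExpression="cron(30 17 ? * SUN *)",
)

# Let CloudWatch Events invoke the (hypothetical) bet-placing function...
lam.add_permission(
    FunctionName="book-e-place-bets-fn",
    StatementId="allow-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ...and point the rule at it.
fn_arn = lam.get_function(
    FunctionName="book-e-place-bets-fn")["Configuration"]["FunctionArn"]
events.put_targets(
    Rule="book-e-place-bets",
    Targets=[{"Id": "book-e", "Arn": fn_arn}],
)
```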

For aggregating the data, I am going to use Python and Jupyter notebooks as my workspace. Since I’m planning to be very AWS dependent, I’m going to use AWS SageMaker Studio as my IDE. The data will come from existing tables in my application (which I will access using the AWS SDK for Python, known as boto3) and from the company I use to provide my scores/data for the game (which I will access through the Python wrapper they provide).

For creating and serving the actual machine learning model, I plan to use AWS SageMaker. Specifically, I’m really excited about the SageMaker Autopilot functionality, which will select the best machine learning model for me without my having to be a data scientist.

This is going to require some training! At the onset of this project, I do not know much about AWS SageMaker, SageMaker Studio, Python, the AWS SDK for Python, Jupyter notebooks, or machine learning! I identified the following Udemy courses that I plan to go through:

  • AWS SageMaker Practical for Beginners | Build 6 Projects – This is my primary course. It does a great job introducing the concepts of machine learning, the different types of models, and the ways to evaluate models. Even better, it does this using AWS SageMaker and SageMaker Studio as the tools.
  • AWS – Mastering Boto3 & Lambda Functions Using Python – This course was a great way to get started with both Python in general and with Boto3 (which is the AWS SDK for Python). If you’re a bit of an idiot (like me) and jumping into this project without background in Python, let me HIGHLY recommend chapter 5, which covers a lot of what you need to know about Python generally in 58 minutes. It’s probably only a sufficient overview if you already have a decent amount of programming experience.
  • Data Manipulation in Python: A Pandas Crash Course – This course was great for an introduction to Pandas (a library in Python that’s useful for data manipulation/review) and Jupyter notebooks. While these are both touched on in the first course I mentioned above, if you’re going to actually do some of your own coding, you’ll need a more in-depth review.

Clever Idea: Serverless, Cloud Native CI/CD

If you’ve met me for more than a few minutes, you’ve heard me talk about my passion project, Leave the House Out of It (lthoi.com). If you’ve really paid attention to my blog posts, you’ve caught that a couple of years ago I rearchitected the app to an event-based, serverless architecture on AWS. After a year of not doing very much with the project, I’ve had the itch to make some upgrades (more on this next year). Before I did, I wanted to upgrade the CI/CD pipeline I use to manage the code.

While I had moved away from containers/EKS, I did keep the containerized Jenkins that had been deployed alongside my code on the EKS cluster. I got an EC2 server, installed Docker, and deployed the image there. Unfortunately, on an EC2 server Jenkins quickly became both disproportionately expensive and pretty slow. The cost was due to the inefficiency of running a dedicated Jenkins server for an app you deploy infrequently: because the app was all serverless and low volume, I was actually paying more for my Jenkins server than for all of the rest of my AWS charges combined. In spite of the cost, the performance was pretty terrible. Jenkins could no longer spread work across agents on a cluster and instead churned away on a single underpowered EC2 server, which pushed larger runs of the pipeline upwards of 9 minutes.

Over the last few weeks, I’ve taken the final step to the AWS native world and adopted CodeBuild, CodeDeploy, and CodePipeline to replace my Jenkins CI/CD pipeline. My application has 5 CloudFormation stacks (5 separate Lambda functions along with associated API gateways and DynamoDB databases) plus an S3 bucket and CloudFront distribution that host the Angular UI. I ended up with 6 separate CodeBuild projects: one to build and unit test each of the Lambdas, and one to build the UI. For the UI I took a shortcut and simply used the build project to deploy as well. For the 5 Lambdas, I wrapped the builds in a CodePipeline along with AWS CloudFormation deploy actions for each.

The only tricky part I found was that I did not want to refactor my Lambdas into an “Application”, so I could not use AWS CodeDeploy out of the box. That made it difficult to use the artifacts from AWS CodeBuild: the artifacts are stored as zip files, which means I can’t directly reference them from the CloudFormation for Lambda, which expects a direct address where it can find the .jar file (I wrote the Lambdas in Java). I got around this by having two separate levels of “deploy”. In the first one, I use an S3 “action provider” to unzip the build artifact and drop it in an S3 bucket that I can reference from the CloudFormation. The resulting code pipeline looks like this:
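In pipeline-definition terms, the unzip trick boils down to an S3 deploy action with Extract turned on, feeding a CloudFormation deploy action. Here is a rough sketch of just those two stages as they might appear in a CodePipeline definition; the stage names, buckets, roles, and paths are placeholders, and the source and build stages are omitted:

```python
# These stages would slot into a full definition, e.g.:
# codepipeline.create_pipeline(pipeline={..., "stages": [source, build, *deploy_stages]})
deploy_stages = [
    {
        # Level 1: unzip the CodeBuild artifact into a bucket CloudFormation can reference.
        "name": "UnzipArtifact",
        "actions": [{
            "name": "ExtractJar",
            "actionTypeId": {"category": "Deploy", "owner": "AWS",
                             "provider": "S3", "version": "1"},
            "inputArtifacts": [{"name": "BuildOutput"}],
            "configuration": {
                "BucketName": "lthoi-lambda-artifacts",  # placeholder bucket
                "Extract": "true",                       # unzip rather than copy the zip
            },
        }],
    },
    {
        # Level 2: update the stack; its Lambda Code property points at the bucket above.
        "name": "DeployStack",
        "actions": [{
            "name": "UpdateBetsStack",
            "actionTypeId": {"category": "Deploy", "owner": "AWS",
                             "provider": "CloudFormation", "version": "1"},
            "inputArtifacts": [{"name": "SourceOutput"}],
            "configuration": {
                "ActionMode": "CREATE_UPDATE",
                "StackName": "lthoi-bets",                  # placeholder stack
                "TemplatePath": "SourceOutput::bets.yaml",  # placeholder template
                "RoleArn": "arn:aws:iam::123456789012:role/CFDeployRole",
                "Capabilities": "CAPABILITY_IAM",
            },
        }],
    },
]
```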

The results are compelling on several fronts:

  1. I was able to shut down the EC2 instance and all the associated networking and storage services, which should save me a total of ~$50/month. It looks like in normal months I’ll be in the free tier for all of the Code* tools, so it will literally be $50/month right in my pocket. I expect all but the biggest software development shops will do better with this model than with dedicated compute for CI/CD.
  2. In my case, I also sped up the process considerably. I had been running full build-and-deploys in around 9 minutes, due to the fact that I was using one underpowered server. AWS CodeBuild runs five 2-vCPU machines for the builds and runs my deploys concurrently, which has dropped my deploy time to about 1.5 minutes. (Note: in fairness to Jenkins, I could have further optimized Jenkins to use agents to deploy the AWS stacks in parallel… I just hadn’t gotten around to it.)
  3. The integration with AWS services is pretty nifty. I can add a job to deploy a particular stack with a couple of clicks instead of carefully copying and pasting long CLI commands.
  4. In addition, this native integration makes it easier to be secure. Instead of my Jenkins server needing to authenticate to my account from the CLI, I have a role for each build job and the deploy job that I can give granular permissions to.

There are very few negatives to this solution. It does marry you to AWS, but if you have well-written code and a well-documented deployment process, it wouldn’t take you long to re-engineer it for Azure DevOps or back to Jenkins. It’s definitely going to be my way forward for future projects. Goodbye, Groovy.